CN105701013A

CN105701013A - Software defect data feature selection method based on mutual information

Info

Publication number: CN105701013A
Application number: CN201610004279.XA
Authority: CN
Inventors: 李克文; 邹晶杰
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2016-01-04
Filing date: 2016-01-04
Publication date: 2016-06-22

Abstract

The present invention belongs to the field of software engineering, and specifically relates to a software defect data feature selection method based on mutual information. The method comprises the steps of: A. obtaining software module data and software feature data from a software data set, and preprocessing the data, which comprises label processing and feature classification; B. screening out the features most related to the class by using mutual information theory, and then calculating the correlation between the features to remove redundant software features, during which an imbalance coefficient is introduced to take the imbalance of the software data into account; and C. establishing a classification model according to the obtained optimal feature subset, classifying software modules, and verifying the effectiveness of the selected features. According to the method provided by the present invention, the redundant software features are removed by using mutual information theory, so that the calculation speed can be greatly increased, and mutual information is modified to account for the imbalance of the software defect data, so as to improve the classification accuracy of minority classes.

Description

Software defect data feature selection method based on mutual information

Technical field

The invention belongs to the field of software engineering, and specifically relates to a software defect data feature selection method based on mutual information.

Background art

The scale and logical complexity of software systems are growing steadily, and the number of defective modules in software grows with them. This threatens software reliability, degrades software quality, and can cause immeasurable losses. Software defect prediction is an important way to guide and evaluate software testing, but high-dimensional software features increase the time and space complexity of classifying software modules and hinder improvements in classification accuracy.

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing an optimal feature subset from the full set of features so that the resulting model performs better. In practice the number of features is often large; some features may be irrelevant and some may depend on one another, which has two consequences. First, the more features there are, the longer it takes to analyze them and train the model. Second, more features readily lead to the "curse of dimensionality": the model becomes more complex and its generalization ability declines. For example, in 2004 NASA released the NASA MDP software data sets, whose software features, extracted from source code, fall into three main classes: LOC, McCabe, and Halstead. Within each class, only the basic features are extracted directly from the source code; the others are computed indirectly from those basic values. Experiments have also shown that only three important software features are needed to predict whether a software module contains defects. Each class of software features therefore contains considerable redundancy. Feature selection removes irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time.

Feature selection methods based on mutual information perform well because they do not rely on a classifier's results to judge the quality of the selected features, so they are fast to compute. However, most existing mutual-information-based methods operate on balanced data sets; on the imbalanced data sets that are common in practice, they do not work well. The present invention therefore takes software defect data as its object of study and proposes a mutual-information-based feature selection method for imbalanced data.

Summary of the invention

The object of the invention is to overcome the imbalance present in software defect data by providing a software defect data feature selection method based on mutual information that selects an optimal feature subset.

To achieve this object, the technical solution of the invention mainly comprises the following three steps:

A. Obtain data from the software data set and preprocess it

(1) Collect the software data, comprising software feature data and software module data, and preprocess it. Divide the software module data into a training set and a test set. The invention uses ten-fold cross-validation: the data set is divided into ten parts, nine of which are used for training and one for accuracy testing.

(2) Classify the feature set according to existing empirical knowledge to obtain three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively).

(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using formula (1).

Formula (1)
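The ten-fold split described in step A(1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the function name `ten_fold_splits`, the fixed shuffling seed, and the round-robin assignment of modules to folds are all hypothetical choices. The imbalance coefficient p is deliberately not computed here, because formula (1) is not reproduced in this text.

```python
import random


def ten_fold_splits(n_modules, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    indices = list(range(n_modules))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    # Fold j takes every n_folds-th shuffled index, starting at position j.
    folds = [indices[j::n_folds] for j in range(n_folds)]
    for j in range(n_folds):
        test_idx = folds[j]
        train_idx = [i for k, fold in enumerate(folds) if k != j for i in fold]
        yield train_idx, test_idx


# Example: 50 software modules give ten rounds of 45 training / 5 test modules.
splits = list(ten_fold_splits(50))
```

Each round trains on nine parts and tests on the remaining part, so every module is used for accuracy testing exactly once.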

B. Remove redundant software features using mutual information theory

(1) Taking the imbalance of the data set fully into account, introduce the imbalance coefficient and calculate, according to formula (2), the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y₁ and y₂. Sort the features in descending order of relevance and keep only the top 70% by relevance rank in the three feature sets; at this point t features remain in total.

μ(f_i) = log p × MI(f_i, y₁) + MI(f_i, y₂)    Formula (2)

Wherein,

MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}, \quad MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}

In the formulas, p(f_i) denotes the total probability that feature f_i occurs across the two classes, and p(f_i|y₁) and p(f_i|y₂) denote the probabilities that feature f_i occurs in each of the two classes respectively.

(2) Again using mutual information theory, calculate the pairwise correlation between the features within each of the three screened feature sets to obtain the feature correlation matrices. According to the minimal-redundancy criterion, formula (3), remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that the size of S is k.

f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)

Formula (3)
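The relevance ranking of step B(1), formula (2), can be sketched as below. Several points are assumptions rather than part of the source: each feature is treated as a binary "occurs / does not occur" indicator per module, probabilities are estimated with add-one smoothing so that the logarithm is always defined, y₁ is taken to be the +1 class, and the imbalance coefficient is passed in as the parameter `p_imbalance` because formula (1) is not reproduced in this text. The helper names are hypothetical.

```python
import math


def smoothed(count, total):
    """Add-one smoothed probability estimate (an assumption, to avoid log(0))."""
    return (count + 1) / (total + 2)


def relevance(feature_on, labels, p_imbalance):
    """mu(f) = log(p) * MI(f, y1) + MI(f, y2), with MI(f, y) = log(p(f|y) / p(f))."""
    p_f = smoothed(sum(feature_on), len(labels))
    score = 0.0
    # y1 = +1 is weighted by log(p); y2 = -1 has weight 1, as in formula (2).
    for y, weight in ((+1, math.log(p_imbalance)), (-1, 1.0)):
        in_class = [f for f, lab in zip(feature_on, labels) if lab == y]
        p_f_given_y = smoothed(sum(in_class), len(in_class))
        score += weight * math.log(p_f_given_y / p_f)
    return score


def keep_top_fraction(scores, fraction=0.7):
    """Rank feature names by descending relevance and keep the top fraction."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]
```

A feature that occurs equally often in both classes scores zero under this sketch, while a feature concentrated in the +1 class scores higher as `p_imbalance` grows.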

C. Using the selected software features, an SVM can be used to test the classification performance and build the final classification model. The software modules are classified, and the Gmeans of the classification results is used to verify the effectiveness and accuracy of the feature selection result. Finally, the prediction model is built from the optimal feature subset S obtained.

Gmeans = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{D}{C + D}}

Formula (4)

Here, A is the number of modules that are actually defective and correctly classified, B is the number that are actually defective but misclassified, C is the number that are actually defect-free but misclassified, and D is the number that are actually defect-free and correctly classified.
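Formula (4) can be evaluated directly from the four counts A, B, C, and D. The sketch below is only an illustration; the function name `gmeans` and the example counts are hypothetical, and the code assumes that each class contains at least one module so that neither denominator is zero.

```python
import math


def gmeans(a, b, c, d):
    """Gmeans = sqrt(A / (A + B)) * sqrt(D / (C + D)), as in formula (4)."""
    defective_rate = a / (a + b)  # share of defective modules classified correctly
    clean_rate = d / (c + d)      # share of defect-free modules classified correctly
    return math.sqrt(defective_rate) * math.sqrt(clean_rate)


# Hypothetical counts: 8 of 10 defective and 9 of 10 defect-free modules correct.
balanced_score = gmeans(8, 2, 1, 9)

# A classifier that labels every module defect-free scores 0 on an imbalanced
# set of 5 defective and 95 defect-free modules, despite 95% plain accuracy.
all_clean_score = gmeans(0, 5, 0, 95)
```

Because the measure multiplies the per-class rates, it penalizes a classifier that ignores the minority (defective) class, which is why it suits imbalanced defect data.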

Brief description of the drawings

Fig. 1 is a flow chart of the software defect data feature selection method based on mutual information.

Detailed description

The invention is described in further detail below with reference to Fig. 1.

Step 1: Obtain data from the software data set and preprocess it.

(1) First obtain the software feature set and the software module data, and perform label processing. The feature set is F = {f₁, f₂, …, f_m}. The software module data set is {X, Y}, where X = {x₁, x₂, …, x_n} and Y = {y₁, y₂} = {+1, -1}. If software module x_i is defect-free, then (x_i, y_i) = (x_i, -1); otherwise (x_i, y_i) = (x_i, +1).

(2) Classify the feature set to obtain three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively).

(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using the following formula.
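The label processing of item (1) above can be illustrated with a short sketch. The module records, their field names (`name`, `loc`, `defective`), and the function `label_modules` are hypothetical; the only part taken from the text is the mapping of defect-free modules to -1 and defective modules to +1. The sketch stops at counting the two classes and does not compute the imbalance coefficient p, whose formula is not reproduced in this text.

```python
from collections import Counter


def label_modules(modules):
    """Pair each module x_i with y_i = +1 (defective) or -1 (defect-free)."""
    return [(m, +1 if m["defective"] else -1) for m in modules]


# Hypothetical module records; real data would come from the software data set.
modules = [
    {"name": "parser", "loc": 420, "defective": True},
    {"name": "logger", "loc": 75, "defective": False},
    {"name": "config", "loc": 130, "defective": False},
    {"name": "cache", "loc": 210, "defective": False},
]

labelled = label_modules(modules)
class_counts = Counter(y for _, y in labelled)  # modules per class label
```

The class counts make the imbalance visible: in this toy data one module is defective and three are defect-free.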

Step 2: Remove redundant software features using mutual information theory

(1) Calculate, according to the following formula, the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y₁ and y₂. Sort the features in descending order of relevance and keep only the top 70% by relevance rank in the three feature sets; at this point t features remain in total.

μ(f_i) = log p × MI(f_i, y₁) + MI(f_i, y₂)

Wherein,

MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}, \quad MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}

(2) Again using mutual information theory, calculate the pairwise correlation between the features within each of the three screened feature sets to obtain the feature correlation matrices.

[L-class feature correlation matrix; M-class feature correlation matrix; H-class feature correlation matrix]

According to the following minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that the size of S is k.

f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)
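The redundancy-removal step (2) can be sketched as follows. The text does not spell out the exact removal procedure, so the greedy strategy here is an assumption: starting from the highest-ranked feature, the sketch repeatedly adds the feature that keeps the mean pairwise mutual information of the subset lowest, until S has k features. Features are assumed to be discretized, mutual information between two features uses the standard discrete definition, and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Standard discrete MI: sum of p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mi_matrix(columns):
    """Pairwise feature correlation matrix, keyed by (feature, feature)."""
    return {(a, b): mutual_information(columns[a], columns[b])
            for a in columns for b in columns}


def mean_redundancy(subset, mi):
    """(1 / |S|^2) * sum of MI(f_i, f_j) over all f_i, f_j in S."""
    return sum(mi[a, b] for a in subset for b in subset) / len(subset) ** 2


def select_subset(ranked, mi, k):
    """Greedy selection (an assumption): grow S while keeping redundancy lowest."""
    selected = [ranked[0]]
    while len(selected) < min(k, len(ranked)):
        rest = [f for f in ranked if f not in selected]
        # Ties are broken in favor of the higher-ranked feature.
        selected.append(min(rest, key=lambda f: mean_redundancy(selected + [f], mi)))
    return selected
```

For instance, if feature b is an exact copy of the top-ranked feature a and feature c is independent of a, then with k = 2 the sketch keeps a and c and drops the redundant b.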

Step 3: Using the selected software features, test the classification performance with an SVM and build the final classification model.

(1) Input: the optimal feature subset S and the module data; output: the classification results.

(2) For software defect prediction, the Gmeans of the classification results is used to verify the performance of the feature selection result. Finally, the prediction model is built from the optimal feature subset S obtained.

Gmeans = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{D}{C + D}}

The invention provides a software defect data feature selection method based on mutual information. It should be noted that those of ordinary skill in the art may make improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with the prior art.

Claims

1. A software defect data feature selection method based on mutual information, characterized in that it mainly comprises the following three steps:

A. Obtain data from the software data set and preprocess it

(1) Collect the software data, comprising software feature data and software module data, and preprocess it; divide the software module data into a training set and a test set; the invention uses ten-fold cross-validation, in which the data set is divided into ten parts, nine of which are used for training and one for accuracy testing;

(2) Classify the feature set according to existing empirical knowledge to obtain three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively);

(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using the following formula;

B. Remove redundant software features using mutual information theory

(1) Taking the imbalance of the data set fully into account, introduce the imbalance coefficient and calculate, according to the following formula, the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y₁ and y₂; sort the features in descending order of relevance and keep only the top 70% by relevance rank in the three feature sets; at this point t features remain in total;

μ(f_i) = log p × MI(f_i, y₁) + MI(f_i, y₂)

Wherein,

MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}, \quad MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}

In the formulas, p(f_i) denotes the total probability that feature f_i occurs across the two classes, and p(f_i|y₁) and p(f_i|y₂) denote the probabilities that feature f_i occurs in each of the two classes respectively;

(2) Again using mutual information theory, calculate the pairwise correlation between the features within each of the three screened feature sets to obtain the feature correlation matrices;

According to the minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that the size of S is k;

f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)

C. Using the selected software features, an SVM can be used to test the classification performance and build the final classification model; for software defect prediction, the Gmeans of the classification results is used to verify the effectiveness and accuracy of the feature selection result; finally, the classification model is built from the optimal feature subset S obtained;

Gmeans = \sqrt{\frac{A}{A + B}} \times \sqrt{\frac{D}{C + D}}