CN105389598A

CN105389598A - Feature selecting and classifying method for software defect data

Info

Publication number: CN105389598A
Application number: CN201511003241.2A
Authority: CN
Inventors: 李克文; 邹晶杰
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2015-12-28
Filing date: 2015-12-28
Publication date: 2016-03-09

Abstract

The invention discloses a feature selection and classification method for software defect data that can guide the entire process of classifying such data. The method comprises the following steps: A. obtaining data from a software data set and preprocessing it, where preprocessing includes labeling the data and dividing the software features into three categories according to existing empirical knowledge; B. using mutual information theory to find the features with maximum relevance to the class and minimum correlation with other features, and adding them to an optimal software feature set; C. classifying with the selected software features using a classifier and arranging the features in ascending order of their classification results; and D. predicting the defects of software modules using a two-dimensional loop cascade AdaBoost and the optimal feature subset, so that defect-free samples are accurately removed in real time and operation time is reduced. The method overcomes the imbalance of software defect data: majority-class samples are removed in real time to balance the data set, which reduces operation time and improves operation efficiency.

Description

Feature selection and classification method for software defect data

Technical field

The invention belongs to the field of software engineering applications and specifically relates to a feature selection and classification method for software defect data.

Background technology

At present, software systems keep growing in scale and in logical complexity. As the number of defective modules in software increases, software reliability is threatened, software quality suffers, and the resulting losses can be immeasurable. Software defect prediction, an important means of guiding and evaluating software testing work, can accurately predict how defects are distributed, which is of real practical significance for improving software quality. For a software system, sound defect prediction can estimate the number and distribution of defects that have not yet been found but still exist. The key to software defect prediction is finding the defective modules. This is essentially a binary classification problem: software modules are divided into two classes, "defective" and "defect-free". Classification presupposes feature selection, and modules are classified according to the selected optimal feature subset. In practice, however, the software defect prediction process faces the following two difficulties:

(1) Software features contain a large number of redundant features

In 2004, NASA released its public software data sets (NASA MDP). The software features extracted from source code fall into three main categories: LOC, McCabe and Halstead. Within each category, only the basic features are extracted directly from source code; all other features are computed indirectly from these basic feature values. Experiments have also shown that as few as three important software features are enough to predict whether a software module contains defects. Each category therefore contains many redundant features. Involving large numbers of redundant or irrelevant features in the computation inevitably lowers speed and efficiency. Dimensionality reduction of the software features is therefore needed: following the feature categories, the features with the greatest influence on software defect prediction are selected within each category.

(2) Software module data are severely imbalanced

In real software, the number of "defective" modules (the minority class) is far smaller than the number of "defect-free" modules (the majority class). Software defect prediction is therefore also a classification problem on imbalanced data, which has been a research hotspot in data mining in recent years. In prediction, the goal is to detect the "defective" modules so that they can be repaired. However, software data sets contain a large number of majority-class, defect-free modules, which consume substantial time and resources during computation. Detecting the "defect-free" majority class accurately and as early as possible and removing it from the data set, thereby reducing the subsequent computation, is therefore significant for the efficiency of the whole classification process.

To address these two problems, a complete feature selection and classification method suited to the characteristics of software data is developed, which is significant for improving software defect prediction and reducing operation time.

Summary of the invention

The object of the invention is to solve the low efficiency and long running time of existing feature selection and classification methods for software defect data by providing a feature selection and classification method for software defect data that reduces operation time and improves operation efficiency.

To achieve the above object, the technical solution of the invention mainly comprises the following four steps:

A. Obtain data from the software data set and preprocess the data

(1) The data comprise a software feature set and software modules. The software module data are divided into a training set and a test set for training and testing. The invention adopts ten-fold cross-validation: the data set is divided into ten parts, of which nine are used for training and one for accuracy testing. The data are also labeled.

(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC category, the McCabe category and the Halstead category.
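The ten-fold split in step A can be sketched as follows. This is a minimal illustration rather than the patent's own code: the function name `ten_fold_splits`, the shuffling with a fixed seed, and the round-robin assignment of samples to folds are all assumptions made here.

```python
import random

def ten_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example: 25 modules give ten folds of two or three test samples each.
splits = list(ten_fold_splits(25))
```

Each sample appears in exactly one test fold, so every module is used for accuracy testing once and for training nine times.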

B. Obtain the optimal software feature set according to mutual information theory

(1) According to mutual information theory, compute the correlation between each feature f_i in the three feature sets and the classes y_1 and y_2, and sort the features in descending order of correlation. In each of the three feature sets, keep only the features ranked in the top 50% by correlation, yielding three reduced feature subsets.

(2) For each of the three reduced feature subsets, compute the correlation between the features and remove the features that are highly correlated with the features ranked in the top 30%. The resulting optimal feature subset is S, with size t and S = {L, M, H}.
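The two filtering passes of step B might look like the sketch below for a single feature category. Several points are assumptions, not statements from the text: features are treated as discrete values, mutual information is also used as the feature-to-feature correlation measure, and the cutoff `redundancy_threshold` is invented here because no numeric threshold for "highly correlated" is given. The feature names are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def select_features(columns, labels, keep=0.5, top=0.3, redundancy_threshold=0.5):
    """columns: dict of feature name -> discrete values for one category (L, M or H)."""
    # Pass 1: rank by relevance to the class label, keep the top `keep` fraction.
    ranked = sorted(columns, key=lambda f: mutual_information(columns[f], labels),
                    reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep))]
    # Pass 2: the top `top` fraction of the kept features serve as references;
    # drop any lower-ranked feature that shares too much information with one.
    n_ref = max(1, int(len(kept) * top))
    refs, rest = kept[:n_ref], kept[n_ref:]
    survivors = [f for f in rest
                 if all(mutual_information(columns[f], columns[r]) < redundancy_threshold
                        for r in refs)]
    return refs + survivors

# Toy data: "a_copy" duplicates "a"; "b" is weakly informative; the rest are noise.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "a":      [1, 1, 1, 1, 0, 0, 0, 0],
    "a_copy": [1, 1, 1, 1, 0, 0, 0, 0],
    "b":      [1, 1, 1, 0, 1, 0, 0, 0],
    "noise":  [1, 0, 1, 0, 1, 0, 1, 0],
    "noise2": [1, 1, 0, 0, 1, 1, 0, 0],
    "const":  [0, 0, 0, 0, 0, 0, 0, 0],
}
selected = select_features(columns, labels)
```

On this toy data the first pass keeps `a`, `a_copy` and `b`, and the second pass drops `a_copy` because it carries the same information as the top-ranked `a`.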

C. Sort the selected software features by classification performance

(1) Feed the optimal features obtained into an SVM one by one and train it.

(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value, then take one element from each feature category in order to form the triples of the optimal feature subset, denoted (l, h, m). The optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.
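The ranking and triple construction in step C can be illustrated as below. The sketch assumes the usual definition of G-mean as the geometric mean of sensitivity and specificity, which the text does not spell out, and it assumes that the per-feature G-mean scores have already been obtained from the trained classifier. The function names and feature names are hypothetical.

```python
from math import sqrt

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed G-mean definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)

def build_triples(scores_l, scores_h, scores_m):
    """Each argument maps feature name -> G-mean; returns (l, h, m) triples."""
    def ascending(scores):
        return sorted(scores, key=scores.get)
    # Pair the i-th ranked feature of each category into the i-th triple.
    return list(zip(ascending(scores_l), ascending(scores_h), ascending(scores_m)))

score = g_mean([1, 1, -1, -1, -1, -1], [1, -1, -1, -1, -1, 1])
triples = build_triples(
    {"l1": 0.6, "l2": 0.4},   # hypothetical LOC-category scores
    {"h1": 0.3, "h2": 0.7},   # hypothetical Halstead-category scores
    {"m1": 0.5, "m2": 0.2},   # hypothetical McCabe-category scores
)
```

Because each category is sorted in ascending order, the first triple collects the lowest-scoring feature of each category, which matches the ordering the cascade in step D relies on.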

D. Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S

(1) Set the cascade structure to five stages, each of which is an AdaBoost classifier. Each AdaBoost classifier is a weighted ensemble of several weak classifiers (classification error rate below 0.5).

(2) Pick the first feature triple (l, h, m) from S; the G-mean values of its three elements are the smallest in S. Input this feature triple and perform the first-stage classification. Samples identified as defective go directly to the next stage, while samples identified as defect-free enter the loop structure of the current stage for a second discrimination. If a sample is identified as defective in the second discrimination, it proceeds to the next stage of the cascade; otherwise, the sample is discarded.

(3) The second stage then draws the feature triples ranked second, third, and so on from S and repeats the operations above; this continues by analogy through the fifth stage, using five feature triples in total. The classification result is obtained at the end.
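The control flow of the cascade in step D is sketched below. It shows only the routing of samples, not AdaBoost training: each stage is represented by two plain callables, `primary` and `recheck`, standing in for the trained stage classifier and its in-stage second discrimination. How the second discrimination differs from the first is not specified in the text, so the separate `recheck` callable, the threshold rules and the metric names `loc` and `vg` are all assumptions for illustration.

```python
def cascade_predict(samples, stages):
    """Run samples through the cascade and return those still flagged as defective.

    stages: list of (primary, recheck) callables; each returns True when a
    sample looks defective to that stage.
    """
    remaining = list(samples)
    for primary, recheck in stages:
        survivors = []
        for sample in remaining:
            if primary(sample):
                survivors.append(sample)   # flagged defective: goes to the next stage
            elif recheck(sample):
                survivors.append(sample)   # rescued by the in-stage second check
            # otherwise the sample is judged defect-free and removed from the data
        remaining = survivors
    return remaining

# Two stand-in stages with hypothetical threshold rules on hypothetical metrics.
stages = [
    (lambda s: s["loc"] > 100, lambda s: s["loc"] > 80),
    (lambda s: s["vg"] > 10, lambda s: s["vg"] > 8),
]
samples = [
    {"name": "A", "loc": 150, "vg": 12},  # passes both primaries
    {"name": "B", "loc": 90, "vg": 9},    # passes only via both rechecks
    {"name": "C", "loc": 50, "vg": 20},   # removed in stage 1
    {"name": "D", "loc": 120, "vg": 5},   # removed in stage 2
]
flagged = [s["name"] for s in cascade_predict(samples, stages)]
```

Samples removed at an early stage never reach later stages, which is how the cascade shrinks the majority class and saves computation.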

Brief description of the drawings

Fig. 1 is a flow chart of the feature selection and classification method for software defect data.

Fig. 2 shows the two-dimensional loop cascade AdaBoost software prediction model.

Embodiment

The invention is described in further detail below with reference to Fig. 1.

Step 1: Obtain data from the software data set and preprocess the data

(1) First obtain the software feature set and the software module data, and label the training set. Here the feature set is F = {f_1, f_2, ..., f_m} and the software module data set is {X, Y}, where X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2} = {+1, -1}. If software module x_i is defect-free, then (x_i, y_i) = (x_i, -1); otherwise, (x_i, y_i) = (x_i, +1).

(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC category, the McCabe category and the Halstead category, abbreviated L, M and H.
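The labeling and grouping of Step 1 can be sketched in a few lines. The sketch assumes each module comes with a recorded defect count, which the text does not state, and the feature names and their category assignments below are hypothetical placeholders rather than names from any particular data set.

```python
def label_modules(defect_counts):
    """Return +1 for modules with at least one recorded defect, -1 otherwise."""
    return [1 if count > 0 else -1 for count in defect_counts]

def group_features(feature_names, category_of):
    """Split feature names into the L, M and H sets using a name -> category lookup."""
    groups = {"L": [], "M": [], "H": []}
    for name in feature_names:
        groups[category_of[name]].append(name)
    return groups

# Hypothetical feature names and category assignments, for illustration only.
CATEGORY_OF = {
    "loc_total": "L",
    "loc_comments": "L",
    "cyclomatic_complexity": "M",
    "essential_complexity": "M",
    "halstead_volume": "H",
    "halstead_effort": "H",
}

labels = label_modules([0, 3, 0, 1])
groups = group_features(list(CATEGORY_OF), CATEGORY_OF)
```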

Step 2: Obtain the optimal software feature set according to mutual information theory

According to mutual information theory, the correlation between any two variables can be computed by the following formula:

MI(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)}

where p(x) and p(y) are the marginal probability distributions of x and y, and p(x, y) is their joint probability distribution.
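For reference, the expression given in the text is the log-ratio for a single pair of outcomes (x, y). The standard mutual information between two discrete variables averages that log-ratio over the joint distribution. The capital letters X and Y are notation introduced here and do not appear in the original text:

```latex
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x) \, p(y)}
```

This quantity is zero when X and Y are independent, since then p(x, y) = p(x) p(y) for every pair, and it grows as knowing one variable tells more about the other.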

Step 3: Sort the selected software features by classification performance

(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value. Take one element from each feature category in order to form the triples of the optimal feature subset, denoted (l, h, m). The optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.

Step 4: Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S

The invention provides a feature selection and classification method for software defect data. It should be noted that those of ordinary skill in the art can make a number of improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention. Any component not explicitly specified in this embodiment can be implemented with the prior art.

Claims

1. A feature selection and classification method for software defect data, characterized in that it mainly comprises the following four steps:

A. Obtain data from the software data set and preprocess the data

(1) The data comprise a software feature set and software modules; the software module data are divided into a training set and a test set for training and testing; the invention adopts ten-fold cross-validation, in which the data set is divided into ten parts, nine used for training and one for accuracy testing; and the data are labeled;

(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC category, the McCabe category and the Halstead category;

B. Obtain the optimal software feature set according to mutual information theory

(1) According to mutual information theory, compute the correlation between each feature f_i in the three feature sets and the classes y_1 and y_2, and sort the features in descending order of correlation; in each of the three feature sets, keep only the features ranked in the top 50% by correlation, yielding three reduced feature subsets;

(2) For each of the three reduced feature subsets, compute the correlation between the features and remove the features that are highly correlated with the features ranked in the top 30%; the resulting optimal feature subset is S, with size t and S = {L, M, H};

C. Sort the selected software features by classification performance

(1) Feed the optimal features obtained into an SVM one by one and train it;

(2) Apply the trained classification model to the test set; after the classification results are obtained, sort the software features in ascending order of G-mean value, then take one element from each feature category in order to form the triples of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as:

S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M};

D. Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S

(1) Set the cascade structure to five stages, each of which is an AdaBoost classifier; each AdaBoost classifier is a weighted ensemble of several weak classifiers (classification error rate below 0.5);

(2) Pick the first feature triple (l, h, m) from S, the G-mean values of its three elements being the smallest in S; input this feature triple and perform the first-stage classification; samples identified as defective go directly to the next stage, while samples identified as defect-free enter the loop structure of the current stage for a second discrimination; if a sample is identified as defective in the second discrimination, it proceeds to the next stage of the cascade; otherwise, the sample is discarded;

(3) The second stage then draws the feature triples ranked second, third, and so on from S and repeats the operations above; this continues by analogy through the fifth stage, using five feature triples in total; the classification result is obtained at the end.