CN115269377A - Cross-project software defect prediction method based on optimization instance selection - Google Patents

Cross-project software defect prediction method based on optimization instance selection

Info

Publication number
CN115269377A
Authority
CN
China
Prior art keywords
instance
index
constructing
optimization
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210717428.2A
Other languages
Chinese (zh)
Other versions
CN115269377B (en)
Inventor
张瑞年
王楚越
王晨宇
尹思文
王超
郭伟琪
文万志
胡彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202210717428.2A
Publication of CN115269377A
Application granted
Publication of CN115269377B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3684Test management for test design, e.g. generating new test cases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a cross-project software defect prediction method based on optimized instance selection, which comprises the following steps: S1, constructing a project vector set PVS; S2, constructing a target instance optimization index IPI; S3, constructing a pre-training set TPRED; S4, constructing an optimization index TPOI of the target project; S5, constructing a training set BOD selected based on optimized instances; and S6, constructing the cross-project software defect prediction method BOICP based on optimized instance selection.

Description

Cross-project software defect prediction method based on optimization instance selection
Technical Field
The invention belongs to the technical field of software defect prediction, and particularly relates to a cross-project software defect prediction method based on optimized instance selection.
Background
Researchers predict software defects by using historical data; however, a new system often has insufficient history. One way to solve this problem is to select historical data from other projects, use it to build a defect prediction model, and predict defects in the new project.
For a project with a large amount of data, researchers must consider how to select the instance data best suited to the target project: the better the source instance data matches the target project data, the more accurate the resulting defect prediction model.
Disclosure of Invention
The invention aims to solve the technical problem of providing a cross-project software defect prediction method based on optimized instance selection.
In order to solve the above technical problem, an embodiment of the present invention provides a cross-project software defect prediction method based on optimization instance selection, including the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
S3, constructing a pre-training set TPRED;
s4, constructing an optimized index TPOI of the target project;
S5, constructing a training set BOD selected based on optimized instances;
and S6, constructing the cross-project software defect prediction method BOICP based on optimized instance selection.
Wherein, step S1 includes the following steps:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
S1.3, constructing a traditional metric set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on the open source data history, the project source code syntax structure, and the source code abstract syntax tree, wherein WMC represents the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBM the coupling between methods, AMC the average method complexity, Ca the afferent coupling, Ce the efferent coupling, Max_CC the maximum McCabe cyclomatic complexity, Avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code;
S1.4, processing all instances in the source project according to the step S1.3 to obtain a source project traditional metric vector set SCPIVS = [instance_1, instance_2, …, instance_i], where i = 1, 2, 3, …, n;
S1.5, processing all instances in the target project according to the step S1.3 to obtain a target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m;
S1.6, constructing source project instance labels SLABEL = [stag_1, stag_2, …, stag_i] based on the open source data history, where i = 1, 2, 3, …, n; each label corresponds to an instance in the source project traditional metric vector set SCPIVS of step S1.4;
S1.7, constructing target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j] based on the open source data history, where j = 1, 2, 3, …, m; each label corresponds to an instance in the target project traditional metric vector set TCPIVS of step S1.5;
s1.8, constructing a project vector set PVS = { SCPIVS, SLABEL, TCPIVS, TLABEL }.
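The assembly of the project vector set in step S1 can be sketched in Python. This is a minimal illustration, not code from the patent: the function name build_pvs and the (metric_vector, label) row format are assumptions, and each 20-element metric vector is taken to follow the traditional metric set listed above, with label 1 meaning defective and 0 meaning clean.

```python
# Hypothetical sketch of step S1 (names are illustrative, not from the patent).
# Each project is assumed to be parsed into rows of (metric_vector, label),
# where the metric vector follows the traditional metric set {WMC, ..., LOC}.

def build_pvs(source_rows, target_rows):
    """Split each project's rows into metric vectors and defect labels,
    yielding PVS = {SCPIVS, SLABEL, TCPIVS, TLABEL} as in step S1.8."""
    scpivs = [vec for vec, _ in source_rows]  # source metric vectors (S1.4)
    slabel = [tag for _, tag in source_rows]  # source labels (S1.6)
    tcpivs = [vec for vec, _ in target_rows]  # target metric vectors (S1.5)
    tlabel = [tag for _, tag in target_rows]  # target labels (S1.7)
    return {"SCPIVS": scpivs, "SLABEL": slabel,
            "TCPIVS": tcpivs, "TLABEL": tlabel}
```

In practice the rows would come from a PROMISE-style defect data set, one row per class instance.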
Wherein, step S2 includes the following steps:
s2.1, constructing an optimized index empty list IPI and a source instance index list ASI;
s2.2, selecting a target instance vector;
S2.3, if the optimization index list IPI is empty, constructing the global feature vector GFV of the instance training set as the target instance vector of step S2.2; otherwise, GFV is the set of per-metric standard deviations over all instances in the instance training set;
S2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL;
s2.5, calculating the Euclidean distance between each instance in the source instance library SIL to be selected and the GFV, and returning an index min-index corresponding to the minimum Euclidean distance;
s2.6, adding the min-index into the IPI in the optimized index list in the step S2.1;
s2.7, deleting the min-index in the ASI;
S2.8, setting the number of source instances selected by each target instance as k, and cyclically executing steps S2.3 to S2.7 until the length of the optimization index list IPI reaches k;
and S2.9, obtaining the target instance optimization index IPI after the step S2.8 is executed.
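The greedy loop of step S2 can be sketched as follows. The patent text is terse about what "the instance training set" contains in step S2.3; the sketch below assumes (an interpretation, not a statement of the patent) that the GFV is recomputed as the per-metric standard deviation over the target vector plus the instances already selected. All names are illustrative.

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_ipi(target_vec, scpivs, k):
    """Greedily pick k source-instance indices (the IPI) for one target
    instance, following steps S2.1-S2.9. Assumption: on the first pass the
    GFV is the target vector itself; afterwards it is the per-metric
    standard deviation over the target vector plus the instances selected
    so far."""
    ipi = []                               # optimization index list (S2.1)
    asi = list(range(len(scpivs)))         # remaining source indices (S2.1)
    pool = [target_vec]                    # vectors used to recompute GFV
    while len(ipi) < k and asi:
        if not ipi:
            gfv = target_vec               # S2.3, first iteration
        else:                              # S2.3, later iterations
            gfv = [statistics.pstdev(col) for col in zip(*pool)]
        # S2.4-S2.5: nearest remaining source instance to the GFV
        min_index = min(asi, key=lambda i: euclidean(scpivs[i], gfv))
        ipi.append(min_index)              # S2.6
        asi.remove(min_index)              # S2.7
        pool.append(scpivs[min_index])
    return ipi                             # S2.9
```

The population standard deviation (pstdev) is an arbitrary choice here; the patent does not specify which estimator is meant.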
Wherein, step S3 includes the following steps:
s3.1, executing each instance in the target project according to the step S2 to obtain a target instance optimization index IPI of each target instance;
s3.2, combining and de-duplicating the optimized indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimized index TIPI;
S3.3, selecting instances from the source project traditional metric vector set SCPIVS of step S1 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the instance vector set TPRED-D of the pre-training set;
S3.4, selecting labels from the source project instance labels SLABEL of step S1 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the label set TPRED-L of the pre-training set;
s3.5, constructing a pre-training set TPRED = { TPRED-D, TPRED-L }.
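The merge-and-de-duplicate of step S3 can be sketched in a few lines; build_tpred and the dict layout are illustrative names, not from the patent.

```python
def build_tpred(all_ipis, scpivs, slabel):
    """Sketch of step S3: merge the per-target index lists, de-duplicate
    while preserving order (TIPI, step S3.2), then slice out the
    pre-training set TPRED = {TPRED-D, TPRED-L} (steps S3.3-S3.5)."""
    tipi = list(dict.fromkeys(i for ipi in all_ipis for i in ipi))
    return {"TPRED-D": [scpivs[i] for i in tipi],   # S3.3
            "TPRED-L": [slabel[i] for i in tipi]}   # S3.4
```

dict.fromkeys is used because it removes duplicates while keeping first-seen order, which keeps the vector/label alignment deterministic.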
Wherein, step S4 includes the following steps:
s4.1, combining the example vector set TPRED-D of the pre-training set obtained in the step S3 with the label set TPRED-L of the pre-training set according to columns, and placing the label set in the last column;
S4.2, calculating the correlation between each metric and the label column by using the Spearman correlation coefficient to obtain a correlation list CList;
S4.3, taking absolute values of all elements of the correlation list CList in step S4.2, sorting them in descending order, and returning the corresponding feature indexes;
S4.4, setting the number of selected correlation feature indexes as q;
S4.5, selecting from the feature indexes returned in step S4.3 according to the number q of correlation feature indexes in step S4.4, and constructing the source project correlation feature set SPTFS from the obtained feature indexes;
S4.6, constructing the target project correlation feature set TPTFS from the target project traditional metric vector set TCPIVS and the target project instance labels TLABEL of step S1, according to steps S4.1 to S4.5;
S4.7, calculating the Euclidean distance between every source instance in the source project correlation feature set SPTFS and the correlation feature vector of one instance in the target project correlation feature set TPTFS, returning the index list sorted by ascending distance, and, with the number of instances selected from SPTFS set to p, obtaining the p indexes selected for that target instance;
and S4.8, processing all target instances in the target project correlation feature set TPTFS according to step S4.7 to obtain an optimization index set, and de-duplicating it to obtain the optimization index TPOI of the target project.
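Step S4 combines Spearman-based feature ranking with nearest-neighbour instance selection. The sketch below implements both in plain Python; the helper names (spearman, top_q_features, build_tpoi) are assumptions, and the Spearman coefficient is computed in the standard way as the Pearson correlation of average ranks.

```python
def _ranks(xs):
    """Average ranks (ties share their mean rank), 1-based."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

def top_q_features(vectors, labels, q):
    """S4.2-S4.5: rank metrics by |Spearman| against the label column."""
    n = len(vectors[0])
    clist = [spearman([v[f] for v in vectors], labels) for f in range(n)]
    return sorted(range(n), key=lambda f: abs(clist[f]), reverse=True)[:q]

def build_tpoi(sptfs, tptfs, p):
    """S4.7-S4.8: for each target instance take the p nearest source
    instances (Euclidean, in the reduced feature space), de-duplicated."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    picked = []
    for t in tptfs:
        order = sorted(range(len(sptfs)), key=lambda i: dist(sptfs[i], t))
        picked.extend(order[:p])
    return list(dict.fromkeys(picked))
```

A constant metric column has zero rank variance; the sketch returns a correlation of 0.0 in that case so such features sort last.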
Wherein, step S5 includes the following steps:
S5.1, selecting from the instance vector set TPRED-D of the pre-training set in step S3 by using the optimization index TPOI of the target project obtained in step S4, to obtain the training feature set BOD-D selected based on optimized instances;
S5.2, selecting from the label set TPRED-L of the pre-training set in step S3 by using the optimization index TPOI of the target project obtained in step S4, to obtain the label set BOD-L selected based on optimized instances;
and S5.3, constructing the training set BOD = {BOD-D, BOD-L} selected based on optimized instances.
Wherein, step S6 comprises the following steps:
s6.1, obtaining an item vector set PVS = { SCPIVS, SLABEL, TCPIVS, TLABEL } through the step S1;
s6.2, obtaining a target instance optimization index IPI through the step S2;
s6.3, obtaining an example vector set TPRED-D of the pre-training set and a label set TPRED-L of the pre-training set through the step S3;
S6.4, obtaining the optimization index TPOI of the target project through the step S4;
S6.5, obtaining the training feature set BOD-D and the label set BOD-L selected based on optimized instances through the step S5;
S6.6, performing model training on the training feature set BOD-D and the label set BOD-L of step S6.5 by using a Logistic classification algorithm;
S6.7, performing defect prediction on the target project traditional metric vector set TCPIVS obtained in step S1 by using the model trained in step S6.6 to obtain a prediction label set PRED_LABEL, and computing the f-score from PRED_LABEL and the target project instance labels TLABEL;
and S6.8, obtaining the cross-project software defect prediction method BOICP based on optimized instance selection.
The technical scheme of the invention has the following beneficial effects:
the invention provides a cross-project software defect prediction method based on optimization case selection.
Drawings
FIG. 1 is a block flow diagram of the present invention;
FIG. 2 is a flow diagram of constructing the training set BOD selected based on optimized instances in the present invention;
FIG. 3 is a graph of selected example numbers at different k in the present invention;
FIG. 4 is a graph of f-score obtained using Logistic at different k's according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present invention provides a cross-project software defect prediction method based on optimized instance selection, comprising the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
s3, constructing a pre-training set TPRED;
s4, constructing an optimized index TPOI of the target project;
S5, constructing a training set BOD selected based on optimized instances;
and S6, constructing the cross-project software defect prediction method BOICP based on optimized instance selection.
Step S1, the concrete steps of constructing the project vector set PVS are as follows:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
S1.3, constructing a feature set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on the open source data history, the project source code syntax structure, and the source code abstract syntax tree, wherein WMC represents the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBM the coupling between methods, AMC the average method complexity, Ca the afferent coupling, Ce the efferent coupling, Max_CC the maximum McCabe cyclomatic complexity, Avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code.
S1.4, processing all instances in the source project according to the above steps to obtain a source project traditional metric vector set SCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_i], where i = 1, 2, 3, …, n;
S1.5, processing all instances in the target project according to the same steps to obtain a target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m.
S1.6, constructing source project instance labels SLABEL = [stag_1, stag_2, …, stag_i] based on the open source data history, where i = 1, 2, 3, …, n; each label corresponds to an instance in the source project traditional metric vector set SCPIVS;
S1.7, constructing target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j] based on the open source data history, where j = 1, 2, 3, …, m; each label corresponds to an instance in the target project traditional metric vector set TCPIVS.
S1.8, constructing a project vector set PVS = { SCPIVS, SLABEL, TCPIVS, TLABEL }.
S2, the specific steps of constructing the target instance optimization index IPI are as follows:
s2.1, constructing an optimized index empty list IPI and a source instance index list ASI.
S2.2, selecting a target instance vector;
And S2.3, if the optimization index list IPI is empty, constructing the global feature vector GFV of the instance training set as the target instance vector of step S2.2; otherwise, GFV is the set of per-metric standard deviations over all instances in the instance training set.
S2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL.
S2.5, calculating the Euclidean distance between each instance in the source instance library SIL to be selected and the GFV, and returning the index min-index corresponding to the minimum Euclidean distance.
S2.6, adding the min-index into the IPI in the optimized index list in the step S2.1;
s2.7, deleting the min-index in the ASI.
S2.8, setting the number of source instances selected by each target instance to k = 5, and cyclically executing steps S2.3 to S2.7 until the length of the optimization index list IPI reaches k;
and S2.9, obtaining the target instance optimization index IPI after the step S2.8 is executed.
S3, constructing a pre-training set TPRED specifically comprises the following steps:
s3.1, executing each instance in the target project according to the steps S2.1-S2.9 to obtain a target instance optimization index IPI of each target instance;
and S3.2, combining and de-duplicating the optimized indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimized index TIPI.
S3.3, selecting instances from the source project traditional metric vector set SCPIVS of step S1.8 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the instance vector set TPRED-D of the pre-training set;
and S3.4, selecting labels from the source project instance labels SLABEL of step S1.6 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the label set TPRED-L of the pre-training set.
S3.5, constructing a pre-training set TPRED = { TPRED-D, TPRED-L }.
S4, the specific steps of constructing the optimized index TPOI of the target item are as follows:
s4.1, combining the example vector set TPRED-D of the pre-training set obtained in the step S3.5 and the label set TPRED-L of the pre-training set according to columns, and placing the label set in the last column;
and S4.2, calculating the direct correlation between each metric element and the last list of tags by using the spearman to obtain a correlation list CList.
S4.3, sorting all the elements of the correlation list CList in the step S4.2 from big to small after taking absolute values, and returning a feature corresponding index;
s4.4, setting and selecting the number 10 of the correlation characteristic indexes;
and S4.5, selecting the feature index returned in the step S4.3 by using the number of the relevant feature indexes in the step S4.4, and constructing a source item relevance feature set SPTFS by using the obtained feature index.
S4.6, constructing the target project correlation feature set TPTFS from the target project traditional metric vector set TCPIVS of step S1.5 and the target project instance labels TLABEL of step S1.7, according to steps S4.1 to S4.5.
S4.7, calculating the Euclidean distance between every source instance in the source project correlation feature set SPTFS and the correlation feature vector of one instance in the target project correlation feature set TPTFS, returning the index list sorted by ascending distance, and, with the number of instances selected from SPTFS set to p = 2, obtaining the 2 indexes selected for that target instance.
And S4.8, processing all target instances in the target project correlation feature set TPTFS according to step S4.7 to obtain an optimization index set, and de-duplicating it to obtain the optimization index TPOI of the target project.
S5, constructing the training set BOD selected based on optimized instances, with the following specific steps:
S5.1, selecting from the instance vector set TPRED-D of the pre-training set in step S3.3 by using the optimization index TPOI of the target project obtained in step S4, to obtain the training feature set BOD-D selected based on optimized instances.
S5.2, selecting from the label set TPRED-L of the pre-training set in step S3.4 by using the optimization index TPOI of the target project obtained in step S4, to obtain the label set BOD-L selected based on optimized instances.
And S5.3, constructing the training set BOD = {BOD-D, BOD-L} selected based on optimized instances.
A flow chart for constructing a BOD for a training set selected based on an optimization instance is shown in fig. 2.
S6, constructing a cross-project software defect prediction method BOICP based on optimization instance selection, and specifically comprising the following steps:
ivy-2.0 is selected as the source project and synapse-1.2 as the target project. The source project traditional metric vector set SCPIVS and source project instance labels SLABEL are constructed from the source project instances, and the target project traditional metric vector set TCPIVS and target project instance labels TLABEL are constructed from the target project instances.
The target instance optimization index IPI for each choice of the number of selected source instances is obtained by the method for constructing the target instance optimization index defined above.
The instance vector set TPRED-D and the label set TPRED-L of the pre-training set are obtained by the pre-training set construction method defined above.
The optimization index TPOI of the target project is obtained by the method for constructing the target project optimization index defined above.
The number of selected source instances k is set in the range of 1 to 5, and the training set BOD = {BOD-D, BOD-L} selected based on optimized instances is obtained for each k.
A Logistic classifier is used to build a classification model on the training set BOD selected based on optimized instances and to make predictions. Experiments show that the model achieves an f-score of at most 0.343, compared with 0.149 without the instance selection method; the model built with this method therefore outperforms the model built without instance selection, demonstrating the effectiveness of the cross-project software defect prediction method based on optimized instance selection.
The number of selected instances at different k is shown in fig. 3.
The f-score obtained using Logistic at different k is shown in FIG. 4.
According to the method, a global feature vector is constructed for each target instance and used to select instances from the source project. Correlation analysis is then applied within the selected training set, and the correlation features of the instances are used to further select source instances. All selected source instances form the training data set, from which the cross-project defect prediction model is built, achieving a better cross-project defect prediction effect.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A cross-project software defect prediction method based on optimization instance selection is characterized by comprising the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
s3, constructing a pre-training set TPRED;
s4, constructing an optimized index TPOI of the target project;
S5, constructing a training set BOD selected based on optimized instances;
and S6, constructing the cross-project software defect prediction method BOICP based on optimized instance selection.
2. The optimization instance selection-based cross-project software defect prediction method according to claim 1, wherein the step S1 comprises the steps of:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
S1.3, constructing a traditional metric set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on the open source data history, the project source code syntax structure, and the source code abstract syntax tree, wherein WMC represents the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBM the coupling between methods, AMC the average method complexity, Ca the afferent coupling, Ce the efferent coupling, Max_CC the maximum McCabe cyclomatic complexity, Avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code;
S1.4, processing all instances in the source project according to the step S1.3 to obtain a source project traditional metric vector set SCPIVS = [instance_1, instance_2, …, instance_i], where i = 1, 2, 3, …, n;
S1.5, processing all instances in the target project according to the step S1.3 to obtain a target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m;
S1.6, constructing source project instance labels SLABEL = [stag_1, stag_2, …, stag_i] based on the open source data history, where i = 1, 2, 3, …, n; each label corresponds to an instance in the source project traditional metric vector set SCPIVS of step S1.4;
S1.7, constructing target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j] based on the open source data history, where j = 1, 2, 3, …, m; each label corresponds to an instance in the target project traditional metric vector set TCPIVS of step S1.5;
s1.8, constructing a project vector set PVS = { SCPIVS, SLABEL, TCPIVS, TLABEL }.
3. The optimization instance selection-based cross-project software defect prediction method according to claim 1, wherein the step S2 comprises the steps of:
s2.1, constructing an optimized index empty list IPI and a source instance index list ASI;
s2.2, selecting a target instance vector;
S2.3, if the optimization index list IPI is empty, constructing the global feature vector GFV of the instance training set as the target instance vector of step S2.2; otherwise, GFV is the set of per-metric standard deviations over all instances in the instance training set;
S2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL;
s2.5, calculating the Euclidean distance between each instance in the source instance library SIL to be selected and the GFV, and returning an index min-index corresponding to the minimum Euclidean distance;
s2.6, adding the min-index into the IPI in the optimized index list in the step S2.1;
s2.7, deleting the min-index in the ASI;
S2.8, setting the number of source instances selected by each target instance as k, and cyclically executing steps S2.3 to S2.7 until the length of the optimization index list IPI reaches k;
and S2.9, obtaining the target instance optimization index IPI after the step S2.8 is executed.
4. The optimization instance selection-based cross-project software defect prediction method according to claim 1, wherein the step S3 comprises the steps of:
s3.1, executing each instance in the target project according to the step S2 to obtain a target instance optimization index IPI of each target instance;
s3.2, combining and de-duplicating the optimized indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimized index TIPI;
S3.3, selecting instances from the source project traditional metric vector set SCPIVS of step S1 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the instance vector set TPRED-D of the pre-training set;
S3.4, selecting labels from the source project instance labels SLABEL of step S1 by using the pre-training set optimization index TIPI obtained in step S3.2, to obtain the label set TPRED-L of the pre-training set;
s3.5, constructing a pre-training set TPRED = { TPRED-D, TPRED-L }.
5. The optimization instance selection-based cross-project software defect prediction method according to claim 1, wherein the step S4 comprises the steps of:
s4.1, combining the example vector set TPRED-D of the pre-training set obtained in the step S3 with the label set TPRED-L of the pre-training set according to columns, and placing the label set in the last column;
S4.2, calculating the correlation between each metric and the label column by using the Spearman correlation coefficient to obtain a correlation list CList;
S4.3, taking absolute values of all elements of the correlation list CList in step S4.2, sorting them in descending order, and returning the corresponding feature indexes;
s4.4, setting the number of the selected correlation characteristic indexes as q;
s4.5, selecting the feature index returned in the step S4.3 by using the number q of the relevant feature indexes in the step S4.4, and constructing a source item relevant feature set SPTFS by using the obtained feature index;
s4.6, constructing a target item relevance feature set TPTFS by using the target item traditional metric vector set TCPIVS and the target item instance label TLABEL in the step S1 according to the steps S4.1-S4.5;
s4.7, calculating Euclidean distances of all source examples in the source item relevance feature set SPTFS and one example relevance feature set in the target item relevance feature set TPTFS, returning an index list after the Euclidean distances are sorted from small to large, setting the number of selected examples in the SPTFS as p, and then obtaining p indexes selected by the target examples;
and S4.8, processing all target instances in the target item correlation feature set TPTFS according to the step S4.7 to obtain an optimized index set, and removing the index set to obtain an optimized index TPOI of the target item.
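The two halves of claim 5, Spearman-based feature ranking (S4.2-S4.5) and per-instance Euclidean nearest-neighbour selection (S4.7-S4.8), can be sketched as follows. This is a hypothetical rendering under assumed data; the values of q and p and the function names are illustrative, not taken from the patent.

```python
import numpy as np
from scipy.stats import spearmanr

def select_features(features, labels, q):
    """S4.2-S4.5: rank metric columns by |Spearman rho| with the labels,
    return the indexes of the top-q columns."""
    clist = np.array([spearmanr(features[:, j], labels).correlation
                      for j in range(features.shape[1])])
    # descending order of absolute correlation, keep the first q indexes
    return np.argsort(-np.abs(clist))[:q]

def optimize_indexes(sptfs, tptfs, p):
    """S4.7-S4.8: for each target instance, keep the indexes of its p
    nearest source instances; de-duplicate the union into TPOI."""
    tpoi = set()
    for t in tptfs:
        d = np.linalg.norm(sptfs - t, axis=1)   # Euclidean distances
        tpoi.update(np.argsort(d)[:p].tolist()) # p nearest source indexes
    return np.array(sorted(tpoi))

# Illustrative data: 30 source and 8 target instances with 5 metrics
rng = np.random.default_rng(0)
src = rng.random((30, 5)); src_y = rng.integers(0, 2, 30)
tgt = rng.random((8, 5))

cols = select_features(src, src_y, q=3)          # relevance feature indexes
tpoi = optimize_indexes(src[:, cols], tgt[:, cols], p=4)
print(len(cols))  # → 3
```

Since each of the 8 target instances contributes at most 4 indexes, TPOI here holds at most 32 unique source indexes, usually fewer after de-duplication.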
6. The method of claim 1, wherein step S5 comprises the steps of:
S5.1, selecting instances from the instance vector set TPRED-D of the pre-training set of step S3 using the optimization index TPOI of the target project obtained in step S4 to obtain the training feature set BOD-D selected based on optimization instances;
S5.2, selecting instances from the label set TPRED-L of the pre-training set of step S3 using the optimization index TPOI of the target project obtained in step S4 to obtain the label set BOD-L selected based on optimization instances;
S5.3, constructing the training set BOD = { BOD-D, BOD-L } selected based on optimization instances.
7. The method of claim 1, wherein step S6 comprises the steps of:
S6.1, obtaining the project vector set PVS = { SCPIVS, SLABEL, TCPIVS, TLABEL } through step S1;
S6.2, obtaining the target instance optimization index IPI through step S2;
S6.3, obtaining the instance vector set TPRED-D of the pre-training set and the label set TPRED-L of the pre-training set through step S3;
S6.4, obtaining the optimization index TPOI of the target project through step S4;
S6.5, obtaining the training feature set BOD-D and the label set BOD-L selected based on optimization instances through step S5;
S6.6, training a model on the training feature set BOD-D and the label set BOD-L of step S6.5 using the Logistic classification algorithm;
S6.7, performing defect prediction on the target project traditional metric vector set TCPIVS of step S1 with the model trained in step S6.6 to obtain the prediction label set PRED_LABEL, and computing the f-score from PRED_LABEL and the target project instance labels TLABEL;
S6.8, obtaining the cross-project software defect prediction method BOICP based on optimization instance selection.
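The final training-and-evaluation steps S6.6-S6.7 can be sketched with scikit-learn's logistic regression. This is a minimal illustration under synthetic data, not the patent's implementation; the patent does not specify a library or hyper-parameters, so everything below is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Illustrative stand-ins for the sets named in the claims:
# BOD-D / BOD-L: training features and labels selected in step S5;
# TCPIVS / TLABEL: target project metric vectors and true labels.
rng = np.random.default_rng(1)
bod_d = rng.random((60, 4))
bod_l = (bod_d[:, 0] > 0.5).astype(int)      # synthetic defect labels
tcpivs = rng.random((20, 4))
tlabel = (tcpivs[:, 0] > 0.5).astype(int)

# S6.6: train the Logistic classification model on the selected set
model = LogisticRegression().fit(bod_d, bod_l)

# S6.7: predict labels for the target project and compute the f-score,
# i.e. the harmonic mean 2PR / (P + R) of precision and recall
pred_label = model.predict(tcpivs)
f = f1_score(tlabel, pred_label)
print(pred_label.shape)  # → (20,)
```

Because the target labels TLABEL are only used in S6.7 for scoring, the pipeline remains a genuine cross-project setting: no target label reaches the training step.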
CN202210717428.2A 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection Active CN115269377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210717428.2A CN115269377B (en) 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection


Publications (2)

Publication Number Publication Date
CN115269377A true CN115269377A (en) 2022-11-01
CN115269377B CN115269377B (en) 2023-07-11

Family

ID=83761872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210717428.2A Active CN115269377B (en) 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection

Country Status (1)

Country Link
CN (1) CN115269377B (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0768617A2 (en) * 1995-10-16 1997-04-16 AT&T Corp. An interleaved segmental method of handwriting recognition
US20080263507A1 (en) * 2007-04-17 2008-10-23 Ching-Pao Chang Action-based in-process software defect prediction software defect prediction techniques based on software development activities
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
US20150112903A1 (en) * 2013-02-28 2015-04-23 Huawei Technologies Co., Ltd. Defect prediction method and apparatus
KR101746328B1 (en) * 2016-01-29 2017-06-12 한국과학기술원 Hybrid instance selection method using nearest-neighbor for cross-project defect prediction
CN110008584A (en) * 2019-04-02 2019-07-12 广东石油化工学院 A kind of semi-supervised heterogeneous software failure prediction algorithm based on GitHub
CN111858328A (en) * 2020-07-15 2020-10-30 南通大学 Software defect module severity prediction method based on ordered neural network
CN112346974A (en) * 2020-11-07 2021-02-09 重庆大学 Cross-mobile application program instant defect prediction method based on depth feature embedding
WO2021093140A1 (en) * 2019-11-11 2021-05-20 南京邮电大学 Cross-project software defect prediction method and system thereof
CN113157564A (en) * 2021-03-17 2021-07-23 江苏师范大学 Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection
CN113176998A (en) * 2021-05-10 2021-07-27 南通大学 Cross-project software defect prediction method based on source selection
CN113268434A (en) * 2021-07-08 2021-08-17 北京邮电大学 Software defect prediction method based on Bayesian model and particle swarm optimization
CN113486902A (en) * 2021-06-29 2021-10-08 南京航空航天大学 Three-dimensional point cloud classification algorithm automatic selection method based on meta-learning
CN114117454A (en) * 2021-12-10 2022-03-01 中国电子科技集团公司第十五研究所 Seed optimization method based on vulnerability prediction model
CN114328221A (en) * 2021-12-28 2022-04-12 以萨技术股份有限公司 Cross-project software defect prediction method and system based on feature and instance migration
CN114529751A (en) * 2021-12-28 2022-05-24 国网四川省电力公司眉山供电公司 Automatic screening method for intelligent identification sample data of power scene
CN114564410A (en) * 2022-03-21 2022-05-31 南通大学 Software defect prediction method based on class level source code similarity
CN114565063A (en) * 2022-03-31 2022-05-31 南通大学 Software defect prediction method based on multi-semantic extractor


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHAO NI et al.: "Revisiting Supervised and Unsupervised Methods for Effort-Aware Cross-Project Defect Prediction", IEEE Transactions on Software Engineering *
PENG HE et al.: "Simplification of Training Data for Cross-Project Defect Prediction", https://doi.org/10.48550/arXiv.1405.0773 *
LI Yong; LIU Zhandong; ZHANG Haijun: "A Survey of Cross-Project Software Defect Prediction Methods", Computer Technology and Development, no. 03
MAO Fagui; LI Biwen; SHEN Beijun: "Cross-Project Software Defect Prediction Based on Instance Transfer", Journal of Frontiers of Computer Science and Technology, no. 01
WANG Xing; HE Peng; CHEN Dan; ZENG Cheng: "Training Data Selection Methods in Cross-Project Defect Prediction", Journal of Computer Applications, no. 11
CHEN Xiang; SHEN Yuxiang; MENG Shaoqing; CUI Zhanqi; JU Xiaolin; WANG Zan: "A Feature Selection Method for Software Defect Prediction Based on Multi-Objective Optimization", Journal of Frontiers of Computer Science and Technology, no. 09
CHEN Xiang; WANG Liping; GU Qing; WANG Zan; NI Chao; LIU Wangshu; WANG Qiuping: "A Survey of Cross-Project Software Defect Prediction Methods", Chinese Journal of Computers, no. 01
CHEN Xiang et al.: "Research on Static Software Defect Prediction Methods", Journal of Software *

Also Published As

Publication number Publication date
CN115269377B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN111353030A (en) Knowledge question and answer retrieval method and device based on travel field knowledge graph
CN109117440B (en) Metadata information acquisition method, system and computer readable storage medium
CN102073708A (en) Large-scale uncertain graph database-oriented subgraph query method
US7822700B2 (en) Method for using lengths of data paths in assessing the morphological similarity of sets of data by using equivalence signatures
US8037057B2 (en) Multi-column statistics usage within index selection tools
CN112434024B (en) Relational database-oriented data dictionary generation method, device, equipment and medium
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
CN114491082A (en) Plan matching method based on network security emergency response knowledge graph feature extraction
CN111813744A (en) File searching method, device, equipment and storage medium
CN109656712B (en) Method and system for extracting GRIB code data
CN114900346A (en) Network security testing method and system based on knowledge graph
CN117033534A (en) Geographic information processing method, device, computer equipment and storage medium
CN115269377A (en) Cross-project software defect prediction method based on optimization instance selection
CN107122412A A rapid matching search method for massive telephone numbers
CN116401212A (en) Personnel file quick searching system based on data analysis
Kitzes et al. macroeco: reproducible ecological pattern analysis in Python
CN115617689A (en) Software defect positioning method based on CNN model and domain features
CN113204676B (en) Compression storage method based on graph structure data
CN112651026B (en) Application version mining method and device with service safety problem
US11775757B2 (en) Automated machine-learning dataset preparation
CN115269378B (en) Cross-project software defect prediction method based on domain feature distribution
Fan et al. New strategy of mass spectrum simulation based on reduced and concentrated knowledge databases
CN117193889B (en) Construction method of code example library and use method of code example library
CN114860595A (en) Instance selection cross-project defect prediction method based on feature correlation analysis
CN115640577B (en) Vulnerability detection method and system for binary Internet of things firmware program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant