CN115269377B - Cross-project software defect prediction method based on optimization instance selection - Google Patents


Info

Publication number
CN115269377B
CN115269377B (application CN202210717428.2A)
Authority
CN
China
Prior art keywords
instance
constructing
optimization
index
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210717428.2A
Other languages
Chinese (zh)
Other versions
CN115269377A (en)
Inventor
张瑞年
王楚越
王晨宇
尹思文
王超
郭伟琪
文万志
胡彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202210717428.2A priority Critical patent/CN115269377B/en
Publication of CN115269377A publication Critical patent/CN115269377A/en
Application granted granted Critical
Publication of CN115269377B publication Critical patent/CN115269377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/36 Preventing errors by testing or debugging software
    • G06F 11/3668 Software testing
    • G06F 11/3672 Test management
    • G06F 11/3684 Test management for test design, e.g. generating new test cases
    • G06F 11/3688 Test management for test execution, e.g. scheduling of test suites

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a cross-project software defect prediction method based on optimization instance selection, comprising the following steps: S1, constructing a project vector set PVS; S2, constructing a target instance optimization index IPI; S3, constructing a pre-training set TPRED; S4, constructing an optimization index TPOI of the target project; S5, constructing a training set BOD selected based on the optimization instances; S6, constructing the cross-project software defect prediction method BOICP based on optimization instance selection. The method realizes source instance selection by constructing a global feature vector for each target instance and then further refines the selection by correlation analysis; the training set constructed in this way selects reliable instance data and achieves a better cross-project defect prediction effect.

Description

Cross-project software defect prediction method based on optimization instance selection
Technical Field
The invention belongs to the technical field of software defect prediction, and particularly relates to a cross-project software defect prediction method based on optimization instance selection, which optimizes the selection of source instances for each target instance vector in the target project and thereby further improves cross-project defect prediction results.
Background
Researchers implement software defect prediction with the help of historical data; for a new system, however, sufficient historical data are often unavailable. One way to solve this problem is to select historical data from other projects, build a defect prediction model from those data, and use it to predict defects in the new project.
For a project with a large amount of data, the researcher must consider how to select the instance data best suited to the target project: the more consistent the source instance data are with the target project data, the more accurate the resulting defect prediction model.
Disclosure of Invention
The invention aims to provide a cross-project software defect prediction method based on optimization instance selection, which realizes source instance selection by constructing a global feature vector for each target instance during the instance selection process and further refines the selection by correlation analysis, thereby achieving a better cross-project defect prediction effect.
In order to solve the above technical problems, an embodiment of the present invention provides a cross-project software defect prediction method based on optimization instance selection, including the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
s3, constructing a pre-training set TPRED;
s4, constructing an optimization index TPOI of the target item;
s5, constructing a training set BOD selected based on the optimization examples;
s6, constructing a cross-project software defect prediction method BOICP based on optimization instance selection.
Wherein, step S1 includes the following steps:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
s1.3, constructing the traditional metric set { WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBW, AMC, ca, ce, max_CC, avg_CC, LOC } based on the open-source data history, the project source-code syntax structure and the source-code abstract syntax tree, wherein WMC denotes the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBW the coupling between methods, AMC the average method complexity, ca the afferent (incoming) coupling, ce the efferent (outgoing) coupling, max_CC the maximum McCabe cyclomatic complexity, avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code;
s1.4, processing all the instances in the source project according to step S1.3 to obtain the source project traditional metric vector set SCPIVS = [instance_1, instance_2, …, instance_i], where i = 1, 2, 3, …, n;
s1.5, processing all the instances in the target project according to step S1.3 to obtain the target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m;
s1.6, constructing the source project instance labels SLABEL = [stag_1, stag_2, …, stag_i], where i = 1, 2, 3, …, n, based on the open-source data history; each label corresponds to an instance in the source project traditional metric vector set SCPIVS of step S1.4;
s1.7, constructing the target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j], where j = 1, 2, 3, …, m, based on the open-source data history; each label corresponds to an instance in the target project traditional metric vector set TCPIVS of step S1.5;
s1.8, build item vector set pvs= { SCPIVS, SLABEL, TCPIVS, TLABEL }.
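As an illustration, the assembly of step S1.8 can be sketched in Python; the helper name build_pvs and the in-memory data layout (metric matrices and 0/1 defect labels already parsed from the open-source data) are our assumptions, not part of the patent:

```python
import numpy as np

def build_pvs(scpivs, slabel, tcpivs, tlabel):
    """Bundle the source/target metric vector sets and their labels
    into the project vector set PVS of step S1.8 (sketch)."""
    scpivs = np.asarray(scpivs, dtype=float)   # n source instances x 20 metrics
    tcpivs = np.asarray(tcpivs, dtype=float)   # m target instances x 20 metrics
    slabel = np.asarray(slabel)                # n defect labels (0/1)
    tlabel = np.asarray(tlabel)                # m defect labels (0/1)
    assert len(scpivs) == len(slabel) and len(tcpivs) == len(tlabel)
    return {"SCPIVS": scpivs, "SLABEL": slabel,
            "TCPIVS": tcpivs, "TLABEL": tlabel}
```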
Wherein, step S2 includes the following steps:
s2.1, constructing a target instance optimization index IPI and a source instance index list ASI;
s2.2, selecting a target instance vector;
s2.3, if the IPI list is empty, taking the target instance vector of step S2.2 as the global feature vector GFV of the instance training set; otherwise, taking GFV as the vector of per-metric standard deviations of all instances in the instance training set;
s2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL;
s2.5, calculating Euclidean distance between each instance in the SIL of the source instance library to be selected and the GFV, and returning an index min-index corresponding to the minimum Euclidean distance;
s2.6, adding the min-index into the target instance optimization index IPI of the step S2.1;
s2.7, deleting the min-index in the source instance index list ASI;
s2.8, setting the number of source instances selected for each target instance to k, and executing steps S2.3-S2.7 in a loop until the length of the target instance optimization index IPI reaches k;
s2.9, after the step S2.8 is executed, the target instance optimization index IPI is obtained.
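The selection loop of steps S2.1-S2.9 can be sketched as follows, under one plausible reading of step S2.3 (first iteration: GFV is the target instance vector itself; afterwards: the per-metric standard deviation of the instances selected so far). The function name build_ipi is ours:

```python
import numpy as np

def build_ipi(target_vec, source_matrix, k):
    """Select k source-instance indices for one target instance (step S2)."""
    asi = list(range(len(source_matrix)))  # source instance index list (S2.1)
    ipi = []                               # target instance optimization index
    while len(ipi) < k and asi:
        if not ipi:
            gfv = target_vec                           # S2.3, first pass
        else:
            gfv = np.std(source_matrix[ipi], axis=0)   # S2.3, later passes
        # Euclidean distance from GFV to every remaining source instance (S2.5)
        dists = [np.linalg.norm(source_matrix[i] - gfv) for i in asi]
        min_index = asi[int(np.argmin(dists))]
        ipi.append(min_index)   # S2.6
        asi.remove(min_index)   # S2.7
    return ipi
```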
Wherein, step S3 includes the following steps:
s3.1, executing each instance in the target item according to the step S2 to obtain a target instance optimization index IPI of each target instance;
s3.2, combining and de-duplicating the optimization indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimization index TIPI;
s3.3, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project traditional metric vector set SCPIVS of step S1, obtaining the instance vector set TPRED-D of the pre-training set;
s3.4, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project instance labels SLABEL of step S1, obtaining the label set TPRED-L of the pre-training set;
s3.5, constructing a pre-training set TPRED= { TPRED-D, TPRED-L }.
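Steps S3.1-S3.5 reduce to merging the per-target index lists, de-duplicating them into TIPI, and slicing the source data; a minimal sketch (the helper name build_tpred is ours):

```python
import numpy as np

def build_tpred(scpivs, slabel, ipi_lists):
    """Steps S3.2-S3.5: merge the per-target optimization indexes,
    de-duplicate them into TIPI, and slice out the pre-training set."""
    seen, tipi = set(), []
    for ipi in ipi_lists:            # one IPI list per target instance (S3.1)
        for idx in ipi:
            if idx not in seen:      # de-duplicate, preserving first-seen order
                seen.add(idx)
                tipi.append(idx)
    tpred_d = np.asarray(scpivs)[tipi]   # instance vectors TPRED-D (S3.3)
    tpred_l = np.asarray(slabel)[tipi]   # labels TPRED-L (S3.4)
    return {"TPRED-D": tpred_d, "TPRED-L": tpred_l}, tipi
```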
Wherein, step S4 includes the following steps:
s4.1, combining the instance vector set TPRED-D of the pre-training set obtained in the step S3 with the label set TPRED-L of the pre-training set according to columns, and placing the label set in the last column;
s4.2, calculating the Spearman correlation between each metric and the label column to obtain the correlation list CList;
s4.3, taking the absolute value of every element of the correlation list CList of step S4.2, sorting in descending order, and returning the corresponding feature indexes;
s4.4, setting the number of the indexes of the selected correlation characteristic as q;
s4.5, selecting the feature indexes returned in the step S4.3 by using the number q of the correlation feature indexes in the step S4.4, and constructing a source item correlation feature set SPTFS by using the obtained feature indexes;
s4.6, constructing a target item correlation feature set TPTFS by using the target item traditional metric element vector set TCPIVS and the target item instance label TLABEL in the step S1 according to the steps S4.1-S4.5;
s4.7, calculating the Euclidean distance between every source instance in the source project correlation feature set SPTFS and the correlation feature vector of one instance in the target project correlation feature set TPTFS, returning the index list sorted in ascending order, setting the number of instances selected from the SPTFS to p, and obtaining the p indexes selected for that target instance;
s4.8, processing all target instances in the target item correlation feature set TPTFS according to the step S4.7 to obtain an optimized index set, and de-duplicating the index set to obtain an optimized index TPOI of the target item.
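Steps S4.2-S4.8 can be sketched with a NumPy-only Spearman correlation (average ranks for ties) and a nearest-neighbour union; the helper names are ours and the exact tie-breaking is an assumption:

```python
import numpy as np

def _rank(x):
    # Average ranks (ties share their mean rank), as Spearman requires.
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    sx = np.asarray(x)[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors.
    rx, ry = _rank(x), _rank(y)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def top_q_features(tpred_d, tpred_l, q):
    # S4.2-S4.5: rank metrics by |Spearman rho| with the label column.
    rhos = [abs(spearman(tpred_d[:, j], tpred_l))
            for j in range(tpred_d.shape[1])]
    return list(np.argsort(rhos)[::-1][:q])

def build_tpoi(sptfs, tptfs, p):
    # S4.7-S4.8: union of each target instance's p nearest source instances.
    tpoi = set()
    for t in np.asarray(tptfs):
        d = np.linalg.norm(np.asarray(sptfs) - t, axis=1)
        tpoi.update(int(i) for i in np.argsort(d, kind="stable")[:p])
    return sorted(tpoi)
```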
Wherein, step S5 includes the following steps:
s5.1, selecting an instance vector set TPRED-D of the pre-training set in the step S3 by using the optimization index TPOI of the target item obtained in the step S4 to obtain a training feature set BOD-D selected based on an optimization instance;
s5.2, selecting a tag set TPRED-L of the pre-training set in the step S3 by using the optimization index TPOI of the target item obtained in the step S4 to obtain a tag set BOD-L selected based on an optimization example;
s5.3, constructing a training set BOD= { BOD-D, BOD-L } selected based on the optimization example.
Wherein, step S6 includes the following steps:
s6.1, obtaining a project vector set PVS= { SCPIVS, SLABEL, TCPIVS, TLABEL }, through the step S1;
s6.2, obtaining a target instance optimization index IPI through the step S2;
s6.3, obtaining an instance vector set TPRED-D of the pre-training set and a label set TPRED-L of the pre-training set through the step S3;
s6.4, obtaining an optimized index TPOI of the target item through the step S4;
s6.5, obtaining a training feature set BOD-D selected based on the optimization examples and a label set BOD-L selected based on the optimization examples through the step S5;
s6.6, performing model training on the training feature set BOD-D selected based on the optimization examples and the label set BOD-L selected based on the optimization examples in the step S6.5 by using a Logistic classification algorithm;
s6.7, performing defect prediction on the target project traditional metric vector set TCPIVS of step S1 with the model trained in step S6.6 to obtain the prediction label set PRED_LABEL, and computing the f-score from PRED_LABEL and the target project instance labels TLABEL;
s6.8, obtaining a cross-project software defect prediction method BOICP selected based on the optimization examples.
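The f-score of step S6.7 is the standard harmonic mean of precision and recall on the defect-prone class; step S6.6 itself can use any logistic-regression implementation (e.g. scikit-learn's LogisticRegression). A minimal sketch of the f-score computation (the function name is ours):

```python
import numpy as np

def f_score(tlabel, pred_label):
    """Step S6.7: f-score of PRED_LABEL against TLABEL,
    with the defect-prone class encoded as 1."""
    t, p = np.asarray(tlabel), np.asarray(pred_label)
    tp = int(np.sum((p == 1) & (t == 1)))   # true positives
    fp = int(np.sum((p == 1) & (t == 0)))   # false positives
    fn = int(np.sum((p == 0) & (t == 1)))   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```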
The technical scheme of the invention has the following beneficial effects:
the invention provides a cross-project software defect prediction method based on optimization example selection, which comprises the steps of firstly constructing a global feature vector for each target example, using the vector to select examples from source projects, then using correlation analysis in a selected training set, using correlation characteristics of the examples to further select the source examples, forming a training data set by using all the selected source examples, and using the training data set to establish a cross-project defect prediction model, thereby being beneficial to realizing a better cross-project defect prediction effect.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the training set BOD selected based on the optimization examples in the present invention;
FIG. 3 is a chart of the number of selected instances at different values of k in the present invention;
FIG. 4 is a graph of the f-score obtained with the Logistic classifier at different values of k in the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantages more apparent, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the invention provides a cross-project software defect prediction method based on optimization instance selection, which comprises the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
s3, constructing a pre-training set TPRED;
s4, constructing an optimization index TPOI of the target item;
s5, constructing a training set BOD selected based on the optimization examples;
s6, constructing a cross-project software defect prediction method BOICP based on optimization instance selection.
The specific steps of constructing the project vector set PVS are as follows:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
s1.3, constructing the feature set { WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBW, AMC, ca, ce, max_CC, avg_CC, LOC } based on the open-source data history, the project source-code syntax structure and the source-code abstract syntax tree, wherein WMC denotes the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBW the coupling between methods, AMC the average method complexity, ca the afferent (incoming) coupling, ce the efferent (outgoing) coupling, max_CC the maximum McCabe cyclomatic complexity, avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code.
S1.4, processing all the instances in the source project according to the steps above to obtain the source project traditional metric vector set SCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_i], where i = 1, 2, 3, …, n;
s1.5, processing all the instances in the target project in the same way to obtain the target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m.
S1.6, constructing the source project instance labels SLABEL = [stag_1, stag_2, …, stag_i], where i = 1, 2, 3, …, n, based on the open-source data history; each label corresponds to an instance in the source project traditional metric vector set SCPIVS;
s1.7, constructing the target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j], where j = 1, 2, 3, …, m, based on the open-source data history; each label corresponds to an instance in the target project traditional metric vector set TCPIVS.
S1.8, build item vector set pvs= { SCPIVS, SLABEL, TCPIVS, TLABEL }.
The specific steps of constructing the target instance optimization index IPI are as follows:
s2.1, constructing a target instance optimization index IPI and a source instance index list ASI.
S2.2, selecting a target instance vector;
S2.3, if the IPI list is empty, taking the target instance vector of step S2.2 as the global feature vector GFV of the instance training set; otherwise, taking GFV as the vector of per-metric standard deviations of all instances in the instance training set.
S2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL.
S2.5, calculating Euclidean distance between each instance in the SIL of the source instance library to be selected and the GFV, and returning an index min-index corresponding to the minimum Euclidean distance.
S2.6, adding the min-index into the target instance optimization index IPI of the step S2.1;
s2.7, deleting the min-index in the source instance index list ASI.
S2.8, setting the number of source instances selected for each target instance to 5, and executing steps S2.3-S2.7 in a loop until the length of the target instance optimization index IPI reaches 5;
s2.9, after the step S2.8 is executed, the target instance optimization index IPI is obtained.
The specific steps of constructing the pre-training set TPRED are as follows:
s3.1, executing each instance in the target item according to the steps S2.1-S2.9 to obtain a target instance optimization index IPI of each target instance;
and S3.2, combining and de-duplicating the optimization indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimization index TIPI.
S3.3, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project traditional metric vector set SCPIVS of step S1.8, obtaining the instance vector set TPRED-D of the pre-training set;
S3.4, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project instance labels SLABEL of step S1.6, obtaining the label set TPRED-L of the pre-training set.
S3.5, constructing a pre-training set TPRED= { TPRED-D, TPRED-L }.
The specific steps of constructing the optimization index TPOI of the target item are as follows:
s4.1, combining the instance vector set TPRED-D of the pre-training set obtained in the step S3.5 with the label set TPRED-L of the pre-training set according to columns, and placing the label set in the last column;
S4.2, calculating the Spearman correlation between each metric and the label column to obtain the correlation list CList.
S4.3, taking the absolute value of every element of the correlation list CList of step S4.2, sorting in descending order, and returning the corresponding feature indexes;
s4.4, setting the number of selected correlation feature indexes to 10;
and S4.5, selecting the feature indexes returned in the step S4.3 by using the number of the correlation feature indexes in the step S4.4, and constructing a source item correlation feature set SPTFS by using the obtained feature indexes.
S4.6, constructing a target item correlation feature set TPTFS by using the target item traditional metric element vector set TCPIVS of the step S1.5 and the target item instance label TLABEL of the step S1.7 according to the steps S4.1-S4.5.
S4.7, calculating the Euclidean distance between every source instance in the source project correlation feature set SPTFS and the correlation feature vector of one instance in the target project correlation feature set TPTFS, returning the index list sorted in ascending order, setting the number of instances selected from the SPTFS to 2, and obtaining the 2 indexes selected for that target instance.
S4.8, processing all target instances in the target item correlation feature set TPTFS according to the step S4.7 to obtain an optimized index set, and de-duplicating the index set to obtain an optimized index TPOI of the target item.
Step S5, the specific steps of constructing a training set BOD selected based on the optimization examples are as follows:
s5.1, selecting an instance vector set TPRED-D of the pre-training set in the step S3.3 by using the optimization index TPOI of the target item obtained in the step S4 to obtain a training feature set BOD-D selected based on the optimization instance.
S5.2, selecting the tag set TPRED-L of the pre-training set in the step S3.3 by using the optimization index TPOI of the target item obtained in the step S4, and obtaining the tag set BOD-L selected based on the optimization example.
S5.3, constructing a training set BOD= { BOD-D, BOD-L } selected based on the optimization example.
A flowchart for constructing the training set BOD selected based on the optimization instance is shown in fig. 2.
Step S6, constructing a cross-project software defect prediction method BOICP selected based on an optimization example, wherein the specific steps are as follows:
ivy-2.0 is selected as the source project and synapse-1.2 as the target project. The source project traditional metric vector set SCPIVS and the source project instance labels SLABEL are constructed from the source project instances, and the target project traditional metric vector set TCPIVS and the target project instance labels TLABEL from the target project instances.
The target instance optimization index IPI when the number of different source instances is selected is obtained according to the defined method for constructing the target instance optimization index.
And obtaining an instance vector set TPRED-D of the pre-training set and a label set TPRED-L of the pre-training set according to the method for constructing the pre-training set.
And obtaining the optimized index TPOI of the target item according to the optimized index method for constructing the target item defined above.
The number of source instances selected per target instance, k, is set in the range 1 to 5, and for each k a training set BOD = { BOD-D, BOD-L } selected based on the optimization instances is obtained.
A classification model is built on the training set BOD with a Logistic classifier and used for prediction. Experiments show that the maximum f-score obtained by the model is 0.343, which is greater than the 0.149 obtained without the instance selection method; the model built by the invention therefore outperforms the model built without instance selection, demonstrating the effectiveness of the cross-project software defect prediction method based on optimization instance selection.
The number of examples selected for different k is shown in figure 3.
The f-score obtained using Logistic at different k is shown in FIG. 4.
According to the method, a global feature vector is first constructed for each target instance and used to select instances from the source project; correlation analysis is then applied to the selected training set, and the correlation features of the instances are used to further select source instances. All selected source instances form the training data set, from which the cross-project defect prediction model is built, achieving a better cross-project defect prediction effect.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (1)

1. The cross-project software defect prediction method based on the optimization example selection is characterized by comprising the following steps:
s1, constructing a project vector set PVS;
s2, constructing a target instance optimization index IPI;
s3, constructing a pre-training set TPRED;
s4, constructing an optimization index TPOI of the target item;
s5, constructing a training set BOD selected based on the optimization examples;
s6, constructing a cross-project software defect prediction method BOICP based on optimization instance selection;
step S1 comprises the steps of:
s1.1, acquiring a software project set based on an open source website;
s1.2, constructing a project instance set by taking a project class as an instance;
s1.3, constructing the traditional metric set { WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBW, AMC, ca, ce, max_CC, avg_CC, LOC } based on the open-source data history, the project source-code syntax structure and the source-code abstract syntax tree, wherein WMC denotes the weighted methods per class, DIT the depth of the inheritance tree, NOC the number of children, CBO the coupling between object classes, RFC the response for a class, LCOM and LCOM3 the lack of cohesion in methods, NPM the number of public methods, DAM the data access metric, MOA the measure of aggregation, MFA the measure of functional abstraction, CAM the cohesion among methods of a class, IC the inheritance coupling, CBW the coupling between methods, AMC the average method complexity, ca the afferent (incoming) coupling, ce the efferent (outgoing) coupling, max_CC the maximum McCabe cyclomatic complexity, avg_CC the average McCabe cyclomatic complexity, and LOC the number of lines of code;
s1.4, processing all the instances in the source project according to step S1.3 to obtain the source project traditional metric vector set SCPIVS = [instance_1, instance_2, …, instance_i], where i = 1, 2, 3, …, n;
s1.5, processing all the instances in the target project according to step S1.3 to obtain the target project traditional metric vector set TCPIVS = [tradition_value_1, tradition_value_2, …, tradition_value_j], where j = 1, 2, 3, …, m;
s1.6, constructing the source project instance labels SLABEL = [stag_1, stag_2, …, stag_i], where i = 1, 2, 3, …, n, based on the open-source data history; each label corresponds to an instance in the source project traditional metric vector set SCPIVS of step S1.4;
s1.7, constructing the target project instance labels TLABEL = [ttag_1, ttag_2, …, ttag_j], where j = 1, 2, 3, …, m, based on the open-source data history; each label corresponds to an instance in the target project traditional metric vector set TCPIVS of step S1.5;
s1.8, constructing a project vector set PVS= { SCPIVS, SLABEL, TCPIVS, TLABEL };
step S2 includes the steps of:
s2.1, constructing a target instance optimization index IPI and a source instance index list ASI;
s2.2, selecting a target instance vector;
s2.3, if the IPI list is empty, taking the target instance vector of step S2.2 as the global feature vector GFV of the instance training set; otherwise, taking GFV as the vector of per-metric standard deviations of all instances in the instance training set;
s2.4, using the source instance index list ASI of step S2.1 to select instances from the source project traditional metric vector set SCPIVS of step S1, constructing the candidate source instance library SIL;
s2.5, calculating Euclidean distance between each instance in the SIL of the source instance library to be selected and the GFV, and returning an index min-index corresponding to the minimum Euclidean distance;
s2.6, adding the min-index into the target instance optimization index IPI of the step S2.1;
s2.7, deleting the min-index in the source instance index list ASI;
s2.8, setting the number of source instances selected for each target instance to k, and executing steps S2.3-S2.7 in a loop until the length of the target instance optimization index IPI reaches k;
s2.9, after the step S2.8 is executed, obtaining a target instance optimization index IPI;
step S3 includes the steps of:
s3.1, executing each instance in the target item according to the step S2 to obtain a target instance optimization index IPI of each target instance;
s3.2, combining and de-duplicating the optimization indexes obtained by each target instance in the step S3.1, and constructing a pre-training set optimization index TIPI;
s3.3, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project traditional metric vector set SCPIVS of step S1, obtaining the instance vector set TPRED-D of the pre-training set;
s3.4, using the pre-training set optimization index TIPI obtained in step S3.2 to select instances from the source project instance labels SLABEL of step S1, obtaining the label set TPRED-L of the pre-training set;
s3.5, constructing a pre-training set TPRED= { TPRED-D, TPRED-L };
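Steps S3.2-S3.5 reduce to a union of index lists followed by two slicing operations. A minimal NumPy sketch (identifier names are illustrative, not from the patent):

```python
import numpy as np

def build_pretrain_set(scpivs, slabel, per_target_ipis):
    """S3.2-S3.5: merge and deduplicate the per-target optimization
    indexes TIPI, then slice source vectors and labels with them."""
    tipi = sorted(set(i for ipi in per_target_ipis for i in ipi))  # S3.2
    tpred_d = scpivs[tipi]   # S3.3: pre-training instance vectors
    tpred_l = slabel[tipi]   # S3.4: pre-training labels
    return tpred_d, tpred_l  # S3.5: TPRED = {TPRED-D, TPRED-L}
```

Deduplication matters because nearby target instances tend to select overlapping source instances; the union keeps each source instance at most once in the pre-training set.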
Step S4 includes the following steps:
S4.1, concatenating the instance vector set TPRED-D of the pre-training set obtained in the step S3 with the label set TPRED-L of the pre-training set column-wise, placing the label set in the last column;
S4.2, calculating the Spearman rank correlation between each metric element and the last (label) column to obtain the correlation list CList;
S4.3, taking the absolute value of every element of the correlation list CList of the step S4.2, sorting in descending order, and returning the corresponding feature indexes;
S4.4, setting the number of selected correlation feature indexes to q;
S4.5, taking the top q of the feature indexes returned in the step S4.3, and using the obtained feature indexes to construct the source project correlation feature set SPTFS;
S4.6, constructing the target project correlation feature set TPTFS from the target project traditional metric element vector set TCPIVS and the target project instance label TLABEL of the step S1 according to the steps S4.1-S4.5;
S4.7, calculating the Euclidean distances between all source instances in the source project correlation feature set SPTFS and the correlation features of one instance in the target project correlation feature set TPTFS, sorting in ascending order and returning the index list; setting the number of instances selected from SPTFS to p, obtaining the p indexes selected for that target instance;
S4.8, processing all target instances in the target project correlation feature set TPTFS according to the step S4.7 to obtain an optimization index set, and deduplicating the index set to obtain the optimization index TPOI of the target project;
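The two halves of step S4 can be sketched as follows: Spearman-based feature ranking (S4.2-S4.5) and per-target nearest-instance selection (S4.7-S4.8). This is an illustrative sketch only; Spearman correlation is computed here as Pearson correlation on ranks, which assumes untied feature values, and all names are hypothetical.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson on ranks (no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def top_q_features(tpred_d, tpred_l, q):
    """S4.2-S4.5: correlate each metric column with the label column,
    sort by absolute correlation, keep the top-q feature indexes."""
    clist = [spearman(tpred_d[:, j], tpred_l) for j in range(tpred_d.shape[1])]
    order = np.argsort(-np.abs(np.array(clist)))   # S4.3: descending |r|
    return order[:q].tolist()                      # S4.4-S4.5

def target_optimized_indexes(sptfs, tptfs, p):
    """S4.7-S4.8: for each target instance, keep the p nearest source
    instances (Euclidean) in the correlation feature space; union + dedup."""
    tpoi = set()
    for t in tptfs:
        d = np.linalg.norm(sptfs - t, axis=1)
        tpoi.update(np.argsort(d)[:p].tolist())
    return sorted(tpoi)
```

Taking the absolute value in S4.3 keeps strongly negatively correlated metrics, which are as predictive of the label as positively correlated ones.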
Step S5 includes the following steps:
S5.1, using the optimization index TPOI of the target project obtained in the step S4 to select from the instance vector set TPRED-D of the pre-training set of the step S3, obtaining the training feature set BOD-D selected based on optimized instances;
S5.2, using the optimization index TPOI of the target project obtained in the step S4 to select from the label set TPRED-L of the pre-training set of the step S3, obtaining the label set BOD-L selected based on optimized instances;
S5.3, constructing the training set BOD = {BOD-D, BOD-L} selected based on optimized instances;
Step S6 includes the following steps:
S6.1, obtaining the project vector set PVS = {SCPIVS, SLABEL, TCPIVS, TLABEL} through the step S1;
S6.2, obtaining the target instance optimization index IPI through the step S2;
S6.3, obtaining the instance vector set TPRED-D of the pre-training set and the label set TPRED-L of the pre-training set through the step S3;
S6.4, obtaining the optimization index TPOI of the target project through the step S4;
S6.5, obtaining the training feature set BOD-D selected based on optimized instances and the label set BOD-L selected based on optimized instances through the step S5;
S6.6, performing model training on the training feature set BOD-D and the label set BOD-L of the step S6.5 using the Logistic classification algorithm;
S6.7, performing defect prediction on the target project traditional metric element vector set TCPIVS of the step S1 using the model trained in the step S6.6 to obtain the prediction label set PRED_LABEL, and calculating the f-score by formula in combination with the target project instance label TLABEL;
S6.8, obtaining the cross-project software defect prediction method BOICP based on optimized instance selection.
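Steps S6.6-S6.7 amount to fitting a Logistic classifier and scoring its target-project predictions. The patent does not state the f-score formula; the sketch below assumes the conventional F1 = 2·precision·recall / (precision + recall), and uses scikit-learn as one possible implementation (names and parameters are illustrative, not from the patent).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def train_and_score(bod_d, bod_l, tcpivs, tlabel):
    """S6.6-S6.7: train a Logistic classifier on the training set BOD
    selected based on optimized instances, predict target-project
    labels PRED_LABEL, and compute the f-score against TLABEL."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(bod_d, bod_l)                    # S6.6: model training
    pred_label = clf.predict(tcpivs)         # S6.7: PRED_LABEL
    return pred_label, f1_score(tlabel, pred_label)
```

F1 is preferred over plain accuracy here because defect data sets are typically imbalanced: a classifier that predicts "no defect" everywhere can score high accuracy but zero F1.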
CN202210717428.2A 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection Active CN115269377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210717428.2A CN115269377B (en) 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210717428.2A CN115269377B (en) 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection

Publications (2)

Publication Number Publication Date
CN115269377A CN115269377A (en) 2022-11-01
CN115269377B true CN115269377B (en) 2023-07-11

Family

ID=83761872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210717428.2A Active CN115269377B (en) 2022-06-23 2022-06-23 Cross-project software defect prediction method based on optimization instance selection

Country Status (1)

Country Link
CN (1) CN115269377B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328221A (en) * 2021-12-28 2022-04-12 以萨技术股份有限公司 Cross-project software defect prediction method and system based on feature and instance migration
CN114565063A (en) * 2022-03-31 2022-05-31 南通大学 Software defect prediction method based on multi-semantic extractor

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09128490A (en) * 1995-10-16 1997-05-16 Lucent Technol Inc Method and apparatus for recognition of handwritten character
US7856616B2 (en) * 2007-04-17 2010-12-21 National Defense University Action-based in-process software defect prediction software defect prediction techniques based on software development activities
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
CN104021264B (en) * 2013-02-28 2017-06-20 华为技术有限公司 A kind of failure prediction method and device
KR101746328B1 (en) * 2016-01-29 2017-06-12 한국과학기술원 Hybrid instance selection method using nearest-neighbor for cross-project defect prediction
CN110008584B (en) * 2019-04-02 2020-11-06 广东石油化工学院 GitHub-based semi-supervised heterogeneous software defect prediction method
CN110825644B (en) * 2019-11-11 2021-06-11 南京邮电大学 Cross-project software defect prediction method and system
CN111858328B (en) * 2020-07-15 2021-11-12 南通大学 Software defect module severity prediction method based on ordered neural network
CN112346974B (en) * 2020-11-07 2023-08-22 重庆大学 Depth feature embedding-based cross-mobile application program instant defect prediction method
CN113157564B (en) * 2021-03-17 2023-11-07 江苏师范大学 Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection
CN113176998A (en) * 2021-05-10 2021-07-27 南通大学 Cross-project software defect prediction method based on source selection
CN113486902A (en) * 2021-06-29 2021-10-08 南京航空航天大学 Three-dimensional point cloud classification algorithm automatic selection method based on meta-learning
CN113268434B (en) * 2021-07-08 2022-07-26 北京邮电大学 Software defect prediction method based on Bayes model and particle swarm optimization
CN114117454A (en) * 2021-12-10 2022-03-01 中国电子科技集团公司第十五研究所 Seed optimization method based on vulnerability prediction model
CN114529751B (en) * 2021-12-28 2024-06-21 国网四川省电力公司眉山供电公司 Automatic screening method for intelligent identification sample data of power scene
CN114564410A (en) * 2022-03-21 2022-05-31 南通大学 Software defect prediction method based on class level source code similarity


Also Published As

Publication number Publication date
CN115269377A (en) 2022-11-01

Similar Documents

Publication Publication Date Title
CN102243647B (en) Higher-order knowledge is extracted from structural data
CN110110858B (en) Automatic machine learning method based on reinforcement learning
CN110457514A (en) A kind of multi-tag image search method based on depth Hash
CN111325264A (en) Multi-label data classification method based on entropy
CN102073708A (en) Large-scale uncertain graph database-oriented subgraph query method
CN107832458A (en) A kind of file classification method based on depth of nesting network of character level
CN117609470B (en) Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform
JP2022530447A (en) Chinese word division method based on deep learning, equipment, storage media and computer equipment
CN109753517A (en) A kind of method, apparatus, computer storage medium and the terminal of information inquiry
CN114491082A (en) Plan matching method based on network security emergency response knowledge graph feature extraction
CN115310355A (en) Multi-energy coupling-considered multi-load prediction method and system for comprehensive energy system
EP4363997A1 (en) Using query logs to optimize execution of parametric queries
CN113076089B (en) API (application program interface) completion method based on object type
CN115269377B (en) Cross-project software defect prediction method based on optimization instance selection
CN111753151B (en) Service recommendation method based on Internet user behavior
CN117198427A (en) Molecule generation method and device, electronic equipment and storage medium
CN116861373A (en) Query selectivity estimation method, system, terminal equipment and storage medium
CN115269378B (en) Cross-project software defect prediction method based on domain feature distribution
CN114565063A (en) Software defect prediction method based on multi-semantic extractor
CN110309273A (en) Answering method and device
Mukherjee et al. Frequent item set, sequential pattern mining and sequence prediction: structures and algorithms
CN118471327B (en) Genome prediction method and device based on genotype and environment interaction heterograms
CN114860595A (en) Instance selection cross-project defect prediction method based on feature correlation analysis
CN117193889B (en) Construction method of code example library and use method of code example library
CN114896150A (en) Cross-project defect prediction method based on instance selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant