CN113176998A - Cross-project software defect prediction method based on source selection

Info

Publication number: CN113176998A
Application number: CN202110503077.0A
Authority: CN
Inventors: 文万志; 张瑞年; 朱宁波; 陈义; 尹思文; 李元金; 程实
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-07-27

Abstract

The invention provides a cross-project software defect prediction method based on source selection, comprising the following steps: S1, constructing a data set; S2, constructing a feature selection method set FSelection; S3, obtaining an optimal feature selection method BFmethod; S4, obtaining an optimal feature quantity FThreshold; S5, constructing a source project selection method set SPselection; S6, constructing a cross-project defect prediction method CPSPM based on source selection. The invention provides several source project selection methods, which supply better source projects for subsequent training and can effectively improve the efficiency of software defect prediction.

Description

Cross-project software defect prediction method based on source selection

Technical Field

The invention belongs to the technical field of software defect prediction, and particularly relates to a cross-project software defect prediction method based on source selection, which improves data set quality through source project selection and thereby improves software defect prediction results.

Background

As software development advances rapidly, developers inadvertently introduce errors during development; these errors are software defects. Software defects are a serious hidden danger: they not only degrade the user experience and software quality, but can also endanger public safety, so potential defects in software need to be discovered as early as possible.

Software defect prediction helps software developers predict potential defects effectively, and researchers have recently proposed many methods for improving prediction results. However, cross-project defect prediction remains extremely difficult, mainly because the data distributions of the source project and the target project differ greatly, which leads to poor prediction performance.

Disclosure of Invention

The invention aims to provide a cross-project software defect prediction method based on source selection, which improves the accuracy of cross-project software defect prediction, can effectively assist software developers to use the prediction model to reduce defects in the software development process, and has higher accuracy and efficiency.

To solve the above technical problem, an embodiment of the present invention provides a cross-project software defect prediction method based on source selection, including the following steps:

S1, constructing a data set;

S2, constructing a feature selection method set FSelection;

S3, obtaining an optimal feature selection method BFmethod;

S4, obtaining an optimal feature quantity FThreshold;

S5, constructing a source project selection method set SPselection;

S6, constructing a cross-project defect prediction method CPSPM based on source selection.

The specific steps of step S1 are:

S1.1, acquiring a software project set from an open source website;

S1.2, constructing a project instance set by taking each class of a project as an instance;

S1.3, constructing the feature set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on open source data history records, project source code syntactic structures and source code abstract syntax trees;

wherein WMC represents weighted methods per class; DIT represents the depth of the inheritance tree; NOC represents the number of children (direct subclasses); CBO represents the coupling between object classes; RFC represents the response for a class; LCOM and LCOM3 represent the lack of cohesion in methods; NPM represents the number of public methods; DAM represents the data access metric; MOA represents the measure of aggregation; MFA represents the measure of functional abstraction; CAM represents the cohesion among methods of a class; IC represents inheritance coupling; CBM represents the coupling between methods; AMC represents the average method complexity; Ca represents afferent coupling; Ce represents efferent coupling; Max_CC represents the maximum McCabe cyclomatic complexity; Avg_CC represents the average McCabe cyclomatic complexity; LOC represents the number of lines of code;

S1.4, forming a defect prediction data set DATASET based on the instances and the features.
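The data set structure of steps S1.2 to S1.4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Instance` record, the `build_dataset` and `defect_rate` helpers, and the input row layout (project, class name, metric values, bug count) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List

# The 20 class-level metrics named in S1.3.
FEATURES = [
    "WMC", "DIT", "NOC", "CBO", "RFC", "LCOM", "LCOM3", "NPM", "DAM", "MOA",
    "MFA", "CAM", "IC", "CBM", "AMC", "Ca", "Ce", "Max_CC", "Avg_CC", "LOC",
]


@dataclass
class Instance:
    """One project class (S1.2) with its metric values and defect label."""
    project: str
    name: str
    metrics: List[float]  # one value per entry of FEATURES
    defective: bool


def build_dataset(rows):
    """Turn raw (project, class name, metric values, bug count) rows into instances.

    The row layout is hypothetical; the patent does not specify a file format.
    """
    dataset = []
    for project, name, metrics, bugs in rows:
        if len(metrics) != len(FEATURES):
            raise ValueError(
                f"{name}: expected {len(FEATURES)} metrics, got {len(metrics)}"
            )
        dataset.append(Instance(project, name, [float(m) for m in metrics], bugs > 0))
    return dataset


def defect_rate(dataset):
    """Fraction of instances labelled defective."""
    return sum(inst.defective for inst in dataset) / len(dataset)
```

A class is treated as defective when its recorded bug count is greater than zero, which is an assumption about the labelling rule.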

The specific steps of step S2 are:

FSelection = {RF, CL, GR, IG, OR, SU};

wherein the RF method evaluates the value of an attribute by repeatedly sampling an instance and considering the value of the given attribute in the nearest instances of the same class and of different classes; it can operate on both discrete and continuous class data;

the CL method determines the value of an attribute by measuring the correlation between the attribute and the class; a nominal attribute is considered value by value, each value being treated as an indicator, and the overall correlation of a nominal attribute is obtained as a weighted average;

the GR method evaluates the value of an attribute by measuring its gain ratio with respect to the class;

the IG method evaluates the value of an attribute by measuring its information gain with respect to the class;

the OR method predicts using the minimum-error attribute and can discretize numerical attributes;

the SU method evaluates the value of an attribute by measuring its symmetrical uncertainty with respect to the class.
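The three information-theoretic scores in this set (IG, GR and SU) can be illustrated with a short sketch. It assumes the attribute has already been discretized and uses the textbook entropy-based definitions; the function names are introduced here for illustration and are not taken from the patent.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(attr, labels):
    """IG: H(class) - H(class | attribute)."""
    n = len(labels)
    conditional = 0.0
    for value in set(attr):
        subset = [l for a, l in zip(attr, labels) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(attr, labels):
    """GR: information gain normalised by the entropy of the attribute."""
    h_attr = entropy(attr)
    return info_gain(attr, labels) / h_attr if h_attr > 0 else 0.0


def symmetrical_uncertainty(attr, labels):
    """SU: 2 * IG / (H(attribute) + H(class)), a value in [0, 1]."""
    denom = entropy(attr) + entropy(labels)
    return 2.0 * info_gain(attr, labels) / denom if denom > 0 else 0.0
```

Ranking the 20 features by any of these scores and keeping the top fn gives the feature subset used in the later steps.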

The specific steps of step S3 are:

S3.1, constructing a feature quantity set: initially, the feature quantity starts from α with step length β, and γ feature quantities are selected to form the feature quantity set {α + β, …, α + γβ}, where α + γβ equals the total number of features, 20;

S3.2, selecting a feature selection method fs from the set FSelection;

S3.3, selecting a feature quantity fn from the feature quantity set;

S3.4, training on the data set DATASET based on fs, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;

S3.5, repeating steps S3.3 to S3.4 until all the feature quantities have been selected;

S3.6, repeating steps S3.2 to S3.5 until all the feature selection methods have been selected;

S3.7, obtaining the optimal feature selection method BFmethod by comparing the F-measure performance parameters.
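The nested loop of steps S3.2 to S3.7 can be sketched as below. The `evaluate` callback is hypothetical: it stands for "select fn features with fs, train logistic regression on DATASET, return the F-measure". The text does not say how the F-measure values are compared in S3.7; this sketch assumes the method with the highest mean F-measure over all feature quantities wins.

```python
def best_feature_selection(methods, feature_counts, evaluate):
    """Return (best method, its mean F-measure, full score table).

    evaluate(fs, fn) -> F-measure is supplied by the caller (hypothetical).
    """
    scores = {}
    for fs in methods:              # S3.2, repeated by S3.6
        for fn in feature_counts:   # S3.3, repeated by S3.5
            scores[(fs, fn)] = evaluate(fs, fn)  # S3.4

    # S3.7 (assumption): compare methods by mean F-measure across counts.
    mean_by_method = {
        fs: sum(scores[(fs, fn)] for fn in feature_counts) / len(feature_counts)
        for fs in methods
    }
    best = max(mean_by_method, key=mean_by_method.get)
    return best, mean_by_method[best], scores
```

Returning the full score table keeps the per-count results available, for example to plot one F-measure curve per method.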

The specific steps of step S4 are:

S4.1, constructing a feature quantity set: initially, the feature quantity starts from α with step length β, and γ feature quantities are selected to form the feature quantity set {α + β, …, α + γβ}, where α + γβ equals the total number of features, 20;

S4.2, selecting BFmethod as the feature selection method;

S4.3, selecting a feature quantity fn from the feature quantity set;

S4.4, training on the data set DATASET based on BFmethod, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;

S4.5, repeating steps S4.3 to S4.4 until all the feature quantities have been selected;

S4.6, comparing the F-measure performance parameters to obtain the optimal feature quantity FThreshold.
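A small sketch of steps S4.1 to S4.6 follows. The helper names and the `evaluate` callback (train with BFmethod and fn features, return the F-measure) are assumptions, as is the tie-breaking rule of preferring the smallest feature quantity that reaches the best score.

```python
def feature_quantity_set(alpha, beta, gamma):
    """S4.1: the set {alpha + beta, ..., alpha + gamma * beta}."""
    return [alpha + k * beta for k in range(1, gamma + 1)]


def best_feature_quantity(counts, evaluate):
    """S4.3 to S4.6: score every feature quantity and pick the best one.

    evaluate(fn) -> F-measure is a caller-supplied (hypothetical) function.
    Ties are broken in favour of the smallest quantity (assumption).
    """
    scores = {fn: evaluate(fn) for fn in counts}
    top = max(scores.values())
    f_threshold = min(fn for fn, score in scores.items() if score == top)
    return f_threshold, scores
```

For example, `feature_quantity_set(0, 2, 10)` yields the even quantities from 2 to 20, which satisfies the condition that α + γβ equals 20.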

The specific steps of step S5 are:

S5.1, constructing a source project selection method mean_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) for i = 1, 2, ..., n, and Y′ = log(1 + Y); Mean(X_i′) is the vector consisting of the averages of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.2, constructing a source project selection method std_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y); Std(X_i′) is the vector consisting of the standard deviations of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.3, constructing a source project selection method median_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y); Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.4, constructing a source project selection method median_zscore: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = zscore(X_i) and Y′ = zscore(Y); Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.5, constructing a source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two similarity-based training data selection strategies that use distance as the similarity measure (EM-Clustering, Nearest Neighbor Selection);

S5.6, forming the source project selection method set SPSelection = {mean_log, std_log, median_zscore, TDS}.
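The four distance-based selectors of S5.1 to S5.4 share one shape: transform each project, summarise every feature column with one statistic, and pick the source project whose summary vector is closest to the target's. The sketch below rests on several assumptions not stated in the text: the Euclidean distance is taken between the summary vectors, the nearest project is the one selected, the z-score is computed per project and per feature using the population standard deviation, and all names are introduced for illustration.

```python
import math
import statistics


def log_transform(project):
    """X' = log(1 + X), applied to every metric value."""
    return [[math.log(1.0 + v) for v in row] for row in project]


def zscore_transform(project):
    """X' = zscore(X), computed per feature column within one project."""
    cols = list(zip(*project))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in project
    ]


def summarize(project, stat):
    """Vector holding one statistic per feature column."""
    return [stat(col) for col in zip(*project)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# (transform, per-feature statistic) for each method name.
METHODS = {
    "mean_log": (log_transform, statistics.fmean),
    "std_log": (log_transform, statistics.pstdev),
    "median_log": (log_transform, statistics.median),
    "median_zscore": (zscore_transform, statistics.median),
}


def select_source(sources, target, method):
    """Return (name of the nearest source project, all distances).

    sources maps project name -> list of instance rows (metric values).
    """
    transform, stat = METHODS[method]
    target_vec = summarize(transform(target), stat)
    dists = {
        name: euclidean(summarize(transform(rows), stat), target_vec)
        for name, rows in sources.items()
    }
    return min(dists, key=dists.get), dists
```

The log transform requires non-negative metric values, which holds for the count-style metrics listed in S1.3.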

The specific steps of step S6 are:

S6.1, selecting a method from the source project selection method set SPselection of step S5 for testing;

S6.2, using FThreshold features, calculating the prediction and evaluation results between the selected source project and the other projects;

S6.3, calculating the average value of the prediction results obtained under the same source project selection method;

S6.4, repeating steps S6.1 to S6.3 until all the source project selection methods have been tested;

S6.5, comparing the average values of the prediction results to obtain the optimal source project selection method;

S6.6, obtaining the cross-project defect prediction method CPSPM.
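Steps S6.1 to S6.5 amount to averaging a performance score per selection method and keeping the method with the highest average. In the sketch below, `evaluate(method, target)` is a hypothetical callback that selects a source project for the target with the given method, trains on it with FThreshold features, and returns the score on the target; the function name and return shape are likewise assumptions.

```python
from statistics import fmean


def best_source_selection(methods, targets, evaluate):
    """Return (best selection method, average score per method).

    evaluate(method, target) -> score is supplied by the caller (hypothetical).
    """
    averages = {}
    for method in methods:                                   # S6.1, repeated by S6.4
        results = [evaluate(method, t) for t in targets]     # S6.2
        averages[method] = fmean(results)                    # S6.3
    best = max(averages, key=averages.get)                   # S6.5
    return best, averages
```

The chosen method, together with BFmethod and FThreshold, then defines the CPSPM pipeline of S6.6.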

In software defect prediction research, the F-measure index is widely used to measure the effectiveness of a method. The index is built from two parameters, Precision and Recall.

Precision indicates the percentage of instances classified as defective that are correctly classified. TP represents the number of defective modules predicted as defective, TN the number of non-defective modules predicted as non-defective, FP the number of non-defective modules predicted as defective, and FN the number of defective modules predicted as non-defective.

Recall indicates the percentage of all defective modules that are correctly classified as defective. The higher the value, the higher the probability that the model correctly identifies a defect, and the more defective modules can be identified.

Accuracy is the proportion of correctly classified modules among all modules: the higher the proportion, the higher the Accuracy of the model classification, and the lower the proportion, the lower the Accuracy.

F-measure is a composite of the two parameters Precision and Recall. Its value lies between 0 and 1, and the higher the value, the better the model performs.
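The formulas for these indexes are not reproduced in this text. The sketch below uses the standard definitions in terms of TP, TN, FP and FN, and assumes that F-measure means the balanced harmonic mean of Precision and Recall (F1); the function names are illustrative.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 when nothing was predicted defective."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 when there are no defective modules."""
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (assumed F1 form)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

With 30 true positives, 50 true negatives, 10 false positives and 10 false negatives, Precision and Recall are both 0.75, Accuracy is 0.8 and the F-measure is 0.75.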

The technical scheme of the invention has the following beneficial effects:

The cross-project software defect prediction method based on source selection provided by the invention offers several source project selection methods and selects source projects in combination with the corresponding feature selection method, which helps to greatly improve the software defect prediction results.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a diagram of the construction of the feature selection method set and the source project selection method set in the present invention;

FIG. 3 is a plot of the F-measure obtained by the six feature selection methods in the present invention;

FIG. 4 is a graph showing the results obtained by using the RF method in the present invention;

FIG. 5 is a graph of the Accuracy obtained using different source project selection techniques in the present invention;

FIG. 6 is a graph of the F-measure obtained using different source project selection techniques in the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, the present invention provides a cross-project software defect prediction method based on source selection, which is mainly used for optimizing software defect prediction performance, and comprises the following steps:

S1, constructing a data set;

S2, constructing a feature selection method set FSelection;

S3, obtaining an optimal feature selection method BFmethod;

S4, obtaining an optimal feature quantity FThreshold;

S5, constructing a source project selection method set SPselection;

Step S1, the specific steps of data set construction are as follows:

Taking the PROMISE dataset as an example, projects are selected from the PROMISE dataset for testing. The dataset description mainly includes the name of each data set, the number of defective modules, the total number of modules, the number of module features, and the percentage of defective instances among all instances.

Step S2, the specific steps of constructing the feature selection method set FSelection are as follows:

FSelection = {RF, CL, GR, IG, OR, SU}.

The RF method evaluates the value of an attribute by repeatedly sampling an instance and considering the value of the given attribute in the nearest instances of the same class and of different classes. It can operate on both discrete and continuous class data;

the CL method determines the value of an attribute by measuring the correlation between the attribute and the class. A nominal attribute is considered value by value, each value being treated as an indicator. The overall correlation of a nominal attribute is obtained as a weighted average;

The construction process is shown in FIG. 2.
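The description of the RF method (sample an instance, then compare an attribute's value in the nearest same-class and different-class instances) matches the Relief family of attribute evaluators. The sketch below assumes that reading and implements only the basic Relief update for a binary class with numeric features and a single nearest hit and miss; it is an illustration, not the evaluator used in the experiments, and every name in it is hypothetical.

```python
import random


def relief_weights(X, y, n_samples=None, seed=0):
    """Basic Relief: reward features that differ on the nearest miss
    (other class) and penalise features that differ on the nearest hit
    (same class). Returns one weight per feature; higher is better."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Per-feature value range, used to normalise differences to [0, 1].
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    m = n_samples or n
    weights = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)  # repeatedly sample an instance
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / m
    return weights
```

Sorting the features by weight in descending order and keeping the first fn gives the subset evaluated in steps S3 and S4.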

Step S3, the specific steps of obtaining the optimal feature selection method BFmethod are as follows:

The six methods are each tested over the same range of feature quantities, and the test results are shown in FIG. 3. As can be seen from the figure, the performance of the six methods differs greatly when the number of features is small and is nearly identical when the number of features is greater than 14.

Based on the evaluation of the effects of the above six feature selection methods, the present invention finally uses the RF method as the feature selection method.

Step S4, the specific steps of obtaining the optimal feature quantity FThreshold are as follows:

initially, the feature number starts at 1, the step size is 1, and 1 feature is selected as the feature number set { α + β.

A feature selection method fs is selected from the set FSelection and a feature quantity fn from the feature quantity set; a model is trained on the data set based on fs, fn and a logistic regression classification algorithm, and the Accuracy and F-measure values are obtained. As can be seen from FIG. 4, the F-measure value increases as the number of features increases, eventually approaching 0.3; from 2 features onward, the Accuracy value also increases with the number of features, eventually fluctuating around 0.6.

Step S5, the specific steps of constructing the source item selection method set SPSelection are as follows:

S5.1, constructing a source project selection method mean_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) for i = 1, 2, ..., n, and Y′ = log(1 + Y). Mean(X_i′) is the vector consisting of the averages of all feature metric values of X_i′. Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project.

S5.2, constructing a source project selection method std_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y). Std(X_i′) is the vector consisting of the standard deviations of all feature metric values of X_i′. Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project.

S5.3, constructing a source project selection method median_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y). Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′. Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project.

S5.4, constructing a source project selection method median_zscore: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = zscore(X_i) and Y′ = zscore(Y). Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′. Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project.

S5.5, constructing a source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two similarity-based training data selection strategies that use distance as the similarity measure (EM-Clustering, Nearest Neighbor Selection).

S6, the concrete steps of constructing the cross-project defect prediction method CPSPM based on source selection are as follows:

The Accuracy values obtained using the four source project selection methods are shown in FIG. 5, which also compares them with the TDS method. It can be seen from the figure that, when the number of features is small, the Accuracy values obtained with the TDS, std_log and mean_log methods are all lower than those obtained without a selection technique; as the number of features increases, the Accuracy values obtained with the four methods stabilise and become comparable to the values obtained without a selection technique.

The F-measure values obtained using the four source project selection methods are shown in FIG. 6. When the number of features is less than 9, only the mean_log method yields a value greater than that obtained without a selection technique. As the number of features increases, the value obtained by the TDS method gradually increases, and at 9 features its F-measure exceeds that obtained without a selection technique for the first time. At 20 features, every method except median_zscore yields a higher F-measure than using no selection technique, although median_zscore outperforms the other methods when the number of features is between 6 and 12.

The invention provides four different source project selection methods. First, six feature selection methods are compared, and the RF feature selection method, which performs better, is obtained; it is then used for the subsequent source project selection, where the Accuracy and F-measure indexes are each used for evaluation. The results show that both indexes increase as the number of features increases. At 20 features, the methods other than median_zscore are superior to using no selection technique; nevertheless, from the experimental point of view, the median_zscore method is still the best source project selection method.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A cross-project software defect prediction method based on source selection is characterized by comprising the following steps:

S1, constructing a data set;

S2, constructing a feature selection method set FSelection;

S3, obtaining an optimal feature selection method BFmethod;

S4, obtaining an optimal feature quantity FThreshold;

S5, constructing a source project selection method set SPselection;

2. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S1 are:

S1.1, acquiring a software project set from an open source website;

3. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S2 are:

FSelection = {RF, CL, GR, IG, OR, SU};

4. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S3 are:

S3.2, selecting a feature selection method fs from the set FSelection;

S3.3, selecting a feature quantity fn from the feature quantity set;

5. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S4 are:

S4.2, selecting BFmethod as the feature selection method;

S4.3, selecting a feature quantity fn from the feature quantity set;

6. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S5 are:

S5.1, constructing a source project selection method mean_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) for i = 1, 2, ..., n, and Y′ = log(1 + Y); Mean(X_i′) is the vector consisting of the averages of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.2, constructing a source project selection method std_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y); Std(X_i′) is the vector consisting of the standard deviations of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.3, constructing a source project selection method median_log: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = log(1 + X_i) and Y′ = log(1 + Y); Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.4, constructing a source project selection method median_zscore: for a given set of source projects {X₁, X₂, ..., X_n} and a target project Y, let X_i′ = zscore(X_i) and Y′ = zscore(Y); Median(X_i′) is the vector consisting of the medians of all feature metric values of X_i′; Dist(X_i′, Y′) is the Euclidean distance between X_i′ and Y′; if Dist(X_j′, Y′) is the minimum of {Dist(X₁′, Y′), Dist(X₂′, Y′), ..., Dist(X_n′, Y′)}, then X_j is selected as the source project;

S5.5, constructing a source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two similarity-based training data selection strategies that use distance as the similarity measure;

7. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S6 are:

and S6.6, obtaining a cross-project defect prediction method CPSPM.