CN113176998A - Cross-project software defect prediction method based on source selection - Google Patents

Info

Publication number
CN113176998A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110503077.0A
Other languages
Chinese (zh)
Inventor
文万志
张瑞年
朱宁波
陈义
尹思文
李元金
程实
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention provides a cross-project software defect prediction method based on source selection, comprising the following steps: S1, constructing a data set; S2, constructing a feature selection method set FSelection; S3, obtaining an optimal feature selection method BFmethod; S4, obtaining an optimal feature quantity FThreshold; S5, constructing a source project selection method set SPSelection; S6, constructing a cross-project defect prediction method CPSPM based on source selection. The invention provides several source project selection methods, which supply better source projects for subsequent data training and can effectively improve the efficiency of software defect prediction.

Description

Cross-project software defect prediction method based on source selection
Technical Field
The invention belongs to the technical field of software defect prediction, and particularly relates to a cross-project software defect prediction method based on source selection, which is mainly used for optimizing the data set quality in the aspect of source project selection and further improving a software defect prediction result.
Background
During rapid software development, developers inadvertently introduce errors, and these errors are software defects. Software defects pose serious risks: they degrade the user experience and the quality of the software and can even endanger public safety, so potential defects in software need to be discovered as early as possible.
Software defect prediction helps developers predict potential defects effectively, and researchers have recently proposed many methods aimed at improving prediction results. However, cross-project software defect prediction remains extremely difficult, mainly because the data distributions of the source project and the target project differ greatly, which leads to poor prediction performance.
Disclosure of Invention
The invention aims to provide a cross-project software defect prediction method based on source selection that improves the accuracy of cross-project software defect prediction and effectively helps software developers use the prediction model to reduce defects during development, with higher accuracy and efficiency.
To solve the above technical problem, an embodiment of the present invention provides a cross-project software defect prediction method based on source selection, including the following steps:
S1, constructing a data set;
S2, constructing a feature selection method set FSelection;
S3, obtaining an optimal feature selection method BFmethod;
S4, obtaining an optimal feature quantity FThreshold;
S5, constructing a source project selection method set SPSelection;
S6, constructing a cross-project defect prediction method CPSPM based on source selection.
The specific steps of step S1 are:
S1.1, acquiring a software project set from an open source website;
S1.2, constructing a project instance set by taking each class of a project as an instance;
S1.3, constructing the feature set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on open source data history records, project source code syntactic structures and source code abstract syntax trees;
wherein WMC represents weighted methods per class; DIT represents the depth of the inheritance tree; NOC represents the number of children (subclasses); CBO represents the coupling between object classes; RFC represents the response for a class; LCOM and LCOM3 represent the lack of cohesion in methods; NPM represents the number of public methods; DAM represents the data access metric; MOA represents the measure of aggregation; MFA represents the measure of functional abstraction; CAM represents the cohesion among methods of a class; IC represents inheritance coupling; CBM represents the coupling between methods; AMC represents the average method complexity; Ca represents afferent couplings; Ce represents efferent couplings; Max_CC represents the maximum McCabe cyclomatic complexity; Avg_CC represents the average McCabe cyclomatic complexity; LOC represents the number of lines of code;
S1.4, forming the defect prediction data set DATASET from the instances and the features.
The specific steps of step S2 are:
FSelection = {RF, CL, GR, IG, OR, SU};
wherein the RF method evaluates the value of an attribute by repeatedly sampling an instance and considering the value of the given attribute in the nearest instances of the same class and of different classes; it can operate on both discrete and continuous class data;
the CL method determines the value of an attribute by measuring the correlation between the attribute and the class; a nominal attribute is considered value by value, each value being treated as an indicator, and the overall correlation of a nominal attribute is obtained as a weighted average;
the GR method evaluates the value of an attribute by measuring its gain ratio with respect to the class;
the IG method evaluates the value of an attribute by measuring its information gain with respect to the class;
the OR method predicts with the single attribute that gives the minimum error and can discretize numeric attributes;
the SU method evaluates the value of an attribute by measuring its symmetrical uncertainty with respect to the class.
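As an illustration of the entropy-based evaluators in this set, the following minimal sketch computes information gain (IG), gain ratio (GR) and symmetrical uncertainty (SU) for one discrete attribute. It assumes the usual textbook definitions of these measures; the function names and the toy data are hypothetical and are not taken from the invention.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def info_gain(attribute, labels):
    """IG: H(class) - H(class | attribute) for a discrete attribute."""
    total = len(labels)
    conditional = 0.0
    for value in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(attribute, labels):
    """GR: information gain divided by the entropy of the attribute itself."""
    split = entropy(attribute)
    return info_gain(attribute, labels) / split if split > 0 else 0.0


def symmetrical_uncertainty(attribute, labels):
    """SU: 2 * IG / (H(attribute) + H(class)), which lies in [0, 1]."""
    denom = entropy(attribute) + entropy(labels)
    return 2.0 * info_gain(attribute, labels) / denom if denom > 0 else 0.0


# Toy data (hypothetical): a metric discretized into two bins, and defect labels.
loc_bin = ["high", "high", "low", "low"]
defective = [1, 1, 0, 0]
print(info_gain(loc_bin, defective),
      gain_ratio(loc_bin, defective),
      symmetrical_uncertainty(loc_bin, defective))
```

Numeric metrics such as LOC would have to be discretized first; how the invention does this is not stated, so the binning here is only a placeholder.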
The specific steps of step S3 are:
S3.1, constructing a feature quantity set: the feature quantity starts from $\alpha$ with step size $\beta$, and $\gamma$ feature quantities are selected to form the set $\{\alpha + \beta, \ldots, \alpha + \gamma\beta\}$, where $\alpha + \gamma\beta$ equals the total number of features, 20;
S3.2, selecting a feature selection method fs from the set FSelection;
S3.3, selecting a feature quantity fn from the feature quantity set;
S3.4, training on the data set DATASET using fs, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;
S3.5, repeating steps S3.3 to S3.4 until all feature quantities have been selected;
S3.6, repeating steps S3.2 to S3.5 until all feature selection methods have been selected;
S3.7, obtaining the optimal feature selection method BFmethod by comparing the F-measure performance parameters.
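The loop structure of steps S3.2 to S3.7 can be sketched as follows. This is a minimal sketch under stated assumptions: the training step is abstracted behind a hypothetical callable `evaluate(fs, fn)` that returns an F-measure, the comparison in S3.7 is assumed to keep the method with the highest single F-measure, and `toy_evaluate` returns placeholder numbers that are not experimental results.

```python
def best_feature_selection_method(methods, feature_counts, evaluate):
    """Return (method, score), where score is the best F-measure found for that method."""
    best_method, best_score = None, float("-inf")
    for fs in methods:                  # S3.2 / S3.6: iterate over FSelection
        for fn in feature_counts:       # S3.3 / S3.5: iterate over the feature quantities
            score = evaluate(fs, fn)    # S3.4: train and obtain the F-measure
            if score > best_score:      # S3.7: compare the F-measure values
                best_method, best_score = fs, score
    return best_method, best_score


FSELECTION = ["RF", "CL", "GR", "IG", "OR", "SU"]


def toy_evaluate(fs, fn):
    """Hypothetical stand-in for training a logistic regression model.

    The numbers below are placeholders chosen for the example only.
    """
    offsets = {"RF": 0.05, "CL": 0.04, "GR": 0.03, "IG": 0.02, "OR": 0.01, "SU": 0.0}
    return offsets[fs] + 0.01 * fn


bf_method, bf_score = best_feature_selection_method(FSELECTION, range(1, 21), toy_evaluate)
print(bf_method, bf_score)
```

In a real run, `evaluate` would rank the features with method fs, keep the top fn of them, fit the classifier and return the measured F-measure.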
The specific steps of step S4 are:
S4.1, constructing a feature quantity set: the feature quantity starts from $\alpha$ with step size $\beta$, and $\gamma$ feature quantities are selected to form the set $\{\alpha + \beta, \ldots, \alpha + \gamma\beta\}$, where $\alpha + \gamma\beta$ equals the total number of features, 20;
S4.2, taking BFmethod as the feature selection method;
S4.3, selecting a feature quantity fn from the feature quantity set;
S4.4, training on the data set DATASET using BFmethod, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;
S4.5, repeating steps S4.3 to S4.4 until all feature quantities have been selected;
S4.6, comparing the F-measure performance parameters to obtain the optimal feature quantity FThreshold.
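A minimal sketch of the feature quantity set of S4.1 and the selection of FThreshold in S4.6 follows. The values of alpha, beta and gamma and the scoring function `toy_f_measure` are hypothetical; the sketch also assumes that "comparing" means taking the feature quantity with the highest F-measure.

```python
def feature_count_set(alpha, beta, gamma):
    """Build [alpha + beta, alpha + 2*beta, ..., alpha + gamma*beta]."""
    return [alpha + k * beta for k in range(1, gamma + 1)]


def optimal_feature_count(counts, f_measure_of):
    """S4.3 to S4.6: score every feature quantity and keep the best-scoring one.

    max() returns the first maximum, so ties go to the smaller quantity
    when counts is in ascending order.
    """
    return max(counts, key=f_measure_of)


# Hypothetical parameters chosen so that alpha + gamma * beta equals 20.
counts = feature_count_set(alpha=0, beta=2, gamma=10)


def toy_f_measure(fn):
    """Placeholder score that peaks at 12 features; not an experimental result."""
    return -((fn - 12) ** 2)


f_threshold = optimal_feature_count(counts, toy_f_measure)
print(counts, f_threshold)
```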
The specific steps of step S5 are:
S5.1, constructing the source project selection method mean_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$, where $i = 1, 2, \ldots, n$, and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Mean}(X_i')$ is the vector consisting of the averages of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.2, constructing the source project selection method std_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Std}(X_i')$ is the vector consisting of the standard deviations of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.3, constructing the source project selection method median_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.4, constructing the source project selection method median_zscore: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \mathrm{zscore}(X_i)$ and $Y' = \mathrm{zscore}(Y)$; $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.5, constructing the source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two training data selection strategies (EM clustering and nearest neighbor selection) that use similarity as the distance;
S5.6, composing the source project selection method set SPSelection = {mean_log, std_log, median_log, median_zscore, TDS}.
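The mean_log method of S5.1 can be sketched as follows. Two points are assumptions rather than statements of the invention: the distance is taken between the per-feature mean vectors of the log-transformed projects (projects generally have different numbers of instances, so a distance between the raw matrices is not defined), and the source project with the minimum distance is chosen. The project names and metric values are made up for the example.

```python
from math import log, sqrt


def log_transform(project):
    """Apply x -> log(1 + x) to every metric value of every instance."""
    return [[log(1.0 + v) for v in row] for row in project]


def column_means(project):
    """Vector of per-feature averages (one entry per metric)."""
    n = len(project)
    return [sum(col) / n for col in zip(*project)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_source_mean_log(sources, target):
    """Return the name of the source project whose mean vector is closest to the target's."""
    target_mean = column_means(log_transform(target))
    return min(
        sources,
        key=lambda name: euclidean(column_means(log_transform(sources[name])), target_mean),
    )


# Hypothetical projects: rows are class instances, columns are two metrics.
toy_sources = {
    "A": [[1, 10], [3, 30]],
    "B": [[100, 5], [300, 7]],
}
toy_target = [[2, 20], [2, 22]]
chosen = select_source_mean_log(toy_sources, toy_target)
print(chosen)
```

The std_log and median_log variants follow the same pattern with the standard deviation or the median in place of `column_means`.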
The specific steps of step S6 are:
S6.1, selecting a method from the source project selection method set SPSelection of step S5 for testing;
S6.2, with the number of features set to FThreshold, calculating the prediction and evaluation results between the selected source project and the other projects;
S6.3, calculating the average of the prediction results under the same source project selection method;
S6.4, repeating steps S6.1 to S6.3 until all source project selection methods have been tested;
S6.5, comparing the averages of the prediction results to obtain the optimal source project selection method;
S6.6, obtaining the cross-project defect prediction method CPSPM.
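Steps S6.3 to S6.5 amount to averaging the per-project results of each selection method and keeping the method with the best average. The sketch below assumes the F-measure is the score being averaged and that higher is better; the numbers in `toy_results` are placeholders, not results from the invention.

```python
from statistics import mean


def best_source_selection_method(results):
    """results maps a method name to a list of scores, one per target project."""
    averages = {method: mean(scores) for method, scores in results.items()}  # S6.3
    best = max(averages, key=averages.get)                                   # S6.5
    return best, averages


# Placeholder scores for illustration only.
toy_results = {
    "mean_log": [0.30, 0.40],
    "std_log": [0.20, 0.30],
    "TDS": [0.25, 0.25],
}
best_method, averages = best_source_selection_method(toy_results)
print(best_method, averages)
```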
In software defect prediction research, the F-measure index is widely used to measure the effectiveness of a method. The index is built from two parameters, Precision and Recall.
Precision is the proportion of modules predicted as defective that are actually defective. Here TP represents the number of defective modules predicted as defective, TN represents the number of non-defective modules predicted as non-defective, FP represents the number of non-defective modules predicted as defective, and FN represents the number of defective modules predicted as non-defective.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of all defective modules that are correctly predicted as defective. The higher the value, the higher the probability that the model correctly identifies defects, and the more defective modules can be identified.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Accuracy is the proportion of correctly classified modules among all modules; the higher this proportion, the higher the classification accuracy of the model.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
F-measure combines the two parameters Precision (P) and Recall (R).
$$F\text{-measure} = \frac{2 \times P \times R}{P + R}$$
The value of the F-measure lies between 0 and 1, and the higher the value, the better the model performs.
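The four indexes can be computed directly from the confusion counts. The following sketch uses the standard definitions in terms of TP, TN, FP and FN given above; the label encoding (1 for defective, 0 for non-defective), the function names and the toy labels are assumptions made for the example.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 means defective."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Return Precision, Recall, Accuracy and F-measure; empty denominators give 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


# Toy labels (hypothetical): three defective and five non-defective modules.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
scores = evaluate(*confusion_counts(actual, predicted))
print(scores)
```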
The technical scheme of the invention has the following beneficial effects:
the cross-project software defect prediction method based on source selection provided by the invention provides a plurality of source project selection methods, and selects the source projects by combining the corresponding characteristic selection methods, so that the method is favorable for greatly improving the software defect prediction effect.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of a method for constructing feature selection sets and a method for selecting source items in accordance with the present invention;
FIG. 3 is an F-measure image obtained by six feature selection methods according to the present invention;
FIG. 4 is a graph showing the results obtained by using the RF method according to the present invention;
FIG. 5 is an Accuracy graph obtained using different source item selection techniques in the present invention;
FIG. 6 is a diagram of F-measure obtained using different source item selection techniques in the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the invention provides a cross-project software defect prediction method based on source selection, mainly used to improve software defect prediction performance, comprising the following steps:
S1, constructing a data set;
S2, constructing a feature selection method set FSelection;
S3, obtaining an optimal feature selection method BFmethod;
S4, obtaining an optimal feature quantity FThreshold;
S5, constructing a source project selection method set SPSelection;
S6, constructing a cross-project defect prediction method CPSPM based on source selection.
Step S1, the specific steps of constructing the data set are as follows:
Taking the PROMISE dataset as an example, projects are selected from it for testing. For each project, the dataset records the project name, the number of defective modules, the total number of modules, the number of module features, and the percentage of defective instances among all instances.
Step S2, the specific steps of constructing the feature selection method set FSelection are as follows:
FSelection = {RF, CL, GR, IG, OR, SU}.
The RF method evaluates the value of an attribute by repeatedly sampling an instance and considering the value of the given attribute in the nearest instances of the same class and of different classes. It can operate on both discrete and continuous class data;
the CL method determines the value of an attribute by measuring the correlation between the attribute and the class. A nominal attribute is considered value by value, each value being treated as an indicator. The overall correlation of a nominal attribute is obtained as a weighted average;
the GR method evaluates the value of an attribute by measuring its gain ratio with respect to the class;
the IG method evaluates the value of an attribute by measuring its information gain with respect to the class;
the OR method predicts with the single attribute that gives the minimum error and can discretize numeric attributes;
the SU method evaluates the value of an attribute by measuring its symmetrical uncertainty with respect to the class.
The construction process is shown in FIG. 2.
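The description of the RF method (sampling instances and comparing an attribute's value in the nearest instances of the same and of different classes) matches the Relief family of attribute evaluators. Under that assumption, the following is a simplified sketch of basic Relief for numeric features and binary labels. It visits every instance once instead of sampling at random and uses a single nearest hit and miss, so it is not the full method used by the invention; all names and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief: reward features that differ on the nearest miss, not on the nearest hit."""
    n_features = len(X[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*X)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    weights = [0.0] * n_features
    for i, xi in enumerate(X):
        hits = [x for k, x in enumerate(X) if k != i and y[k] == y[i]]
        misses = [x for k, x in enumerate(X) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no neighbour of one kind, so nothing to compare
        hit = min(hits, key=lambda x: dist(xi, x))
        miss = min(misses, key=lambda x: dist(xi, x))
        for j in range(n_features):
            weights[j] += (diff(j, xi, miss) - diff(j, xi, hit)) / len(X)
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.9, 4.0], [1.0, 0.0]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
print(weights)
```

Features are then ranked by weight, and the top fn of them are kept for training.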
Step S3, the specific steps of obtaining the optimal feature selection method BFmethod are as follows:
Over the same range of feature quantities, the six methods are tested separately, and the results are shown in FIG. 3. As can be seen from the figure, the performance of the six methods differs greatly when the number of features is small and becomes nearly identical when the number of features is greater than 14.
Based on this evaluation of the six feature selection methods, the invention finally uses the RF method as the feature selection method.
Step S4, the specific steps of obtaining the optimal feature quantity FThreshold are as follows:
initially, the feature number starts at 1, the step size is 1, and 1 feature is selected as the feature number set { α + β.
A feature selection method fs is selected from the set FSelection and a feature quantity fn from the feature quantity set; the data set is trained with fs, fn and a logistic regression classification algorithm to obtain the Accuracy and F-measure values. As can be seen from FIG. 4, the F-measure value increases with the number of features, eventually approaching 0.3; starting from 2 features, the Accuracy value also increases with the number of features, eventually fluctuating around 0.6.
Step S5, the specific steps of constructing the source item selection method set SPSelection are as follows:
S5.1, constructing the source project selection method mean_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$, where $i = 1, 2, \ldots, n$, and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$. $\mathrm{Mean}(X_i')$ is the vector consisting of the averages of all feature metric values of $X_i'$. $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project.
S5.2, constructing the source project selection method std_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$. $\mathrm{Std}(X_i')$ is the vector consisting of the standard deviations of all feature metric values of $X_i'$. $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project.
S5.3, constructing the source project selection method median_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$. $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$. $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project.
S5.4, constructing the source project selection method median_zscore: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \mathrm{zscore}(X_i)$ and $Y' = \mathrm{zscore}(Y)$. $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$. $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project.
S5.5, constructing the source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two training data selection strategies (EM clustering and nearest neighbor selection) that use similarity as the distance.
S5.6, composing the source project selection method set SPSelection = {mean_log, std_log, median_log, median_zscore, TDS}.
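The median_zscore method of S5.4 can be sketched as below. The sketch assumes that each project is standardized independently, feature by feature, with its own mean and population standard deviation, that the distance is taken between the per-feature median vectors, and that the closest source project is selected; none of these details is fixed by the text above. Project names and values are made up.

```python
from math import sqrt
from statistics import mean, median, pstdev


def zscore(project):
    """Standardize each feature column of one project to zero mean and unit deviation."""
    cols = list(zip(*project))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in project]


def median_vector(project):
    """Vector of per-feature medians."""
    return [median(col) for col in zip(*project)]


def select_source_median_zscore(sources, target):
    """Return the source whose standardized median vector is closest to the target's."""
    target_med = median_vector(zscore(target))

    def distance(name):
        source_med = median_vector(zscore(sources[name]))
        return sqrt(sum((a - b) ** 2 for a, b in zip(source_med, target_med)))

    return min(sources, key=distance)


# Hypothetical single-metric projects: "A" is skewed like the target, "B" is symmetric.
toy_sources = {
    "A": [[1], [2], [9]],
    "B": [[1], [2], [3]],
}
toy_target = [[10], [20], [90]]
chosen = select_source_median_zscore(toy_sources, toy_target)
print(chosen)
```

Because standardization removes scale, this variant compares the shape of each metric's distribution rather than its magnitude, which is why the skewed project "A" is preferred over "B" in the toy data.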
Step S6, the specific steps of constructing the cross-project defect prediction method CPSPM based on source selection are as follows:
The Accuracy values obtained with the four source project selection methods are shown in FIG. 5, together with the TDS method for comparison. As can be seen from the figure, when the number of features is small, the Accuracy values obtained with the TDS, std_log and mean_log methods are all lower than those obtained without a selection technique; as the number of features increases, the Accuracy values obtained with these four methods stabilize and become comparable to those obtained without a selection technique.
The F-measure values obtained with the four source project selection methods are shown in FIG. 6. When the number of features is less than 9, only the mean_log method obtains a value greater than that obtained without a selection technique; as the number of features increases, the value obtained by the TDS method gradually increases, and at 9 features its F-measure exceeds that obtained without a selection technique for the first time. At 20 features, all methods except median_zscore obtain an F-measure higher than that obtained without a selection technique, but the median_zscore method is superior to the other methods when the number of features is between 6 and 12.
The invention provides four different source project selection methods. First, six feature selection methods are compared and the RF feature selection method, which performs best, is obtained; this method is then used in the subsequent source project selection, where the Accuracy and F-measure indexes are each used to evaluate the methods. The results show that both indexes increase with the number of features. At 20 features, the methods other than median_zscore outperform the approach without a selection technique; nevertheless, judging from the experiments as a whole, the median_zscore method is still the best source project selection method.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A cross-project software defect prediction method based on source selection is characterized by comprising the following steps:
S1, constructing a data set;
S2, constructing a feature selection method set FSelection;
S3, obtaining an optimal feature selection method BFmethod;
S4, obtaining an optimal feature quantity FThreshold;
S5, constructing a source project selection method set SPSelection;
S6, constructing a cross-project defect prediction method CPSPM based on source selection.
2. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S1 are:
S1.1, acquiring a software project set from an open source website;
S1.2, constructing a project instance set by taking each class of a project as an instance;
S1.3, constructing the feature set {WMC, DIT, NOC, CBO, RFC, LCOM, LCOM3, NPM, DAM, MOA, MFA, CAM, IC, CBM, AMC, Ca, Ce, Max_CC, Avg_CC, LOC} based on open source data history records, project source code syntactic structures and source code abstract syntax trees;
wherein WMC represents weighted methods per class; DIT represents the depth of the inheritance tree; NOC represents the number of children (subclasses); CBO represents the coupling between object classes; RFC represents the response for a class; LCOM and LCOM3 represent the lack of cohesion in methods; NPM represents the number of public methods; DAM represents the data access metric; MOA represents the measure of aggregation; MFA represents the measure of functional abstraction; CAM represents the cohesion among methods of a class; IC represents inheritance coupling; CBM represents the coupling between methods; AMC represents the average method complexity; Ca represents afferent couplings; Ce represents efferent couplings; Max_CC represents the maximum McCabe cyclomatic complexity; Avg_CC represents the average McCabe cyclomatic complexity; LOC represents the number of lines of code;
S1.4, forming the defect prediction data set DATASET from the instances and the features.
3. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S2 are:
FSelection = {RF, CL, GR, IG, OR, SU};
wherein the RF method evaluates the value of an attribute by repeatedly sampling an instance and considering the value of the given attribute in the nearest instances of the same class and of different classes; it can operate on both discrete and continuous class data;
the CL method determines the value of an attribute by measuring the correlation between the attribute and the class; a nominal attribute is considered value by value, each value being treated as an indicator, and the overall correlation of a nominal attribute is obtained as a weighted average;
the GR method evaluates the value of an attribute by measuring its gain ratio with respect to the class;
the IG method evaluates the value of an attribute by measuring its information gain with respect to the class;
the OR method predicts with the single attribute that gives the minimum error and can discretize numeric attributes;
the SU method evaluates the value of an attribute by measuring its symmetrical uncertainty with respect to the class.
4. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S3 are:
S3.1, constructing a feature quantity set: the feature quantity starts from $\alpha$ with step size $\beta$, and $\gamma$ feature quantities are selected to form the set $\{\alpha + \beta, \ldots, \alpha + \gamma\beta\}$, where $\alpha + \gamma\beta$ equals the total number of features, 20;
S3.2, selecting a feature selection method fs from the set FSelection;
S3.3, selecting a feature quantity fn from the feature quantity set;
S3.4, training on the data set DATASET using fs, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;
S3.5, repeating steps S3.3 to S3.4 until all feature quantities have been selected;
S3.6, repeating steps S3.2 to S3.5 until all feature selection methods have been selected;
S3.7, obtaining the optimal feature selection method BFmethod by comparing the F-measure performance parameters.
5. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S4 are:
S4.1, constructing a feature quantity set: the feature quantity starts from $\alpha$ with step size $\beta$, and $\gamma$ feature quantities are selected to form the set $\{\alpha + \beta, \ldots, \alpha + \gamma\beta\}$, where $\alpha + \gamma\beta$ equals the total number of features, 20;
S4.2, taking BFmethod as the feature selection method;
S4.3, selecting a feature quantity fn from the feature quantity set;
S4.4, training on the data set DATASET using BFmethod, fn and a logistic regression classification algorithm, and obtaining the F-measure performance parameter;
S4.5, repeating steps S4.3 to S4.4 until all feature quantities have been selected;
S4.6, comparing the F-measure performance parameters to obtain the optimal feature quantity FThreshold.
6. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S5 are:
S5.1, constructing the source project selection method mean_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$, where $i = 1, 2, \ldots, n$, and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Mean}(X_i')$ is the vector consisting of the averages of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.2, constructing the source project selection method std_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Std}(X_i')$ is the vector consisting of the standard deviations of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.3, constructing the source project selection method median_log: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \log(1 + X_i)$ and $Y' = \log(1 + Y)$; $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.4, constructing the source project selection method median_zscore: for a given set of source projects $\{X_1, X_2, \ldots, X_n\}$ and a target project $Y$, let $X_i' = \mathrm{zscore}(X_i)$ and $Y' = \mathrm{zscore}(Y)$; $\mathrm{Median}(X_i')$ is the vector consisting of the medians of all feature metric values of $X_i'$; $\mathrm{Dist}(X_i', Y')$ is the Euclidean distance between $X_i'$ and $Y'$; if $\mathrm{Dist}(X_j', Y')$ is the minimum of $\{\mathrm{Dist}(X_1', Y'), \mathrm{Dist}(X_2', Y'), \ldots, \mathrm{Dist}(X_n', Y')\}$, then $X_j$ is selected as the source project;
S5.5, constructing the source project selection method TDS: the method selects data according to the distribution characteristics of the data, and provides two training data selection strategies that use similarity as the distance;
S5.6, composing the source project selection method set SPSelection = {mean_log, std_log, median_log, median_zscore, TDS}.
7. The method for cross-project software defect prediction based on source selection as claimed in claim 1, wherein the specific steps of step S6 are:
S6.1, selecting a method from the source project selection method set SPSelection of step S5 for testing;
S6.2, with the number of features set to FThreshold, calculating the prediction and evaluation results between the selected source project and the other projects;
S6.3, calculating the average of the prediction results under the same source project selection method;
S6.4, repeating steps S6.1 to S6.3 until all source project selection methods have been tested;
S6.5, comparing the averages of the prediction results to obtain the optimal source project selection method;
S6.6, obtaining the cross-project defect prediction method CPSPM.
CN202110503077.0A 2021-05-10 2021-05-10 Cross-project software defect prediction method based on source selection Pending CN113176998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110503077.0A CN113176998A (en) 2021-05-10 2021-05-10 Cross-project software defect prediction method based on source selection

Publications (1)

Publication Number Publication Date
CN113176998A true CN113176998A (en) 2021-07-27

Family

ID=76928591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110503077.0A Pending CN113176998A (en) 2021-05-10 2021-05-10 Cross-project software defect prediction method based on source selection

Country Status (1)

Country Link
CN (1) CN113176998A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391369A (en) * 2017-07-13 2017-11-24 武汉大学 A cross-project defect prediction method based on data screening and data oversampling
US20190265970A1 (en) * 2018-02-28 2019-08-29 Fujitsu Limited Automatic identification of relevant software projects for cross project learning
CN111581116A (en) * 2020-06-16 2020-08-25 江苏师范大学 Cross-project software defect prediction method based on hierarchical data screening
CN111966586A (en) * 2020-08-05 2020-11-20 南通大学 Cross-project defect prediction method based on module selection and weight updating

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
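WANZHI WEN et al.: "An Empirical Study on Combining Source Selection and Transfer Learning for Cross-Project Defect Prediction", 《2019 IEEE 1ST INTERNATIONAL WORKSHOP ON INTELLIGENT BUG FIXING (IBF)》 *
王莉萍: "Design and Implementation of an Ensemble Cross-Project Defect Prediction Method Based on Instance Selection", 《China Master's Theses Full-text Database, Information Science and Technology》 *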

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510431A (en) * 2022-04-20 2022-05-17 武汉理工大学 Workload-aware intelligent contract defect prediction method, system and equipment
CN114924962A (en) * 2022-05-17 2022-08-19 北京航空航天大学 Cross-project software defect prediction data selection method
CN114924962B (en) * 2022-05-17 2024-05-31 北京航空航天大学 Cross-project software defect prediction data selection method
CN115269378A (en) * 2022-06-23 2022-11-01 南通大学 Cross-project software defect prediction method based on domain feature distribution
CN115269377A (en) * 2022-06-23 2022-11-01 南通大学 Cross-project software defect prediction method based on optimization instance selection
CN115269378B (en) * 2022-06-23 2023-06-09 南通大学 Cross-project software defect prediction method based on domain feature distribution

Similar Documents

Publication Publication Date Title
CN113176998A (en) Cross-project software defect prediction method based on source selection
Wang et al. Input feature selection method based on feature set equivalence and mutual information gain maximization
Wang et al. Truth discovery via exploiting implications from multi-source data
US20200257731A1 (en) Disambiguation of massive graph databases
CN114564410A (en) Software defect prediction method based on class level source code similarity
CN116226103A (en) Method for detecting government data quality based on FPGrow algorithm
Gao et al. Adapting the TopLeaders algorithm for dynamic social networks
Li et al. A new density peak clustering algorithm based on cluster fusion strategy
Yao et al. An improved clustering algorithm and its application in wechat sports users analysis
Qinl et al. Synthesizing privacy preserving entity resolution datasets
Song et al. On saving outliers for better clustering over noisy data
Malik et al. A comprehensive approach towards data preprocessing techniques & association rules
Li et al. A novel approach to remote sensing image retrieval with multi-feature VP-tree indexing and online feature selection
CN113705920B (en) Method for generating water data sample set for thermal power plant and terminal equipment
Wu et al. Optimization and improvement based on K-Means Cluster algorithm
Li et al. Intelligent fuzzy optimization algorithm of data mining based on BP neural network
Lv et al. Active learning of three-way decision based on neighborhood entropy
CN111652384B (en) Balancing method for data volume distribution and data processing method
CN109086373B (en) Method for constructing fair link prediction evaluation system
Shao et al. Research on Cross‐Company Defect Prediction Method to Improve Software Security
Shao et al. A quantitative measurement method of code quality evaluation indicators based on data mining
Hang et al. A hierarchical clustering algorithm based on K-means with constraints
Gong et al. Diversified and Compatible Web APIs Recommendation in IoT
CN113723835B (en) Water consumption evaluation method and terminal equipment for thermal power plant
Wang et al. Resisting the edge-type disturbance for link prediction in heterogeneous networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination