CN111090579A

CN111090579A - Software defect prediction method based on Pearson correlation weighting association classification rule

Info

Publication number: CN111090579A
Application number: CN201911114620.7A
Authority: CN
Inventors: 王世海; 邵元勋; 刘斌; 严潇波
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-11-14
Filing date: 2019-11-14
Publication date: 2020-05-01
Anticipated expiration: 2039-11-14
Also published as: CN111090579B

Abstract

The invention discloses a software defect prediction method based on a Pearson correlation weighted association classification rule. The method comprises: extracting the software metric element data set to be tested with a corresponding static code analysis tool; evaluating the correlation between each metric element and the category with a feature selection method based on the Pearson correlation, ranking the correlations, and taking the top 30-50% by rank value as the selected metric elements; and substituting the selected metric elements and the corresponding categories into a software defect prediction model based on the Pearson correlation weighted association classification rule, which predicts and outputs the prediction result. The method uses a valuable, high-performance and understandable rule model to reveal the association between defect tendency and the features, which improves the performance and understandability of the software defect prediction model and the accuracy of the prediction result.

Description

Software defect prediction method based on Pearson correlation weighting association classification rule

Technical Field

The invention relates to the technical field of software defect prediction, and in particular to a software defect prediction method based on a Pearson correlation weighted association classification rule.

Background

As the size and complexity of software increase, ensuring software quality becomes increasingly important. Software defect prediction is a method for improving software quality, and an effective means of reducing the burden of code review and improving the allocation of test resources. Commonly used software defect prediction methods mainly include classification, regression, clustering and association rules. Association rule mining is an algorithm for discovering association relations hidden in data. Its production-style form of expression matches the way people reason, and is written as "if X then C" or

X ⇒ C, where X ⊂ I, C ⊂ I and X ∩ C = ∅.

The antecedent X is a set of features (also called items or metric elements), and the consequent C can be a set of features or categories (e.g., positive and negative); I = {I₁, I₂, ···, I_(m-1), C} is an item set containing m items. Whether an association rule is useful is typically measured by support and confidence: the greater the support and confidence, the more useful the rule. The support is the probability that X and C occur together, while the confidence is the probability that C occurs given that X occurs. Association rule based software defect prediction has attracted interest across the software development life cycle because it helps improve prediction performance and helps explain the association between defect conditions (such as defect tendency, defect type, workload) and the metric elements.

Software defect prediction based on association rules mainly comprises data preprocessing, association rule model training and model evaluation. The traditional association rule algorithm Apriori and its variants have been shown to achieve high accuracy. However, software defect prediction data is high-dimensional and class-imbalanced: high dimensionality means that the data set has many features, and class imbalance means that the number of samples of one class (the majority class) is far larger than that of the other classes (the minority class). During model training, association rule algorithms therefore tend to generate a large number of majority class (low-risk, defect-free tendency) rules and to ignore minority class (high-risk, defect tendency) rules, so that the majority class is predicted accurately while prediction performance on the minority class is low.

Most conventional association rule algorithms rely on support and confidence thresholds. If the thresholds are set too high, minority class rules are difficult to mine and prediction performance on the minority class is low; if they are set too low, too many rules may be generated, which eventually leads to overfitting. It is therefore necessary to change the traditional framework of association rule mining and analysis, which depends only on support and confidence.

In addition, the classical class association rule algorithms (CBA, CMAR, GARC, ECBA, etc.) treat all feature items as equally important and do not consider differences in importance among features. For example, a quality model built on a real data set shows that the number of code lines and the number of blank lines influence high-risk modules to different degrees. The importance of the metric elements therefore cannot be ignored; otherwise the value of the discovered knowledge is compromised. A number of weighted association rule mining methods were later proposed to take the importance of individual items into account.

What distinguishes a rule-based software defect prediction model from non-rule-based models is that it considers the importance of item sets as well as the importance of individual attribute features. Weighted association classification rules shift the focus of knowledge discovery toward important items instead of letting combinations explode indiscriminately. Attribute features with higher impact are given higher weights, and attribute features with lower impact are given lower weights. High-weight attribute features thus keep a higher priority in the rule set, while low-weight attribute features have a lower priority and are pruned during the pruning stage.

Domain-experience weighted association rule algorithms assign feature weights from prior, subjective experience. This works well on data sets with few features, but on high-dimensional software defect data it cannot guarantee accurate weights for all features. The weights are also subjective to some degree, so the discovered rules tend toward already known, low-value patterns, which hinders the mining of hidden knowledge. Automated weighted association rule algorithms have therefore attracted much attention; however, these algorithms are very sensitive to imbalanced data and are suited to sparse data rather than dense data.

How to better solve the feature weight assignment problem for imbalanced software defect data and improve prediction accuracy is therefore an urgent problem for practitioners in the field.

Disclosure of Invention

In view of the above problems, the invention provides a software defect prediction method based on a Pearson correlation weighted association classification rule. Using an improved association rule algorithm, the software defect prediction model based on this rule addresses the problem that few defect rules are found when mining software defect data.

In a first aspect, an embodiment of the present invention provides a software defect prediction method based on a Pearson correlation weighted association classification rule, including:

S1, extracting the software metric element data set to be tested with a corresponding static code analysis tool;

S2, evaluating the correlation between each metric element and each category with a feature selection method based on the Pearson correlation, ranking the correlations, and taking the top 30-50% by rank value as the selected metric elements;

S3, substituting the selected metric elements and the corresponding categories into a software defect prediction model based on the Pearson correlation weighted association classification rule, and predicting and outputting a prediction result.
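
Step S2 can be illustrated with a minimal sketch. It assumes that metric elements are ranked by the absolute value of their Pearson correlation with a 0/1 class label; the function names, the toy data and the use of the absolute value are assumptions made for illustration and are not taken from the source.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)

def select_metrics(columns, labels, keep_fraction=0.4):
    """Rank metric elements by |r| with the label and keep the top share."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], labels)),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]

# Hypothetical toy data: three metric elements for five modules,
# label 1 = defect tendency class, 0 = defect-free tendency class.
metrics = {
    "loc": [10, 200, 35, 400, 50],
    "blank_lines": [3, 2, 4, 3, 2],
    "cyclomatic": [1, 9, 2, 15, 3],
}
labels = [0, 1, 0, 1, 0]
selected = select_metrics(metrics, labels, keep_fraction=0.5)
print(selected)
```

The `keep_fraction` parameter corresponds to the 30-50% share named in step S2; any value in that range could be passed.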

In one embodiment, the software defect prediction model based on the Pearson correlation weighted association classification rule is constructed by the following steps:

S31, preprocessing the acquired software defect data to obtain a software defect training set and a software defect test set, the software defect training set consisting of a defect tendency training set and a defect-free tendency training set;

S32, setting a minimum weighted support for the defect tendency class and a minimum weighted support for the defect-free tendency class, and using a weighted Apriori algorithm to construct a defect tendency association rule set from the defect tendency training set and a defect-free tendency association rule set from the defect-free tendency training set;

S33, sorting the defect tendency association rule set and the defect-free tendency association rule set;

S34, pruning the defect tendency association rule set and the defect-free tendency association rule set together, using a conflict rule pruning method and a redundant rule pruning method;

S35, predicting the software module with the optimized defect tendency association rule set and defect-free tendency association rule set.

In one embodiment, the constructing step further comprises:

S36, verifying the software defect prediction model based on the Pearson correlation weighted association classification rule with the test set.

In one embodiment, step S31 includes:

S311, performing a first horizontal division of each data set D by category into a defect tendency data set TS and a defect-free tendency data set FS, satisfying D = TS ∪ FS and TS ∩ FS = ∅;

S312, performing a first vertical division of the defect tendency data set TS and the defect-free tendency data set FS into a defect tendency training set TTS, a defect tendency test set Ttest, a defect-free tendency training set FTS and a defect-free tendency test set Ftest, satisfying TS = TTS ∪ Ttest, TTS ∩ Ttest = ∅, FS = FTS ∪ Ftest and FTS ∩ Ftest = ∅, where the defect tendency training set TTS and the defect-free tendency training set FTS together form the complete training set;

S313, evaluating the correlation between each feature of the training set and each category with a feature selection method based on the Pearson correlation, ranking the correlations, and taking the top 30-50% by rank value as the selected features;

S314, discretizing the training set after feature selection with a 5-bin equal-frequency method, which divides the values of each numerical feature, in order of magnitude, evenly into five intervals;

S315, performing a second horizontal division of the discretized training set DS into a discretized defective tendency training set FDS and a discretized nondefective training set TDS, satisfying DS = FDS ∪ TDS and FDS ∩ TDS = ∅.
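
The equal-frequency discretization of step S314 can be sketched as follows. This is an illustrative simplification rather than the patented procedure: the function name and the sample values are hypothetical, and tied values are split by sort position instead of being forced into the same interval.

```python
def equal_frequency_bins(values, n_bins=5):
    """Assign each value a bin index 0..n_bins-1 so that bins hold about equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # The rank of the value in sorted order decides its interval.
        bins[idx] = rank * n_bins // len(values)
    return bins

# Hypothetical lines-of-code values for ten modules.
loc = [12, 340, 45, 90, 7, 150, 60, 500, 30, 220]
loc_bins = equal_frequency_bins(loc)
print(loc_bins)
```

After this step each (feature, interval) pair can be treated as one item of a transaction, which is the form the rule mining stage operates on.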

In one embodiment, step S312 includes:

performing the first vertical division of the defect tendency data set TS and the defect-free tendency data set FS by 5-fold cross validation, in which 4 folds are used as the training set and 1 fold as the test set in turn, and the procedure is repeated 10 times with the data re-randomized each time.
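
A minimal sketch of this repeated 5-fold split, applied separately to TS and FS so that each class keeps its own train/test partition. The generator name, the seed handling and the index ranges are assumptions for illustration only.

```python
import random

def repeated_kfold(indices, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists: k folds per repeat, reshuffled every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

# Hypothetical module indices: 10 defect tendency modules (TS), 40 defect-free (FS).
ts_indices = list(range(10))
fs_indices = list(range(10, 50))
splits = list(zip(repeated_kfold(ts_indices), repeated_kfold(fs_indices)))

(tts, ttest), (fts, ftest) = splits[0]
training_set = tts + fts  # TTS and FTS together form the complete training set
print(len(splits), len(tts), len(ttest), len(fts), len(ftest))
```

Splitting each class on its own keeps the class ratio of every fold equal to that of the full data set, which matters for imbalanced defect data.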

In one embodiment, step S32 includes:

S321, calculating the correlation between each selected feature and the class in the discretized training set using the Pearson correlation, where a feature is a metric element; the calculation formula is as follows:

wherein n represents the number of features; k is the feature index (0, 1, 2, …, n); class represents the category, either the defect tendency class or the defect-free tendency class; rank(k) represents the correlation of the k-th feature;

S322, calculating the weight of each selected feature from its correlation, wherein the calculation formula is as follows:

wherein n represents the number of features, k is the feature index (0, 1, 2, …, n), and weight(k) represents the weight of feature k;

S323, calculating the weight of the item set from the feature weights, wherein the item set weight is calculated as follows:

weight(itemset) = weight(X₁) × weight(X₂) × ··· × weight(X_k) = weight(item₁) × weight(item₂) × ··· × weight(item_k)    (3);

S324, obtaining the item set support as the probability that the items of the feature item set occur together in the training set, calculating the item set weight by formula (3), and calculating the weighted support from the item set support and the item set weight, wherein the calculation formula is as follows:

wsupp(r) = weight(itemset) × support(itemset)    (4);

S325, mining the weighted candidate item sets that satisfy the minimum weighted support of the defect tendency class and those that satisfy the minimum weighted support of the defect-free tendency class. Weighted candidate item sets whose weighted support is not less than the minimum are called frequent item sets, and each frequent item set forms a class association rule whose antecedent is the feature set and whose consequent is the class label (defect tendency class or defect-free tendency class). Here confidence denotes the probability that the defect tendency class or the defect-free tendency class occurs given that the antecedent occurs. Within the training set of each class the feature item sets vary but the class label is constant, so the confidence always equals 1. Confidence therefore cannot measure the accuracy of a rule here, and its influence on the class association rules can be ignored.
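
Steps S323 to S325 can be sketched as a level-wise search run once per class training set, each with its own minimum weighted support. This is a simplified illustration, not the patented implementation: items are assumed to be (feature, interval) pairs, the weights and thresholds are invented, and the Apriori-style subset check relies on the assumption that all weights lie in (0, 1], so that weighted support can only shrink as items are added.

```python
from itertools import combinations
from math import prod

def wsupp(itemset, transactions, weights):
    """Formula (4): item set weight (formula (3)) times plain support."""
    weight = prod(weights[feature] for feature, _interval in itemset)
    count = sum(1 for t in transactions if itemset <= t)
    return weight * count / len(transactions)

def mine_rules(transactions, weights, min_wsupp, label):
    """Return (antecedent, label, weighted support) for every frequent item set."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    size = 1
    rules = []
    while level:
        frequent = set()
        for candidate in level:
            ws = wsupp(candidate, transactions, weights)
            if ws >= min_wsupp:
                frequent.add(candidate)
                rules.append((candidate, label, ws))
        size += 1
        # Join frequent item sets, then keep candidates whose subsets are all frequent.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        level = [c for c in joined
                 if all(frozenset(s) in frequent for s in combinations(c, size - 1))]
    return rules

# Hypothetical discretized training sets, one per class.
defect_set = [frozenset({("loc", 4), ("cc", 4)}),
              frozenset({("loc", 4), ("cc", 3)}),
              frozenset({("loc", 4), ("cc", 4)})]
clean_set = [frozenset({("loc", 0), ("cc", 0)}),
             frozenset({("loc", 1), ("cc", 0)}),
             frozenset({("loc", 0), ("cc", 1)}),
             frozenset({("loc", 0), ("cc", 0)})]
weights = {"loc": 0.75, "cc": 0.6}  # assumed feature weights

defect_rules = mine_rules(defect_set, weights, min_wsupp=0.25, label="defect")
clean_rules = mine_rules(clean_set, weights, min_wsupp=0.30, label="defect-free")
print(len(defect_rules), len(clean_rules))
```

Because each class is mined from its own training set, every rule's class label is fixed in advance, which is why confidence carries no information in this setting.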

In one embodiment, the step S33 includes:

sorting the defect tendency association rule set and the defect-free tendency association rule set by priority, using in turn the weighted support, the length of the antecedent, and the order of generation.

In one embodiment, the conflict rule pruning in step S34 is: when the antecedents of two rules contain the same features and their consequents belong to different categories, both rules are removed;

the redundant rule pruning in step S34 is: when the antecedents of several rules have an inclusion relation and their consequents belong to the same class, the redundant rule with the smaller weighted support is pruned, using the weighted support as the criterion.

In one embodiment, the step S35 includes:

if the sum of the weighted supports of the high-risk defect tendency rules that the software module satisfies is greater than the sum of the weighted supports of the low-risk defect-free tendency rules that it satisfies, the software module is classified into the defect tendency class; otherwise it is classified into the defect-free tendency class. If the software module satisfies none of the rules, it is classified into the defect tendency class, as in the formula:

wherein c represents the defect tendency class or the defect-free tendency class, r_c represents a defect tendency class rule or defect-free tendency class rule that satisfies the condition, and R represents the defect tendency class rule set or the defect-free tendency class rule set.
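
The voting rule described above can be sketched as follows. The function name, the class label strings and the example rules are hypothetical; each rule is assumed to be an (antecedent, weighted support) pair and a module a set of (feature, interval) items.

```python
def predict(module_items, defect_rules, clean_rules):
    """Classify a module by comparing summed weighted supports of matching rules."""
    defect_hits = [ws for antecedent, ws in defect_rules if antecedent <= module_items]
    clean_hits = [ws for antecedent, ws in clean_rules if antecedent <= module_items]
    if not defect_hits and not clean_hits:
        return "defect"  # no rule matches: default to the defect tendency class
    return "defect" if sum(defect_hits) > sum(clean_hits) else "defect-free"

# Hypothetical optimized rule sets.
defect_rules = [(frozenset({("loc", 4)}), 0.75),
                (frozenset({("loc", 4), ("cc", 4)}), 0.30)]
clean_rules = [(frozenset({("cc", 4)}), 0.20)]

print(predict(frozenset({("loc", 4), ("cc", 4)}), defect_rules, clean_rules))
print(predict(frozenset({("loc", 1), ("cc", 4)}), defect_rules, clean_rules))
print(predict(frozenset({("loc", 2), ("cc", 2)}), defect_rules, clean_rules))
```

Defaulting unmatched modules to the defect tendency class errs on the side of flagging a module for review rather than missing a defect.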

In one embodiment, the step S36 includes:

substituting the test set into the software defect prediction model based on the Pearson correlation weighted association classification rule to obtain an evaluation result;

evaluating the classification result with G-mean, MCC and Balance;

the evaluation indexes G-mean, MCC and Balance are defined as follows:

wherein TP is the number of defective modules classified as defective, FN is the number of defective modules classified as non-defective, FP is the number of non-defective modules classified as defective, and TN is the number of non-defective modules classified as non-defective;

and comparing the evaluation indexes of the software defect prediction model based on the Pearson correlation weighted association classification rule with those of prediction models built with classical association rule algorithms.
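
The three indexes can be computed from the confusion matrix as sketched below. The source text does not reproduce its own formulas, so this sketch assumes the definitions commonly used in defect prediction: G-mean as the geometric mean of recall and specificity, MCC as the Matthews correlation coefficient, and Balance as one minus the normalized distance from the ideal point (pf = 0, pd = 1). The confusion matrix values are invented, and the sketch assumes both classes are present in the test set.

```python
from math import sqrt

def evaluate(tp, fn, fp, tn):
    """Return G-mean, MCC and Balance for a binary confusion matrix."""
    pd = tp / (tp + fn)  # probability of detection (recall)
    pf = fp / (fp + tn)  # probability of false alarm
    g_mean = sqrt(pd * (1 - pf))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    balance = 1 - sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / sqrt(2)
    return {"g_mean": g_mean, "mcc": mcc, "balance": balance}

# Hypothetical result on an imbalanced test set: 10 defective, 90 non-defective modules.
scores = evaluate(tp=8, fn=2, fp=10, tn=80)
print(scores)
```

All three indexes account for both classes, so a model that labels every module as non-defective scores poorly even though its plain accuracy would be high.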

The invention provides a software defect prediction method based on a Pearson correlation weighted association classification rule. The method assigns different weights to the metric elements with a weighting method based on the Pearson correlation; because the weights are computed from correlations observed in the sample data, the subjectivity of manually set weights is avoided. The software defect prediction model based on the Pearson correlation weighted association classification rule improves the accuracy of the prediction result.

Furthermore, by constructing a software defect prediction framework for imbalanced data based on association classification rules, the invention provides an automatic feature weighting method that is insensitive to imbalanced data and combines it with the generation, sorting, pruning and prediction of association classification rules. The result is a valuable, high-performance and understandable rule model that reveals the association between defect tendency and the metric elements and improves the performance and understandability of the software defect prediction model. The four stages of generation, sorting, pruning and prediction are all optimized with the weighted support, which improves the accuracy of the prediction result.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart of the software defect prediction method based on the Pearson correlation weighted association classification rule provided by an embodiment of the present invention;

FIG. 2 is a flowchart of constructing the software defect prediction model based on the Pearson correlation weighted association classification rule;

FIG. 3 is a flowchart of the preprocessing provided by an embodiment of the present invention;

FIG. 4 is a flowchart of step S32 provided by an embodiment of the present invention;

FIG. 5 is another flowchart of constructing the software defect prediction model based on the Pearson correlation weighted association classification rule provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Referring to FIG. 1, the software defect prediction method based on the Pearson correlation weighted association classification rule provided by an embodiment of the present invention comprises steps S1 to S3 set out above.

In this embodiment, the static code analysis tool in step S1 is a tool that, without executing the program, scans the program code using techniques such as lexical analysis, syntax analysis and control flow analysis to verify whether the code meets criteria such as conformance to standards, security, reliability and maintainability.

In steps S2-S3, feature selection is performed on the metric element data set based on the Pearson correlation, the top 30-50% by rank value are taken as the selected metric elements, and the selected metric elements and the corresponding categories are then substituted into the software defect prediction model based on the Pearson correlation weighted association classification rule, which predicts and outputs the prediction result.

The method predicts with a software defect prediction model based on the Pearson correlation weighted association classification rule. It addresses the problem that valuable high-risk defect rules cannot be mined when association classification rules are used to build a software defect prediction model. Because a multi-support generation mechanism and feature weights are considered during rule generation, the quantity and quality of valuable high-risk defect rules increase, the model has higher prediction performance and understandability, and the accuracy of the prediction result improves.

As shown in FIG. 2, in one embodiment, the software defect prediction model based on the Pearson correlation weighted association classification rule is constructed by steps S31 to S35 set out above. In step S32, a minimum weighted support is set for the defect tendency class and for the defect-free tendency class, and a weighted Apriori algorithm is used to construct a defect tendency association rule set from the defect tendency training set and a defect-free tendency association rule set from the defect-free tendency training set; here, defect tendency can be understood as high-risk defect tendency, and defect-free tendency as low-risk defect tendency.

In one embodiment, the constructing step further comprises:

S36, verifying the software defect prediction model based on the Pearson correlation weighted association classification rule with a test set.

In this embodiment, association classification rules are chosen for their good prediction performance and understandability, and the method targets the problem that few valuable high-risk defect class rules are found on imbalanced data. Preprocessing such as feature selection, data division and discretization is first performed on the software defect metric element data. A software defect prediction model based on association rules is then built on the preprocessed software defect training set; the model comprises a weighted association rule generation stage, a sorting stage, a pruning stage and a voting stage. Finally, the model is verified and evaluated with the held-out test set. The method addresses the problem that valuable high-risk defect rules cannot be mined when association classification rules are used to build a software defect prediction model; because a multi-support generation mechanism and feature weights are considered during rule generation, the quantity and quality of valuable high-risk defect rules increase, and the model has high prediction performance and understandability.

The method extends and optimizes association classification rules and, to a certain extent, solves the problem that valuable high-risk defect class rules are easily ignored.

As shown in FIG. 3, in an embodiment, step S31 includes steps S311 to S315 set out above. In step S312, the defect tendency data set TS and the defect-free tendency data set FS are vertically divided for the first time into a defect tendency training set TTS, a defect tendency test set Ttest, a defect-free tendency training set FTS and a defect-free tendency test set Ftest, satisfying TS = TTS ∪ Ttest, TTS ∩ Ttest = ∅, FS = FTS ∪ Ftest and FTS ∩ Ftest = ∅;

Step S312 uses the repeated 5-fold cross validation set out above.

As shown in FIG. 4, in one embodiment, step S32 includes:

S321, calculating the correlation between each selected feature and the class in the discretized training set using the Pearson correlation, wherein the calculation formula is as follows:

wherein n represents the number of features; k is the feature index (0, 1, 2, …, n); class represents the category, either the defect tendency class or the defect-free tendency class; rank(k) represents the correlation of the k-th feature;

weight(itemset) = weight(X₁) × weight(X₂) × ··· × weight(X_k) = weight(item₁) × weight(item₂) × ··· × weight(item_k)    (3);

S324, obtaining the item set support as the probability that the items of the feature item set occur together in the training set, calculating the item set weight by formula (3), and calculating the weighted support from the item set support and the item set weight, wherein the calculation formula is as follows:

wsupp(r) = weight(itemset) × support(itemset)    (4);

S325, mining the weighted candidate item sets that satisfy the minimum weighted support of the defect tendency class and those that satisfy the minimum weighted support of the defect-free tendency class. Weighted candidate item sets whose weighted support is not less than the minimum are called frequent item sets, and each frequent item set forms a class association rule whose antecedent is the feature set and whose consequent is the class label (defect tendency class or defect-free tendency class). In this embodiment, confidence denotes the probability that the defect tendency class or the defect-free tendency class occurs given that the antecedent occurs. Within the training set of each class the feature item sets vary but the class label is constant, so the confidence always equals 1. Confidence therefore cannot measure the accuracy of a rule in this embodiment, and its influence on the class association rules can be ignored.

In this embodiment, in steps S321-S322, the correlation between each feature and each class is calculated with the Pearson correlation; the stronger the correlation between a feature and the class, the greater the weight given to the feature. The correlation of a feature is as shown in formula (1):

where n represents the number of features, k is the feature index (0, 1, 2, …, n), class represents the category (defect tendency class or defect-free tendency class), and rank(k) represents the correlation of the k-th feature.

Since a correlation may be very close to 1 or very close to 0, it cannot be used directly as the weight. Because the weighted support equals the product of the importance (weight) and the support, the more constraints the rule antecedent contains, the smaller the weighted support becomes; a large number of rules would be lost, and even a low minimum support threshold would yield only a few rules. To accommodate features of different relevance, the weight of each feature is set to the mean of the correlation of that feature with the class and the correlation of all features with the class, as in formula (2):

wherein n represents the number of features, k is the feature index (0, 1, 2, …, n), and weight(k) represents the weight of feature k;
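
Since formula (2) is not reproduced in the text, the following sketch shows one possible reading of the sentence above: the weight of a feature is the average of its own correlation and the mean correlation over all features. This reading, the function name and the sample correlations are assumptions, not the confirmed formula of the invention.

```python
def feature_weights(relevance):
    """Average each feature's correlation with the mean correlation of all features."""
    mean_relevance = sum(relevance.values()) / len(relevance)
    return {name: (r + mean_relevance) / 2 for name, r in relevance.items()}

# Hypothetical absolute Pearson correlations rank(k) between features and the class.
relevance = {"loc": 0.9, "cyclomatic": 0.6, "blank_lines": 0.3}
weights = feature_weights(relevance)
print(weights)
```

Under this reading the weights are pulled toward the overall mean, so a weakly correlated feature no longer drives the product of weights in formula (3) toward zero, which is the rule-loss effect the paragraph above describes.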

In steps S323-S325, the present invention considers only rules whose consequent is a category, which greatly reduces the influence of useless rules on the rule generation process. Traditional association rules are usually measured by two indexes, support and confidence. In the rule generation stage, the rules of each class are generated from the training set of that class, so P(X ∩ C) = P(X), where X represents a feature set and C represents a class label (defect tendency class or defect-free tendency class). The confidence of every rule in each class training set, P(X ∩ C)/P(X), is therefore 1.

That is, confidence has no effect in this software defect prediction framework, and the only remaining index for measuring a rule is the weighted support. Therefore, all weighted frequent item sets that satisfy the minimum weighted support set for the defect tendency class and the minimum weighted support set for the defect-free tendency class are mined, and all weighted frequent item sets are then combined to generate the weighted association rules.

The item set weight weight(itemset) is given by formula (3); when the length of the item set is 1, the weight of the item set equals the weight of the item. The weighted support wsupp of a rule is then given by formula (4). Unlike the traditional generation mode and evaluation indexes of association classification rules, the constant confidence can no longer measure a rule, so the method uses the majority class weighted support and the minority class weighted support to measure the importance of each rule.

support(X ⇒ C) = P(X ∩ C)

where X represents a set of features and C represents a class label (defect tendency class or defect-free tendency class).

weight(itemset) = weight(X₁) × weight(X₂) × ··· × weight(X_k) = weight(item₁) × weight(item₂) × ··· × weight(item_k)    (3)

where X_k denotes the k-th feature and weight(X_k) denotes the weight of the k-th feature; in the present invention a feature is equivalent to an item.

wsupp(r) = weight(itemset) × support(itemset)    (4)
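
Formulas (3) and (4) can be worked through on a small example. The sketch assumes that each item is a (feature, interval) pair produced by discretization and that the item weight is the weight of its feature; the transactions and weights are invented for illustration.

```python
from math import prod

def itemset_weight(itemset, weights):
    """Formula (3): product of the weights of the items in the item set."""
    return prod(weights[feature] for feature, _interval in itemset)

def support(itemset, transactions):
    """Share of training samples that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def weighted_support(itemset, transactions, weights):
    """Formula (4): item set weight times item set support."""
    return itemset_weight(itemset, weights) * support(itemset, transactions)

# Hypothetical defect tendency training samples after discretization.
transactions = [
    frozenset({("loc", 4), ("cc", 4)}),
    frozenset({("loc", 4), ("cc", 3)}),
    frozenset({("loc", 3), ("cc", 4)}),
    frozenset({("loc", 4), ("cc", 4)}),
]
weights = {"loc": 0.75, "cc": 0.6}

single = frozenset({("loc", 4)})
pair = frozenset({("loc", 4), ("cc", 4)})
print(weighted_support(single, transactions, weights))  # 0.75 * 3/4
print(weighted_support(pair, transactions, weights))    # 0.75 * 0.6 * 2/4
```

For the single item the item set weight equals the item weight, as stated above; for the pair, both the weight and the support shrink, which is why long antecedents receive small weighted supports.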

In one embodiment, step S33 sorts the rules by priority as follows.

In the present embodiment, given two rules R1 and R2, rule R1 is said to have a higher priority than rule R2 when:

the weighted support of rule R1 is greater than the weighted support of rule R2;

the weighted supports of rule R1 and rule R2 are equal, and the antecedent length (cardinality) of rule R1 is greater than that of rule R2; or

the weighted supports and antecedent lengths of rule R1 and rule R2 are equal, and rule R1 was generated earlier than rule R2.
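
The three-level priority above maps directly onto a composite sort key. In this sketch the `Rule` type, its field names and the sample rules are hypothetical; `order` stands for the position at which a rule was generated.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    label: str
    wsupp: float
    order: int  # generation index: smaller means generated earlier

def sort_rules(rules):
    """Higher weighted support first, then longer antecedent, then earlier generation."""
    return sorted(rules, key=lambda r: (-r.wsupp, -len(r.antecedent), r.order))

rules = [
    Rule(frozenset({"a"}), "defect", 0.3, 0),
    Rule(frozenset({"a", "b"}), "defect", 0.3, 1),
    Rule(frozenset({"c"}), "defect", 0.5, 2),
    Rule(frozenset({"d"}), "defect", 0.3, 3),
]
ranked = sort_rules(rules)
print([r.order for r in ranked])
```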

In one embodiment, the conflict rule pruning in step S34 is as follows: when two rules have the same features in their antecedents but their consequents belong to different classes, both rules are removed. For example, given the rules X＝＞C₁ and X＝＞C₂ with the same antecedent X and different classes C₁ and C₂, keeping either rule affects the other, so both rules are eliminated in order to avoid such a bias.
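Conflict pruning can be sketched as follows. The `Rule` tuple and the example rules are hypothetical; the sketch simply drops every rule whose antecedent appears with more than one class.

```python
from collections import defaultdict
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: str
    wsupp: float

def prune_conflicts(rules):
    """Remove all rules whose antecedent is shared by rules of different classes."""
    classes_by_antecedent = defaultdict(set)
    for r in rules:
        classes_by_antecedent[r.antecedent].add(r.consequent)
    return [r for r in rules if len(classes_by_antecedent[r.antecedent]) == 1]

# Hypothetical rules: the first two conflict, the third does not.
rules = [
    Rule(frozenset({"a", "b"}), "defective", 0.20),
    Rule(frozenset({"a", "b"}), "non-defective", 0.25),
    Rule(frozenset({"c"}), "defective", 0.10),
]
kept = prune_conflicts(rules)
print(kept)
```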

The redundant rule pruning in step S34 is as follows: when the antecedents of several rules are in an inclusion relation and their consequents belong to the same class, the redundant rule with the smaller weighted support is pruned. For example, of two such rules, the rule whose antecedent contains more features is usually regarded as a specific rule, whereas the rule whose antecedent contains fewer features is regarded as a general rule; the two are opposites. Once feature importance is taken into account, the weighted itemset and the weighted support depend not only on the frequency of occurrence but also on importance. The weighted support of the specific rule can then be greater than that of the general rule, so the specific rule cannot simply be removed and a new pruning approach is needed: the redundant rule with the smaller weighted support is pruned.
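A minimal sketch of weighted-support-based redundancy pruning, assuming a hypothetical `Rule` tuple. The handling of equal weighted supports is not specified in the text; keeping both rules in that case is an assumption of this sketch.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: str
    wsupp: float

def prune_redundant(rules):
    """Drop a rule when another rule of the same class has a nested antecedent
    (proper subset or superset) and a strictly larger weighted support."""
    kept = []
    for r in rules:
        dominated = any(
            o.consequent == r.consequent
            and (o.antecedent < r.antecedent or r.antecedent < o.antecedent)
            and o.wsupp > r.wsupp
            for o in rules
        )
        if not dominated:
            kept.append(r)
    return kept

# Hypothetical rules: a general rule, a specific rule, and an unrelated rule.
general = Rule(frozenset({"a"}), "defective", 0.30)
specific = Rule(frozenset({"a", "b"}), "defective", 0.35)
other = Rule(frozenset({"c"}), "defective", 0.05)
kept = prune_redundant([general, specific, other])
print(kept)
```

In this example the specific rule survives because its weighted support is larger, which is the case that motivates pruning by weighted support rather than by antecedent length.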

In one embodiment, in step S35:

When the proposed Pearson correlation weighted association classification rules are used to predict defects for a new software module instance, both defect-prone (high-risk) rules and non-defect-prone (low-risk) rules may match the module. Each weighted association classification rule carries only one index, the weighted support, so this embodiment uses the sum of the weighted supports of the matching rules as the decision criterion.

For a module instance, if the sum of the weighted supports of the matching defect-prone rules is greater than the sum of the weighted supports of the matching non-defect-prone rules, the module is classified as defect-prone; otherwise it is classified as non-defect-prone. If no rule matches the module, the module is classified as defect-prone, as expressed by the formula:

where c denotes the defective or non-defective class, r_c denotes a defective or non-defective rule that satisfies the condition, and R denotes the defective or non-defective rule set.
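The decision procedure described above can be sketched in Python. The `Rule` tuple, the class label strings, and the example rules are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    antecedent: frozenset
    consequent: str  # "defective" or "non-defective"
    wsupp: float

def predict(sample_items, rules):
    """Classify a module by comparing the summed weighted supports of matching rules."""
    matched = [r for r in rules if r.antecedent <= sample_items]
    if not matched:
        return "defective"  # no rule matches: default to the defect-prone class
    defective = sum(r.wsupp for r in matched if r.consequent == "defective")
    clean = sum(r.wsupp for r in matched if r.consequent == "non-defective")
    return "defective" if defective > clean else "non-defective"

# Hypothetical rule set for illustration.
rules = [
    Rule(frozenset({"loc=high"}), "defective", 0.40),
    Rule(frozenset({"cbo=low"}), "non-defective", 0.25),
    Rule(frozenset({"wmc=low"}), "non-defective", 0.20),
]
print(predict(frozenset({"loc=high", "cbo=low", "wmc=low"}), rules))
print(predict(frozenset({"loc=high", "cbo=low", "wmc=high"}), rules))
print(predict(frozenset({"loc=low", "cbo=high", "wmc=high"}), rules))
```

Defaulting unmatched modules to the defect-prone class follows the text and favors the minority class.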

In one embodiment, step S36 includes:

substituting the test set into the software defect prediction model based on the Pearson correlation weighted association classification rule to obtain an evaluation result;

evaluating the classification result using G-mean, MCC, and Balance;

the evaluation indexes G-mean, MCC, and Balance are defined as follows:

Verification on the test set shows that the proposed algorithm improves all evaluation indexes of the software defect prediction model and improves its ability to identify the minority defect-prone class.

As shown in FIG. 5, in a specific embodiment, the software defect prediction model based on the Pearson correlation weighted association classification rule is constructed by the following steps:

The data of this embodiment come from the public PROMISE datasets, each of which consists of a number of code features and one class feature and can be downloaded from the tera-PROMISE website (http://openscience.us/repo/defect/mccabehalsted/). In the invention, 9 object-oriented software defect datasets are collected; the number of modules is 117 at the minimum and 965 at the maximum, every dataset has more than 100 instances, and the number of features is 21, as detailed in Table 1.

Table 1 Description of the software under test

Project	Number of modules	Number of code features	Number of defective modules	Defect rate
Ant-1.3	187	21	20	0.107
Ant-1.4	178	21	40	0.225
Ant-1.5	293	21	32	0.109
Ant-1.6	351	21	92	0.262
Ant-1.7	745	21	166	0.223
Camel-1.0	339	21	13	0.038
Camel-1.2	608	21	216	0.355
Camel-1.4	872	21	145	0.166
Camel-1.6	965	21	188	0.195

In the present invention, 20 feature metrics and one class label are used, as shown in Table 2.

Table 2 20 feature metrics and 1 class label

First, feature selection, data division, and discretization are performed on each dataset. Each dataset is divided into a defective subset and a non-defective subset; each subset is randomly split into a 4-fold training part and a 1-fold test part; the defective and non-defective training parts are merged into the training set, and the defective and non-defective test parts are merged into the test set. Feature selection is then performed on the training set, the top 30%-50% of the features by ranking value are kept as the final feature subset, and the training set after feature selection is discretized into 5 equal-frequency bins. Next, the Pearson correlation method is used to obtain the feature weights automatically, so that the weights change with the data; a defective association rule set and a non-defective association rule set are generated separately with the weighted Apriori algorithm; and the rule sets are ranked, pruned, and used for prediction. Training and testing are repeated 10 times, and the final results are averaged.
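The feature selection and discretization steps can be sketched with the Python standard library. Ranking by the absolute value of the Pearson coefficient and binning by rank position are assumptions of this sketch, and the feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_features(columns, labels, keep_ratio=0.5):
    """Rank features by |r| against the 0/1 class label and keep the top share."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(keep_ratio * len(ranked)))
    return ranked[:k], scores

def equal_frequency_bins(values, n_bins=5):
    """Assign bin indexes 0..n_bins-1 so that each bin holds about the same count.
    Simplification: tied values may land in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical training data: two features and a 0/1 defect label.
columns = {"loc": [10, 200, 15, 300, 20, 250], "noise": [5, 5, 6, 5, 6, 6]}
labels = [0, 1, 0, 1, 0, 1]
selected, scores = select_features(columns, labels, keep_ratio=0.5)
bins = equal_frequency_bins([10, 200, 15, 300, 20, 250, 30, 40, 50, 60], n_bins=5)
print(selected, bins)
```

The same correlation scores could also serve as the feature weights used in the weighted support, since the text derives the weights from the Pearson correlation.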

Finally, the software defect prediction results are evaluated. The association classification algorithm CBA, naive Bayes (NB), decision table (DT), random forest (RF), and PART are selected as baseline classifiers and compared with the proposed algorithm CWCAR. Because the software defect data are imbalanced, G-mean, MCC, and Balance are used as the evaluation indexes of each classifier. The three evaluation indexes are defined as follows:

TP is the number of classifying positive samples (defective classes) as positive samples, FN is the number of classifying positive samples as negative samples (non-defective classes), FP is the number of classifying negative samples as positive samples, TN is the number of classifying negative samples as negative samples.
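The three indexes can be computed from TP, FN, FP, and TN as in the following sketch. It assumes the definitions commonly used in defect prediction work: G-mean as the geometric mean of the true positive rate and the true negative rate, MCC as the Matthews correlation coefficient, and Balance as one minus the normalized distance from the ideal point (PF = 0, PD = 1). The confusion-matrix counts are hypothetical.

```python
import math

def _ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0

def g_mean(tp, fn, fp, tn):
    pd = _ratio(tp, tp + fn)    # probability of detection (recall on defective class)
    tnr = _ratio(tn, tn + fp)   # recall on non-defective class
    return math.sqrt(pd * tnr)

def mcc(tp, fn, fp, tn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return _ratio(tp * tn - fp * fn, den)

def balance(tp, fn, fp, tn):
    pd = _ratio(tp, tp + fn)
    pf = _ratio(fp, fp + tn)    # probability of false alarm
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)

# Hypothetical confusion matrix for illustration.
tp, fn, fp, tn = 15, 5, 20, 60
print(g_mean(tp, fn, fp, tn), mcc(tp, fn, fp, tn), balance(tp, fn, fp, tn))
```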

The three indexes are insensitive to data distribution, and are beneficial to comparison among different software defect prediction model algorithms.

The distribution of the classification effect obtained in the experiment is shown in the following tables:

TABLE 3 Balance

Dataset	CWCAR	CBA	PART	DT	RF	NB
ant-1.3	0.793	0.648	0.439	0.345	0.546	0.772
ant-1.4	0.615	0.528	0.409	0.331	0.474	0.486
ant-1.5	0.726	0.523	0.390	0.321	0.494	0.705
ant-1.6	0.774	0.630	0.662	0.611	0.660	0.769
ant-1.7	0.752	0.580	0.645	0.635	0.639	0.722
camel-1.0	0.586	0.302	0.293	0.293	0.298	0.423
camel-1.2	0.537	0.431	0.465	0.351	0.511	0.508
camel-1.4	0.649	0.345	0.370	0.380	0.401	0.602
camel-1.6	0.589	0.341	0.349	0.324	0.391	0.511
Mean	0.669	0.481	0.447	0.399	0.490	0.611
Median	0.649	0.523	0.409	0.345	0.494	0.602
Rank	1.00	4.22	4.39	5.61	3.56	2.22
p-value	-	0.008	0.008	0.008	0.008	0.008

TABLE 4 MCC

TABLE 5 Gmean

Dataset	CWCAR	CBA	PART	DT	RF	NB
ant-1.3	0.809	0.691	0.327	0.121	0.539	0.789
ant-1.4	0.620	0.550	0.317	0.133	0.456	0.468
ant-1.5	0.735	0.556	0.259	0.096	0.508	0.719
ant-1.6	0.779	0.665	0.686	0.637	0.687	0.775
ant-1.7	0.754	0.618	0.675	0.669	0.668	0.729
camel-1.0	0.598	0.113	0.000	0.000	0.011	0.283
camel-1.2	0.530	0.424	0.461	0.252	0.518	0.512
camel-1.4	0.652	0.268	0.314	0.317	0.376	0.608
camel-1.6	0.586	0.259	0.254	0.170	0.355	0.522
Mean	0.674	0.460	0.366	0.266	0.458	0.601
Median	0.652	0.55	0.317	0.17	0.508	0.608
Rank	1.00	4.11	4.61	5.50	3.56	2.22
p-value	-	0.008	0.008	0.008	0.008	0.008

Where W represents a win, T represents a tie, and L represents a loss.

The tables and analysis show that the software defect prediction method based on the Pearson correlation weighted association classification rule provided by the invention performs well on Balance, MCC, and G-mean:

(1) Compared with the classic association rule algorithm CBA, the proposed CWCAR algorithm improves the mean Balance over the 9 datasets by a relative 39.1%, the mean MCC by a relative 22.7%, and the mean G-mean by a relative 46.3%; the median Balance is relatively increased by 24.1%, the median MCC is relatively decreased by 8.6%, and the median G-mean is relatively increased by 18.5%. This shows that, compared with CBA, the proposed software defect prediction method CWCAR based on the Pearson correlation weighted association classification rule has a relatively higher mean and median on the three indexes insensitive to class imbalance.

(2) Compared with the traditional rule/tree-based algorithms PART and decision table (DT), the proposed CWCAR algorithm improves the mean Balance over the 9 datasets by a relative 49.7% at least, the mean MCC by a relative 64.2% at least, and the mean G-mean by a relative 84.1% at least; the median Balance is relatively increased by at least 58.7%, the median MCC by at least 97.5%, and the median G-mean by at least 105.7%. This shows that, compared with PART and DT, the proposed software defect prediction method CWCAR based on the Pearson correlation weighted association classification rule has a relatively higher mean and median on the three indexes insensitive to class imbalance.

(3) Compared with the non-rule-based algorithms naive Bayes (NB) and random forest (RF), the proposed CWCAR algorithm improves the mean Balance over the 9 datasets by a relative 9.5% at least, the mean MCC by a relative 5.7% at least, and the mean G-mean by a relative 12.2% at least; the median Balance is relatively increased by at least 7.8%, the median MCC by at least 6.8%, and the median G-mean by at least 7.2%. This shows that, compared with NB and RF, the proposed software defect prediction method CWCAR based on the Pearson correlation weighted association classification rule has a relatively higher mean and median on the three indexes insensitive to class imbalance.

(4) The mean ranks on the three indexes are compared, where a smaller rank means better prediction performance. On the Balance index, CWCAR ranks 1.00, naive Bayes (NB) follows with 2.22, and decision table (DT) ranks last with 5.61; on the MCC index, CWCAR ranks 1.89, NB follows with 2.44, and DT ranks last with 4.89; on the G-mean index, CWCAR ranks 1.00, NB follows with 2.22, and DT ranks last with 5.50. This shows that, compared with the five baseline classifiers, the proposed software defect prediction method CWCAR based on the Pearson correlation weighted association classification rule has the smallest rank, and therefore the best prediction performance, on the three indexes insensitive to class imbalance.
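The mean ranks can be recomputed from the per-dataset scores. The sketch below uses the Balance values of Table 3 and assumes that the best score on each dataset receives rank 1 and that tied scores share the average of their positions.

```python
def average_ranks(scores):
    """Rank 1 = largest score; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

classifiers = ["CWCAR", "CBA", "PART", "DT", "RF", "NB"]
# Balance values per dataset, copied from Table 3 (ant-1.3 ... camel-1.6).
balance_table = [
    [0.793, 0.648, 0.439, 0.345, 0.546, 0.772],
    [0.615, 0.528, 0.409, 0.331, 0.474, 0.486],
    [0.726, 0.523, 0.390, 0.321, 0.494, 0.705],
    [0.774, 0.630, 0.662, 0.611, 0.660, 0.769],
    [0.752, 0.580, 0.645, 0.635, 0.639, 0.722],
    [0.586, 0.302, 0.293, 0.293, 0.298, 0.423],
    [0.537, 0.431, 0.465, 0.351, 0.511, 0.508],
    [0.649, 0.345, 0.370, 0.380, 0.401, 0.602],
    [0.589, 0.341, 0.349, 0.324, 0.391, 0.511],
]
per_dataset = [average_ranks(row) for row in balance_table]
mean_rank = {
    name: round(sum(r[i] for r in per_dataset) / len(per_dataset), 2)
    for i, name in enumerate(classifiers)
}
print(mean_rank)
```

Under these assumptions the computed mean ranks agree with the Rank row of Table 3.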

(5) To verify whether the proposed algorithm differs significantly from the other algorithms, the invention adopts the nonparametric Wilcoxon signed-rank test at a significance level of 0.05; the test does not require the samples to follow a normal distribution. According to the comparison results on the three indexes, the p-values on the Balance and G-mean indexes are all less than 0.05, so the proposed CWCAR algorithm is significantly better than every baseline classifier on these two indexes. On the MCC index, CWCAR is significantly better than the other classifiers (p-value less than 0.05) except naive Bayes, from which it does not differ significantly (p-value 0.086 > 0.05). This shows that, compared with the five baseline classifiers, the proposed software defect prediction method CWCAR based on the Pearson correlation weighted association classification rule is significantly better than the four baseline classifiers other than naive Bayes on the three indexes insensitive to class imbalance.
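A paired Wilcoxon signed-rank comparison can be sketched with the standard library as follows. The sketch assumes a two-sided test with the normal approximation and no tie or continuity correction; the text does not state which variant was used. The inputs are the CWCAR and CBA Balance columns of Table 3.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation, no corrections).
    Returns the statistic W = min(W+, W-) and the approximate p-value."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:  # average ranks over ties in |difference|
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p

cwcar = [0.793, 0.615, 0.726, 0.774, 0.752, 0.586, 0.537, 0.649, 0.589]
cba = [0.648, 0.528, 0.523, 0.630, 0.580, 0.302, 0.431, 0.345, 0.341]
w, p = wilcoxon_signed_rank(cwcar, cba)
print(w, round(p, 3))
```

Because CWCAR wins on all 9 datasets, the statistic is 0 and the approximate p-value rounds to the 0.008 listed in Table 3.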

In conclusion, the software defect prediction method based on the Pearson correlation weighted association classification rule has better prediction performance.

The invention provides a software defect prediction method based on the Pearson correlation weighted association classification rule. By constructing an association-classification-rule-based software defect prediction framework oriented to imbalanced data, it provides an automatic feature weighting method that is insensitive to imbalanced data and combines this method with the generation, ranking, pruning, and prediction processes of association classification rules to form a valuable, high-performance, and understandable rule model. The model reveals the association between defect proneness and features, improves the performance and understandability of the software defect prediction model, and improves the accuracy of the prediction result.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A software defect prediction method based on the Pearson correlation weighted association classification rule, characterized by comprising the following steps:

2. The prediction method of claim 1, wherein the step of constructing the software defect prediction model based on the Pearson correlation weighted association classification rule comprises:

3. The prediction method of claim 2, wherein the constructing step further comprises:

4. The prediction method according to claim 2, wherein the step S31 includes:

5. The prediction method according to claim 4, wherein the step S312 comprises:

6. The prediction method according to claim 4, wherein the step S32 includes:

weight(itemset)＝weight(X₁)*weight(X₂)…weight(X_k)

＝weight(item₁)*weight(item₂)…weight(item_k) (3)；

wsupp(r)＝weight(itemset)*support(itemset) (4)；

and S325, respectively mining the weighted candidate item meeting the minimum support degree of the defect tendency class and the weighted candidate item meeting the minimum support degree of the defect-free tendency class.

7. The prediction method according to claim 2, wherein the step S33 includes:

8. The prediction method of claim 2, wherein the conflict rule pruning in step S34 is: when the antecedents of two rules have the same features and their consequents belong to different classes, both rules are removed;

9. The prediction method according to claim 2, wherein the step S35 includes:

10. The prediction method according to claim 3, wherein the step S36 includes:

the evaluation indexes G-mean, MCC, and Balance are defined as follows: