CN112527670A

CN112527670A - Method for predicting software aging defects in project based on Active Learning

Info

Publication number: CN112527670A
Application number: CN202011511241.4A
Authority: CN
Inventors: 向剑文; 梁梦婷; 李滴萌; 赵冬冬; 胡文华; 李琳
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-19
Anticipated expiration: 2040-12-18
Also published as: CN112527670B

Abstract

The invention discloses an Active Learning-based method for predicting software aging defects within a project. Static code metrics are collected from the software, Active Learning selects samples to be labeled as the training set, and the remaining unlabeled samples are predicted. The samples selected by Active Learning are labeled manually to form the training set. The class imbalance problem is alleviated by combining oversampling with undersampling, and a machine learning classifier performs the prediction. The invention takes into account that software aging defect data sets contain few samples and are time-consuming and labor-intensive to collect, and it combines undersampling and oversampling to alleviate the extreme class imbalance problem. This helps developers find and remove software aging-related defects in the development and testing stage and avoids the losses caused by software aging. The feasibility of the method has been verified on real software, and the method can be extended to other software to predict software aging-related defects.

Description

Method for predicting software aging defects in project based on Active Learning

Technical Field

The invention belongs to the technical field of software aging prediction, and particularly relates to an Active Learning-based method for predicting software aging defects in projects.

Background

In long-running operating systems, software aging is a major cause of performance degradation and software crashes. It is caused by Aging-Related Bugs (ARBs), such as memory leaks, unreleased file locks, and storage problems, and it has been found in a variety of systems, such as Android, Linux, and Windows. The complexity and time-dependent nature of software aging make it difficult to detect. Predicting and removing software aging-related defects in the development and testing stage (at the code level) is therefore one of the important ways to reduce the losses caused by software aging.

In recent years, aging defect prediction has attracted attention from researchers in the reliability field. Some scholars train models to predict aging defects within a project from static code features (such as the number of lines of code and the number of comments) using methods such as machine learning. However, aging defects are rare: in the Linux aging defect data set, for example, they account for only 0.59% of the samples. It is therefore very difficult to collect enough training data within a project to build a model.

To address the shortage of software aging training data, researchers have proposed cross-project software aging defect prediction. The main approach is to reduce the distribution difference between projects through transfer learning and then handle the class imbalance problem before predicting aging defects across projects. Although the amount of data is then sufficient, the differences between projects are considerable, so cross-project prediction performance differs from in-project prediction performance. Moreover, previous research handled the extremely severe class imbalance with oversampling alone or undersampling alone, which easily causes overfitting, and the results are not robust across machine learning classifiers, that is, the prediction performance varies greatly from one classifier to another.

Disclosure of Invention

In order to overcome the defects of the background art, the invention provides an Active Learning-based intra-project software aging defect prediction method.

In order to solve the technical problems, the invention adopts the technical scheme that:

an Active Learning-based intra-project software aging defect prediction method comprises the following steps:

step 1, using Active Learning to select representative and informative samples from the unlabeled samples in the project;

step 2, combining the selected samples with the existing labeled samples to form a training set;

step 3, handling the class imbalance problem in the training set by jointly applying the oversampling method SMOTE and the undersampling method ENN, and learning the classification features;

step 4, training a prediction model on the data processed in step 3 using a machine learning method, and predicting the aging defects on the test set.

Preferably, in the step of selecting representative and informative samples by Active Learning, the Active Learning by Querying Informative and Representative Examples (QUIRE) method proposed in the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.

Preferably, in the step 2, the selected samples and the existing labeled samples are used to form a training set, and the training set includes the initial labeled samples of the whole project and representative samples selected by Active Learning.

Preferably, a sampling mode of SMOTE + ENN is adopted, the minority class is oversampled, and the majority class is undersampled to form a final training set.

Preferably, in the step of training the prediction model by the machine learning method, the cross-project prediction task is executed by the machine learning method; six classifiers are used as the machine learning method: decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; the optimal parameters of each classifier are determined by ten-fold cross-validation.

The beneficial effects of the invention are as follows: to address the small amount of aging defect data, a new Active Learning-based method for predicting software aging defects within a project is provided. The method addresses the shortage of training samples and the severe class imbalance in software aging defect prediction, is more robust, reduces the time and labor needed to collect aging data sets while still achieving good results, and helps avoid the losses caused by software aging.

Drawings

Fig. 1 is a schematic flow chart of a method for predicting an aging defect of software in a project based on Active Learning according to the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The invention provides an Active Learning-based in-project software aging defect prediction method; fig. 1 shows a flow diagram of the in-project aging defect prediction process of this embodiment. First, Active Learning selects representative and informative samples, which are labeled and added to the training set. Then, according to the characteristics of the aging data set, the oversampling method SMOTE and the undersampling method ENN are used jointly to handle the severe class imbalance problem. Finally, a machine learning classifier classifies the target project and outputs the prediction result.

The method comprises the following steps:

step 1, for the unlabeled samples in the project, Active Learning is used to select representative and informative samples.

The Active Learning by Querying Informative and Representative Examples (QUIRE) method proposed in the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.

First, let f^* denote the classification model trained on the labeled samples:

where H is a reproducing kernel Hilbert space with a kernel function and l(·) is a loss function. The margin-based Active Learning method selects the unlabeled sample closest to the decision boundary, namely:

Margin-based query selection is connected with the min-max formulation of active learning:

where

In the min-max view of active learning, this guarantees that the selected instance x_s will result in a small value of the objective function regardless of its class label y_s. To select queries that are both informative and representative, the evaluation function L(D_l, x_s) is extended to include all the unlabeled data. Suppose the class assignment y_u of the unselected unlabeled instances in D_u is known; the evaluation function can then be modified to:

The x_s that satisfies the above formula is obtained as follows.

For simplicity of calculation, let

The above formula becomes:

where L = (K + λI)^-1 and K is the kernel matrix; thus

which can be simplified to:
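The formulas this passage refers to can be sketched as follows. This is a hedged reconstruction written from the definitions in the text (a regularized kernel classifier, margin-based selection, the min-max evaluation function, and L = (K + λI)^-1), not a verbatim copy of the original formulas. The sample counts n_l (labeled), n_u (unlabeled, including the candidate x_s) and n = n_l + n_u, the regularization weight λ, the quadratic form of the loss, and the relaxation of y_u to unconstrained real values are all assumptions.

```latex
% Classifier trained on the n_l labeled samples (regularized risk in the RKHS H)
f^{*} = \arg\min_{f \in \mathcal{H}} \frac{\lambda}{2}\,\|f\|_{\mathcal{H}}^{2}
        + \sum_{i=1}^{n_l} \ell\bigl(y_i, f(x_i)\bigr)

% Margin-based selection: the unlabeled sample closest to the decision boundary
s^{*} = \arg\min_{n_l < s \le n} \bigl| f^{*}(x_s) \bigr|

% Min-max evaluation of a candidate x_s whose label y_s is unknown
\mathcal{L}(D_l, x_s) = \max_{y_s = \pm 1}\; \min_{f \in \mathcal{H}}\;
    \frac{\lambda}{2}\,\|f\|_{\mathcal{H}}^{2}
    + \sum_{i=1}^{n_l} \ell\bigl(y_i, f(x_i)\bigr) + \ell\bigl(y_s, f(x_s)\bigr)

% Extension to all unlabeled data; y_u are the labels of the unselected instances in D_u
\mathcal{L}(D_l, D_u, x_s) = \min_{y_u \in \{\pm 1\}^{n_u - 1}}\; \max_{y_s = \pm 1}\;
    \min_{f \in \mathcal{H}}\; \frac{\lambda}{2}\,\|f\|_{\mathcal{H}}^{2}
    + \sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr)

% Assumed quadratic loss
\ell(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^{2}

% With this loss, the inner minimum over f has a closed form
\min_{f \in \mathcal{H}}\; \frac{\lambda}{2}\,\|f\|_{\mathcal{H}}^{2}
    + \sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr)
    = \frac{\lambda}{2}\, y^{\top} L\, y, \qquad L = (K + \lambda I)^{-1}

% Dropping the constant factor and relaxing y_u to real values
\widehat{\mathcal{L}}(D_l, D_u, x_s) = \min_{y_u \in \mathbb{R}^{n_u - 1}}\; \max_{y_s = \pm 1}\; y^{\top} L\, y
```

The constant factor λ/2 does not change which candidate minimizes the evaluation function, which is why it can be dropped in the last line.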

To calculate the above quantity efficiently for each unlabeled instance, the following conventions are used for ease of reference: the subscript u denotes the rows/columns of the matrix M for the unlabeled instances in D_u, the subscript l denotes the rows/columns of M for the labeled instances, and the subscript s denotes the row/column of M for the selected instance. The subscript a denotes the rows/columns of M for all unlabeled instances (i.e., D_u ∪ {x_s}). With these conventions, the objective function is rewritten as

Let

which gives

The last step follows from the following condition:
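The block computation described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the patented implementation: it assumes that the matrix M in the text is L = (K + λI)^-1, that the kernel is an RBF kernel, that labels are in {-1, +1}, and that y_u is relaxed to unconstrained real values, so that minimizing the block expansion of y^T L y over y_u gives y_u = -L_uu^-1 (L_ul y_l + L_us y_s). The names `rbf_kernel`, `quire_scores`, `lam` and `gamma` are hypothetical.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def quire_scores(X, y_l, labeled_idx, lam=1.0, gamma=1.0):
    """Score every unlabeled candidate s by max over y_s of min over y_u of y^T L y.

    X           : (n, d) array with all samples of the project
    y_l         : labels in {-1, +1}, ordered like labeled_idx
    labeled_idx : indices of the labeled samples in X
    Returns a dict {candidate index: score}; a smaller score marks a better query.
    """
    n = X.shape[0]
    y_l = np.asarray(y_l, dtype=float)
    labeled_idx = np.asarray(labeled_idx)
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)

    # Assumption: the matrix of the derivation is L = (K + lam * I)^-1.
    L = np.linalg.inv(rbf_kernel(X, gamma) + lam * np.eye(n))
    const = y_l @ L[np.ix_(labeled_idx, labeled_idx)] @ y_l  # y_l^T L_ll y_l

    scores = {}
    for s in unlabeled:
        u = unlabeled[unlabeled != s]  # unselected unlabeled instances D_u
        worst = -np.inf
        for y_s in (-1.0, 1.0):
            # Terms that do not involve y_u (y_s^2 = 1).
            val = const + L[s, s] + 2.0 * y_s * (L[s, labeled_idx] @ y_l)
            if u.size:
                # Minimizing over y_u subtracts b^T L_uu^-1 b,
                # with b = L_ul y_l + L_us y_s.
                b = L[np.ix_(u, labeled_idx)] @ y_l + L[u, s] * y_s
                val -= b @ np.linalg.solve(L[np.ix_(u, u)], b)
            worst = max(worst, val)
        scores[int(s)] = float(worst)
    return scores
```

Under these assumptions, the candidate with the smallest score, `min(scores, key=scores.get)`, would be the next sample sent for manual labeling.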

step 2, the selected samples and the existing labeled samples form a training set; the training set comprises the initial labeled samples of the whole project and the representative samples selected by Active Learning.

Data_train = Data_labeled ∪ Data_selected

where Data_train denotes the training set, Data_selected denotes the representative samples selected by Active Learning, and Data_labeled denotes the initial labeled samples of the project.

The class imbalance problem faced by software aging defect prediction is very severe: for example, in the Linux data set commonly used for aging defect prediction, aging defects account for only 0.59% of the samples. The class imbalance problem therefore needs to be addressed. In step 3, SMOTE is first used to oversample the minority class, and ENN is then used to undersample the majority class.

The basic idea of the SMOTE algorithm is to analyze the minority-class samples and synthesize new samples from them, which are then added to the data set. The algorithm proceeds as follows.

(1) For each sample x in the minority class, the Euclidean distance from x to every other sample in the minority-class sample set is calculated to obtain the k nearest neighbors of x.

(2) A sampling ratio is set according to the sample imbalance ratio to determine the sampling rate N, and for each minority-class sample x several samples are randomly selected from its k nearest neighbors; a selected neighbor is denoted x_n.

(3) For each randomly selected neighbor x_n, a new sample is constructed from x_n and the original sample x according to the following formula:

x_new = x + rand(0, 1) × (x_n - x)
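The three SMOTE steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not the implementation used in the invention: the function name `smote`, the parameters `n_new_per_sample` (standing in for the sampling rate N), `k` and `seed`, and the default k = 5 are hypothetical choices.

```python
import math
import random


def smote(minority, n_new_per_sample=1, k=5, seed=0):
    """Synthesize new minority-class samples by interpolating toward neighbors.

    minority         : list of feature vectors (lists of floats) of the minority class
    n_new_per_sample : synthetic samples generated per original sample (rate N)
    k                : number of nearest minority neighbors considered
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step (1): k nearest minority neighbors of x by Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        if not neighbours:
            continue  # a single minority sample has no neighbor to interpolate with
        for _ in range(n_new_per_sample):
            # Step (2): pick one of the k neighbors at random.
            x_n = rng.choice(neighbours)
            # Step (3): x_new = x + rand(0, 1) * (x_n - x).
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic
```

Every synthetic sample lies on the line segment between an original minority sample and one of its neighbors, so the new points stay inside the region already occupied by the minority class.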

The main idea of the ENN algorithm is as follows: for each sample in the training set, its three nearest neighbors are found. If the sample belongs to the majority class and two or more of its three nearest neighbors belong to the minority class, the sample is deleted; conversely, if the sample belongs to the minority class and two or more of its three nearest neighbors belong to the majority class, the majority-class samples among those neighbors are removed.
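The ENN rule described above can be sketched in plain Python. This is an illustrative sketch under assumptions: the function name `enn_clean` and the parameters `majority_label` and `k` are hypothetical, and the threshold is read as "two or more of the three nearest neighbors", that is, a simple majority of the neighbors.

```python
import math


def enn_clean(X, y, majority_label=0, k=3):
    """Remove majority-class samples that disagree with their neighborhood.

    X : list of feature vectors, y : list of class labels (same order).
    Returns the cleaned (X, y); minority-class samples are never removed.
    """
    threshold = k // 2 + 1  # two or more when k = 3 (assumed reading of the rule)
    to_remove = set()
    for i, x in enumerate(X):
        # k nearest neighbors of sample i, excluding the sample itself.
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(x, X[j]))
        neighbours = order[:k]
        n_other = sum(1 for j in neighbours if y[j] != y[i])
        if n_other < threshold:
            continue
        if y[i] == majority_label:
            # Majority sample surrounded by minority samples: delete it.
            to_remove.add(i)
        else:
            # Minority sample surrounded by majority samples:
            # delete the majority-class neighbors instead.
            to_remove.update(j for j in neighbours if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in to_remove]
    return [X[i] for i in keep], [y[i] for i in keep]
```

In a combined SMOTE + ENN pipeline, a function like this would be applied to the training set after the synthetic minority samples have been added.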

The class imbalance problem is handled by combining the two methods above.

In step 4, a machine learning algorithm is used to predict the target project, for example Naive Bayes (NB), Logistic Regression (LR), K-nearest neighbors (KNN), Decision Trees (DT), Random Forests (RF), or Support Vector Machines (SVM). Classifier parameters are determined by ten-fold cross-validation. All six machine learning classifiers achieve good results, with the best results obtained when NB and KNN are used as the classifiers.
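Choosing a classifier parameter by ten-fold cross-validation can be sketched in plain Python. This is a minimal illustration under assumptions: a small K-nearest-neighbors classifier stands in for the six classifiers, accuracy is used as the selection metric (the text does not name one), the folds are stratified by class, and the names `stratified_folds`, `knn_predict`, `cv_accuracy` and `select_k` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(y, n_folds=10, seed=0):
    """Split sample indices into n_folds folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train_X)),
                     key=lambda j: math.dist(x, train_X[j]))[:k]
    return Counter(train_y[j] for j in nearest).most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=10, seed=0):
    """Cross-validated accuracy of a k-NN classifier with the given k."""
    correct = 0
    for fold in stratified_folds(y, n_folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
    return correct / len(X)


def select_k(X, y, candidates=(1, 3, 5), n_folds=10, seed=0):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(X, y, k, n_folds, seed))
```

For a data set as imbalanced as the one described here, a metric that accounts for the minority class would likely be a better selection criterion than plain accuracy; accuracy is used only to keep the sketch short.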

According to the invention, static code metrics are collected from the software, Active Learning selects samples to be labeled as the training set, and the remaining unlabeled samples are predicted. Active Learning selects representative and informative samples according to a given strategy; these samples are labeled manually and form the training set. The severe class imbalance problem in software aging data is then alleviated by combining oversampling with undersampling, and finally a machine learning classifier (naive Bayes, logistic regression, and so on) performs the prediction. The invention takes into account that software aging defect data sets contain few samples and are time-consuming and labor-intensive to collect, and it combines undersampling with oversampling to alleviate the extremely severe class imbalance problem. The method addresses the scarcity of software aging data, the difficulty of collecting it, and the prediction accuracy of in-project software aging defect prediction; it helps developers find and remove software aging-related defects in the development and testing stage and avoids the losses caused by software aging. The feasibility of the method has been verified on real software, and the method can be extended to other software to predict software aging-related defects.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An intra-project software aging defect prediction method based on Active Learning is characterized by comprising the following steps:

2. The method for predicting software aging defects in a project based on Active Learning as claimed in claim 1, wherein in the step of selecting representative and informative samples by Active Learning, the Active Learning by Querying Informative and Representative Examples (QUIRE) method proposed in the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.

3. The method for predicting software aging defects in a project based on Active Learning as claimed in claim 1, wherein in step 2 the selected samples and the existing labeled samples form a training set, and the training set comprises the initial labeled samples of the whole project and the representative samples selected by Active Learning.

4. The method for predicting software aging defects in a project based on Active Learning as claimed in claim 1, wherein a SMOTE + ENN sampling mode is adopted, the minority class is oversampled, and the majority class is undersampled to form the final training set.

5. The method for predicting software aging defects in a project based on Active Learning as claimed in claim 1, wherein in the step of training the prediction model by the machine learning method, the cross-project prediction task is executed by the machine learning method; six classifiers are used as the machine learning method: decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; and the optimal parameters of each classifier are determined by ten-fold cross-validation.