CN112527670B - Method for predicting software aging defects in project based on Active Learning - Google Patents

Method for predicting software aging defects in project based on Active Learning

Info

Publication number
CN112527670B
CN112527670B CN202011511241.4A CN202011511241A CN112527670B CN 112527670 B CN112527670 B CN 112527670B CN 202011511241 A CN202011511241 A CN 202011511241A CN 112527670 B CN112527670 B CN 112527670B
Authority
CN
China
Prior art keywords
samples
active learning
sample
software
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011511241.4A
Other languages
Chinese (zh)
Other versions
CN112527670A (en)
Inventor
向剑文
梁梦婷
李滴萌
赵冬冬
胡文华
李琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT (2020-12-18)
Priority to CN202011511241.4A (2020-12-18)
Publication of CN112527670A (2021-03-19)
Application granted
Publication of CN112527670B (2022-06-03)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668 Testing of software
    • G06F11/3672 Test management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an Active Learning-based intra-project software aging defect prediction method. Static code metrics are collected from the software, Active Learning selects samples to be manually labeled to form a training set, and the remaining unlabeled samples are predicted. The class imbalance problem is alleviated by combining oversampling and undersampling, and prediction is performed with a machine learning classifier. The invention takes into account that software aging defect datasets contain few samples and are time-consuming and laborious to collect, and combines undersampling with oversampling to alleviate the extreme class imbalance. This helps developers find and remove software aging-related defects during development and testing, avoiding the losses caused by software aging. The feasibility of the method has been verified on real software, and the method can be extended to other software to predict software aging-related defects.

Description

Method for predicting software aging defects in project based on Active Learning
Technical Field
The invention belongs to the technical field of software aging prediction, and particularly relates to an Active Learning-based method for predicting software aging defects in projects.
Background
In long-running operating systems, software aging is a major cause of performance degradation and software crashes. It is caused by Aging-Related Bugs (ARBs), such as memory leaks, unreleased file locks, and storage problems, and has been found in a variety of systems, including Android, Linux, and Windows. The complexity and time-dependent nature of software aging make it difficult to detect. Predicting and removing aging-related defects at the code level, during development and testing, is therefore one of the important ways to reduce the losses caused by software aging.
In recent years, aging defect prediction has attracted attention from researchers in the reliability field. Some scholars train models on static code features (such as the number of lines of code and the number of comments) with machine learning methods to predict aging defects within a project. However, aging defects are rare: in the Linux aging defect dataset, for example, they account for only 0.59%, so it is very difficult to collect enough training data within a project to build a model.
To address the shortage of software aging training data, researchers have proposed cross-project software aging defect prediction, which mainly reduces the distribution difference between projects through transfer learning and then handles the class imbalance problem. Although the amount of data is then sufficient, the differences between projects are substantial, so cross-project prediction performance differs from intra-project prediction performance. Moreover, previous studies handled the extreme class imbalance with oversampling alone or undersampling alone, which easily causes overfitting, and the results are not robust across machine learning classifiers; that is, prediction performance varies widely from one classifier to another.
Disclosure of Invention
To overcome the shortcomings of the background art, the invention provides an Active Learning-based intra-project software aging defect prediction method.
To solve the above technical problems, the invention adopts the following technical solution:
An Active Learning-based intra-project software aging defect prediction method comprises the following steps:
Step 1: using Active Learning, select representative, information-rich first-type samples from the unlabeled samples in the project;
Step 2: add the selected first-type samples to the existing labeled samples to form a training set;
Step 3: handle the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learn classification features;
Step 4: on the data processed in step 3, train a prediction model with a machine learning method and predict aging defects on the test set.
Preferably, in step 1, the Active Learning by Querying Informative and Representative Examples (QUIRE) method from the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.
Preferably, in step 2, the training set formed from the selected samples and the existing labeled samples comprises the initial labeled samples of the whole project and the representative samples selected by Active Learning.
Preferably, SMOTE + ENN sampling is adopted: the minority class is oversampled and the majority class is undersampled to form the final training set.
Preferably, in the step of training the prediction model, the cross-project prediction task is executed with a machine learning method; six classifiers are used: decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; the optimal parameters of each classifier are determined by ten-fold cross-validation.
The beneficial effects of the invention are as follows: to address the small amount of aging defect data, a new Active Learning-based method for predicting software aging defects within a project is provided. The method addresses both the shortage of training samples and the severe class imbalance in software aging defect prediction, is more robust, reduces the time and labor needed to collect aging datasets while still achieving good results, and helps avoid the losses caused by software aging.
Drawings
Fig. 1 is a schematic flow chart of a method for predicting an aging defect of software in a project based on Active Learning according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides an Active Learning-based intra-project software aging defect prediction method; Fig. 1 shows a flow chart of the intra-project aging defect prediction process of an embodiment of the invention. First, Active Learning selects samples to be labeled, and these are combined with the existing labeled samples to form a training set. Then, in view of the characteristics of aging datasets, the oversampling method SMOTE and the undersampling method ENN are used together to address the severe class imbalance. Finally, a machine learning classifier classifies the target project and outputs the prediction result.
The method comprises the following steps:
Step 1: for the unlabeled samples in the project, select representative, information-rich first-type samples using Active Learning.
Using the Active Learning by Querying Informative and Representative Examples (QUIRE) method proposed in the Active Learning field, informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.
First, let f* denote the classification model trained on the labeled samples:
Figure GDA0003559717790000031
where H is a reproducing kernel Hilbert space endowed with a kernel function and l(x) is a loss function. The margin-based Active Learning method selects the unlabeled sample closest to the decision boundary, namely:
Figure GDA0003559717790000032
This margin-based query selection can be connected to the min-max formulation of active learning:
Figure GDA0003559717790000041
where
Figure GDA0003559717790000042
In the min-max view of active learning, this guarantees that the selected instance x_s yields a small value of the objective function regardless of its class label y_s. To select queries that are both informative and representative, the evaluation function L(D_l, x_s) is extended to include all the unlabeled data. Supposing that the class assignment y_u of the unselected unlabeled instances in D_u is known, the evaluation function can be modified to:
Figure GDA0003559717790000043
The x_s satisfying the above formula is obtained as follows.
For computational simplicity, let
Figure GDA0003559717790000044
The above formula then becomes:
Figure GDA0003559717790000045
where L = (K + λI)^(-1) and K is the kernel matrix. Thus
Figure GDA0003559717790000046
This can be simplified to:
Figure GDA0003559717790000047
To compute the above quantity efficiently for each unlabeled instance, the following notation is used for ease of reference: for a matrix M, subscript u denotes the rows/columns of M for the unlabeled instances in D_u, subscript l denotes the rows/columns of M for the labeled instances, and subscript s denotes the row/column of M for the selected instance. Subscript a denotes the rows/columns of M for all unlabeled instances, i.e. D_u ∪ {x_s}. With these conventions, the objective function is rewritten as:
Figure GDA0003559717790000048
Let
Figure GDA0003559717790000051
We then obtain
Figure GDA0003559717790000052
The last step follows from the relation:
Figure GDA0003559717790000053
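The equation images are not reproduced in this text, so the following is only a minimal numerical sketch of the min-max evaluation described above, not the patented implementation. It assumes a squared loss, so that the objective reduces to y^T L y with L = (K + λI)^(-1), and it relaxes the unknown labels y_u to unconstrained real values so that the inner minimization has a closed form. The RBF kernel, the parameters `lam` and `gamma`, and all function names are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF kernel matrix (the kernel choice is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))


def minmax_scores(X, labeled_idx, y_labeled, lam=1.0, gamma=1.0):
    """For each unlabeled s: max over y_s in {-1, +1} of min over relaxed y_u of y^T L y."""
    n = X.shape[0]
    L = np.linalg.inv(rbf_kernel(X, gamma) + lam * np.eye(n))  # L = (K + lam*I)^-1
    lab = np.asarray(labeled_idx)
    y_l = np.asarray(y_labeled, dtype=float)
    a = np.setdiff1d(np.arange(n), lab)          # subscript a: all unlabeled instances
    const = y_l @ L[np.ix_(lab, lab)] @ y_l      # term that does not depend on s
    scores = {}
    for s in a:
        u = a[a != s]                            # subscript u: unlabeled instances other than s
        L_uu = L[np.ix_(u, u)]
        L_ul_yl = L[np.ix_(u, lab)] @ y_l
        L_sl_yl = L[s, lab] @ y_l
        worst = -np.inf
        for y_s in (-1.0, 1.0):
            b = L_ul_yl + L[u, s] * y_s
            value = const + L[s, s] + 2.0 * y_s * L_sl_yl
            if u.size:
                # Minimizer is y_u = -L_uu^-1 b, which contributes -b^T L_uu^-1 b.
                value -= b @ np.linalg.solve(L_uu, b)
            worst = max(worst, value)
        scores[int(s)] = float(worst)
    return scores


def select_query(scores):
    """Pick the unlabeled index with the smallest worst-case objective."""
    return min(scores, key=scores.get)
```

Calling `select_query(minmax_scores(X, labeled_idx, y_labeled))` returns the index of the next sample to label. The sketch solves one linear system per candidate and label, which is adequate for small datasets; it does not use the block-matrix shortcut that the final relation above is introduced for.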
Step 2: add the selected first-type samples to the existing labeled samples to form a training set.
The training set formed from the selected samples and the existing labeled samples comprises the initial labeled samples of the whole project and the representative samples selected by Active Learning:
Data_train = Data_labeled ∪ Data_selected
where Data_train denotes the training set, Data_selected denotes the representative samples selected by Active Learning, and Data_labeled denotes the initial labeled samples of the project.
Step 3: handle the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learn classification features.
The class imbalance faced by software aging defect prediction is very severe; for example, in the Linux dataset commonly used for aging defect prediction, aging defects account for only 0.59%. The class imbalance problem therefore needs to be addressed. In this step, SMOTE first oversamples the minority class, and ENN then undersamples the majority class.
The basic idea of the SMOTE algorithm is to analyze the minority-class samples and synthesize new samples from them to add to the dataset. The algorithm proceeds as follows.
(1) For each sample x in the minority class, compute the Euclidean distance from x to every sample in the minority-class sample set to obtain its k nearest neighbors.
(2) Set a sampling ratio according to the sample imbalance ratio to determine the sampling rate N; for each minority-class sample x, randomly select several samples from its k nearest neighbors, and denote a selected neighbor by x_n.
(3) For each randomly selected neighbor x_n, construct a new sample from it and the original sample x according to the following formula:
Figure GDA0003559717790000061
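The three SMOTE steps above can be sketched as follows. The equation image is not reproduced in this text; the sketch assumes the standard SMOTE interpolation x_new = x + rand(0, 1) * (x_n - x). The function name `smote`, the parameter `n_new_per_sample` (standing in for the sampling rate N), and the default values are illustrative assumptions.

```python
import numpy as np


def smote(X_min, n_new_per_sample=2, k=5, seed=0):
    """Synthesize minority-class samples by interpolating toward random nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # a sample cannot have more neighbors than other samples exist

    # (1) Euclidean distances within the minority class; a sample is not its own neighbor.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = []
    for i in range(n):
        # (2) Draw neighbors at random (with replacement) from the k nearest.
        chosen = rng.choice(neighbors[i], size=n_new_per_sample, replace=True)
        for j in chosen:
            # (3) Place the new sample at a random point on the segment from x to x_n.
            gap = rng.random()
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic sample is a convex combination of two minority samples, the new points always lie inside the region already spanned by the minority class.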
The main idea of the ENN algorithm is as follows: for each sample in the training set, find its three nearest neighbors. If the sample belongs to the majority class and two or more of its three nearest neighbors belong to the minority class, the sample is deleted; conversely, if the sample belongs to the minority class and two or more of its three nearest neighbors belong to the majority class, the majority-class samples among those neighbors are removed.
The two methods above are used together to handle the class imbalance problem.
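The ENN rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `enn_undersample`, the `minority_label` parameter, and the interpretation of the threshold as "at least two of three neighbors" are assumptions, not details confirmed by the patent.

```python
import numpy as np


def enn_undersample(X, y, minority_label=1, k=3):
    """Remove majority-class samples that conflict with their k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a sample is not its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    threshold = k // 2 + 1  # two of three neighbors when k = 3
    remove = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        nn = neighbors[i]
        nn_is_minority = y[nn] == minority_label
        if y[i] != minority_label and nn_is_minority.sum() >= threshold:
            # Majority sample surrounded by minority samples: delete the sample.
            remove[i] = True
        elif y[i] == minority_label and (~nn_is_minority).sum() >= threshold:
            # Minority sample surrounded by majority samples: delete those neighbors.
            remove[nn[~nn_is_minority]] = True

    keep = ~remove
    return X[keep], y[keep]
```

In a combined pipeline, the function would be applied to the training set after the synthetic minority samples have been added, so that only majority-class samples are ever deleted.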
Step 4: on the data processed in step 3, train a prediction model with a machine learning method and predict aging defects on the test set.
In this step, a machine learning algorithm is used to predict the target project, such as Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), or Support Vector Machine (SVM). Classifier parameters are determined by ten-fold cross-validation. All six machine learning classifiers achieve good results, with the best results obtained when NB or KNN is used as the classifier.
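Parameter selection by ten-fold cross-validation can be sketched for one of the listed classifiers, KNN. The sketch is hand-rolled with NumPy to stay self-contained; the candidate values of k, the use of F1 as the selection metric, the 0/1 label encoding, and all function names are illustrative assumptions.

```python
import numpy as np


def stratified_folds(y, n_splits=10, seed=0):
    """Assign each sample to a fold so that every class is spread across all folds."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold[idx] = np.arange(len(idx)) % n_splits
    return fold


def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training samples (labels are 0 or 1)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)


def f1_score(y_true, y_pred):
    """F1 of the positive (defect) class; 0.0 when there are no true positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))


def select_k_by_cv(X, y, candidates=(1, 3, 5, 7), n_splits=10, seed=0):
    """Return the candidate k with the highest mean F1 over the folds, plus all means."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    fold = stratified_folds(y, n_splits, seed)
    mean_f1 = {}
    for k in candidates:
        scores = []
        for f in range(n_splits):
            test = fold == f
            pred = knn_predict(X[~test], y[~test], X[test], k)
            scores.append(f1_score(y[test], pred))
        mean_f1[k] = float(np.mean(scores))
    best_k = max(mean_f1, key=mean_f1.get)
    return best_k, mean_f1
```

The same loop applies to the other classifiers by swapping `knn_predict` for the corresponding training and prediction routine and replacing `candidates` with that classifier's parameter grid.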
The invention collects static code metrics from the software, uses Active Learning to select samples to be labeled as the training set, and predicts the remaining unlabeled samples. Active Learning selects representative, information-rich samples according to a given strategy; these are manually labeled and form the training set. The severe class imbalance in software aging data is then alleviated by combining oversampling and undersampling, and prediction is finally performed with a machine learning classifier (naive Bayes, logistic regression, etc.). The invention takes into account that software aging defect datasets contain few samples and are time-consuming and laborious to collect, and combines undersampling with oversampling to alleviate the extreme class imbalance. The method addresses the problem that software aging data are scarce and difficult to collect, achieves high precision in predicting software aging defects within a project, and helps developers find and remove software aging-related defects during development and testing, avoiding the losses caused by software aging. The feasibility of the method has been verified on real software, and the method can be extended to other software to predict software aging-related defects.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. An Active Learning-based intra-project software aging defect prediction method, characterized by comprising the following steps:
step 1, using Active Learning, selecting representative, information-rich first-type samples from the unlabeled samples in the project;
step 2, adding the selected first-type samples to the existing labeled samples to form a training set;
step 3, handling the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learning classification features;
step 4, on the data processed in step 3, training a prediction model with a machine learning method and predicting aging defects on a test set;
wherein in step 1, the Active Learning by Querying Informative and Representative Examples method proposed in the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples:
first, let f* denote the classification model trained on the labeled samples:
Figure FDA0003569456600000011
where H is a reproducing kernel Hilbert space endowed with a kernel function and l(x) is a loss function, and the margin-based Active Learning method selects the unlabeled sample closest to the decision boundary, namely:
Figure FDA0003569456600000012
this margin-based query selection is connected to the min-max formulation of active learning:
Figure FDA0003569456600000013
where
Figure FDA0003569456600000021
in the min-max view of active learning, this guarantees that the selected instance x_s yields a small value of the objective function regardless of its class label y_s; to select queries that are both informative and representative, the evaluation function L(D_l, x_s) is extended to include all the unlabeled data; supposing that the class assignment y_u of the unselected unlabeled instances in D_u is known, the evaluation function is modified to:
Figure FDA0003569456600000022
the x_s satisfying the above formula is obtained as follows:
for computational simplicity, let
Figure FDA0003569456600000023
the above formula then becomes:
Figure FDA0003569456600000024
where L = (K + λI)^(-1) and K is the kernel matrix; thus
Figure FDA0003569456600000025
which can be simplified to:
Figure FDA0003569456600000026
to compute the above quantity efficiently for each unlabeled instance, the following notation is used for ease of reference: for a matrix M, subscript u denotes the rows/columns of M for the unlabeled instances in D_u, subscript l denotes the rows/columns of M for the labeled instances, and subscript s denotes the row/column of M for the selected instance; subscript a denotes the rows/columns of M for all unlabeled instances, i.e. D_u ∪ {x_s}; with these conventions, the objective function is rewritten as:
Figure FDA0003569456600000027
let
Figure FDA0003569456600000028
we then obtain
Figure FDA0003569456600000031
the last step follows from the relation:
Figure FDA0003569456600000032
2. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in the step of selecting representative, information-rich first-type samples using Active Learning, the Active Learning by Querying Informative and Representative Examples method proposed in the Active Learning field is used: informative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of the classifier trained on the unlabeled samples.
3. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in step 2, the training set formed from the selected samples and the existing labeled samples comprises the initial labeled samples of the whole project and the representative samples selected by Active Learning.
4. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein SMOTE + ENN sampling is adopted: the minority class is oversampled and the majority class is undersampled to form the final training set.
5. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in the step of training the prediction model with a machine learning method, a cross-project prediction task is executed with the machine learning method; six classifiers are used as the machine learning method: decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; and the optimal parameters of each classifier are determined by ten-fold cross-validation.
CN202011511241.4A 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning Active CN112527670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511241.4A CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011511241.4A CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Publications (2)

Publication Number Publication Date
CN112527670A (en) 2021-03-19
CN112527670B (en) 2022-06-03

Family

ID=75001726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011511241.4A Active CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Country Status (1)

Country Link
CN (1) CN112527670B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113702728A (en) * 2021-07-12 2021-11-26 广东工业大学 Transformer fault diagnosis method and system based on combined sampling and LightGBM

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782512A (en) * 2020-06-23 2020-10-16 北京高质系统科技有限公司 Multi-feature software defect comprehensive prediction method based on unbalanced noise set

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589806B (en) * 2015-12-17 2018-05-18 北京航空航天大学 A kind of software defect tendency Forecasting Methodology based on SMOTE+Boosting algorithms
CN108304316B (en) * 2017-12-25 2021-04-06 浙江工业大学 Software defect prediction method based on collaborative migration
US10929268B2 (en) * 2018-09-26 2021-02-23 Accenture Global Solutions Limited Learning based metrics prediction for software development
CN109933539A (en) * 2019-04-15 2019-06-25 燕山大学 A kind of Software Defects Predict Methods based on principal component analysis and combination sampling
CN110471856A (en) * 2019-08-21 2019-11-19 大连海事大学 A kind of Software Defects Predict Methods based on data nonbalance
CN110751186B (en) * 2019-09-26 2022-04-08 北京航空航天大学 Cross-project software defect prediction method based on supervised expression learning
CN111368924A (en) * 2020-03-05 2020-07-03 南京理工大学 Unbalanced data classification method based on active learning
CN111881023B (en) * 2020-07-10 2022-05-06 武汉理工大学 Software aging prediction method and device based on multi-model comparison
CN111881048B (en) * 2020-07-31 2022-06-03 武汉理工大学 Cross-project software aging defect prediction method

Also Published As

Publication number Publication date
CN112527670A (en) 2021-03-19

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant