CN112527670A - Method for predicting software aging defects in project based on Active Learning - Google Patents

Publication number
CN112527670A
Authority
CN
China
Prior art keywords
samples
software
active learning
project
aging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011511241.4A
Other languages
Chinese (zh)
Other versions
CN112527670B (en)
Inventor
向剑文
梁梦婷
李滴萌
赵冬冬
胡文华
李琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202011511241.4A
Publication of CN112527670A
Application granted
Publication of CN112527670B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an Active Learning-based method for predicting software aging defects within a project. Static code metrics are collected from the software; Active Learning selects samples, which are manually labeled to form the training set; and the remaining samples without class labels are then predicted. The class imbalance problem is relieved by combining oversampling with undersampling, and prediction is performed with a machine learning classifier. The invention takes into account that software aging defect data sets contain few samples, are time-consuming and labor-intensive to collect, and are extremely imbalanced. It helps developers find and remove software aging-related defects during development and testing, and avoids the losses caused by software aging. The feasibility of the method has been verified on real software, and it can be extended to other software to predict software aging-related defects.

Description

Method for predicting software aging defects in project based on Active Learning
Technical Field
The invention belongs to the technical field of software aging prediction, and particularly relates to an Active Learning-based method for predicting software aging defects in projects.
Background
In long-running operating systems, software aging is a major cause of performance degradation and software crashes. It is caused by Aging-Related Bugs (ARBs), such as memory leaks, unreleased file locks, and storage problems, and it has been observed in a variety of systems, including Android, Linux, and Windows. The complexity and time-dependent nature of software aging make it difficult to detect. Predicting and removing aging-related defects at the code level, during development and testing, is therefore one of the important ways to reduce the losses caused by software aging.
In recent years, aging defect prediction has attracted attention from researchers in the reliability field. Some scholars train models to predict aging defects within a project from static code features (such as the number of lines of code and the number of comments) using machine learning and similar methods. However, aging defects are rare: in the Linux aging defect data set, for example, they account for only 0.59% of the samples. It is therefore very difficult to collect enough training data within a project to build a model.
To address the shortage of software aging training data, researchers have proposed cross-project software aging defect prediction. The main approach reduces the distribution difference between projects through transfer learning and then handles the class imbalance problem before predicting aging defects across projects. Although the amount of data is then sufficient, the differences between projects are considerable, so a gap remains between cross-project and in-project prediction performance. Moreover, previous studies handled the extremely severe class imbalance with a single oversampling or undersampling method, which easily causes overfitting and is not robust across machine learning classifiers; that is, the prediction results differ greatly from one classifier to another.
Disclosure of Invention
To overcome the shortcomings described in the background art, the invention provides an Active Learning-based intra-project software aging defect prediction method.
To solve the above technical problems, the invention adopts the following technical scheme:
An Active Learning-based intra-project software aging defect prediction method comprises the following steps:
step 1, from the samples without class labels in the project, selecting first-type samples that are representative and informative by using Active Learning;
step 2, combining the selected first-type samples with the existing labeled samples to form a training set;
step 3, handling the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learning the classification features;
step 4, using the data processed in step 3, training a prediction model with a machine learning method and predicting the aging defects on the test set.
Preferably, in the step of selecting representative and informative first-type samples by Active Learning, the QUIRE method (Active Learning by Querying Informative and Representative Examples) proposed in the Active Learning field is used. Informative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the unlabeled samples.
Preferably, in step 2, the selected samples and the existing labeled samples form the training set, which comprises the initially labeled samples of the whole project and the representative samples selected by Active Learning.
Preferably, the SMOTE + ENN sampling scheme is adopted: the minority class is oversampled and the majority class is undersampled to form the final training set.
Preferably, in the step of training the prediction model, the cross-project prediction task is performed with a machine learning method; six classifiers are used, namely decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; and the optimal classifier parameters are selected by ten-fold cross-validation.
The beneficial effects of the invention are as follows. To address the small amount of aging defect data, a new Active Learning-based method for predicting software aging defects within a project is provided. The method addresses both the shortage of training samples and the severe class imbalance in software aging defect prediction, is more robust, reduces the time and labor needed to collect aging data sets while still achieving good results, and helps avoid the losses caused by software aging.
Drawings
Fig. 1 is a schematic flow chart of a method for predicting an aging defect of software in a project based on Active Learning according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides an Active Learning-based in-project software aging defect prediction method. A flow diagram of the in-project aging defect prediction process of this embodiment is shown in Fig. 1. First, Active Learning selects the samples to be labeled, which form the training set together with the existing labeled samples. Then, given the characteristics of the aging data set, the oversampling method SMOTE and the undersampling method ENN are used jointly to handle the severe class imbalance. Finally, a machine learning classifier classifies the target project and outputs the prediction result.
The method comprises the following steps:
Step 1: from the samples without class labels in the project, select first-type samples that are representative and informative by using Active Learning.
The QUIRE method (Active Learning by Querying Informative and Representative Examples) proposed in the Active Learning field is used. Informative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the unlabeled samples.
First, let $f^*$ denote the classification model trained on the labeled samples:
Figure BDA0002846467880000031
where $H$ is a reproducing kernel Hilbert space endowed with a kernel function and $\ell(\cdot)$ is the loss function. The margin-based Active Learning method selects the unlabeled sample closest to the decision boundary, namely:
Figure BDA0002846467880000032
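The margin-based selection rule can be illustrated with a small Python sketch using NumPy. It is not taken from the patent: it assumes a squared loss and an RBF kernel, in which case the regularized model has the closed form $\alpha = (K_{ll} + \lambda I)^{-1} y_l$, and it then picks the unlabeled sample whose prediction $|f^*(x)|$ is smallest. The names `rbf_kernel`, `margin_query`, `lam`, and `gamma` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def margin_query(X_l, y_l, X_u, lam=1.0, gamma=1.0):
    """Return (index into X_u of the sample closest to the boundary, f(X_u)).

    Assumption: squared loss, so the kernel model is alpha = (K_ll + lam*I)^-1 y_l.
    """
    K_ll = rbf_kernel(X_l, X_l, gamma)
    alpha = np.linalg.solve(K_ll + lam * np.eye(len(X_l)), y_l)
    f_u = rbf_kernel(X_u, X_l, gamma) @ alpha   # decision values on unlabeled data
    return int(np.argmin(np.abs(f_u))), f_u

# Toy data: one labeled sample per class, three unlabeled candidates.
X_l = np.array([[0.0, 0.0], [4.0, 0.0]])
y_l = np.array([-1.0, 1.0])
X_u = np.array([[0.5, 0.0], [2.0, 0.0], [3.8, 0.0]])
idx, f_u = margin_query(X_l, y_l, X_u)
```

In this toy example the candidate halfway between the two labeled samples has a decision value of essentially zero, so it is the one queried.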
The margin-based query selection is connected to the min-max formulation of active learning:
Figure BDA0002846467880000041
where
Figure BDA0002846467880000042
In the min-max view of active learning, this guarantees that the selected instance $x_s$ leads to a small value of the objective function regardless of its class label $y_s$. To select queries that are both informative and representative, the evaluation function $L(D_l, x_s)$ is extended to include all the unlabeled data. Suppose the class assignment $y_u$ of the unselected unlabeled instances in $D_u$ is known; the evaluation function can then be modified to
Figure BDA0002846467880000043
The $x_s$ satisfying the above formula is obtained as follows.
For simplicity of calculation:
Figure BDA0002846467880000044
The above formula becomes:
Figure BDA0002846467880000045
where $L = (K + \lambda I)^{-1}$ and $K$ is the kernel matrix; thus
Figure BDA0002846467880000046
It can be simplified to:
Figure BDA0002846467880000047
To compute this quantity efficiently for each unlabeled instance, the following notation is used. For a matrix $M$, subscript $u$ denotes the rows/columns of $M$ corresponding to the unlabeled instances in $D_u$, subscript $l$ denotes the rows/columns of $M$ corresponding to the labeled instances, and subscript $s$ denotes the row/column of $M$ corresponding to the selected instance. Subscript $a$ denotes the rows/columns of $M$ corresponding to all unlabeled instances, i.e. $D_u \cup \{x_s\}$. With these conventions, the objective function is rewritten as
Figure BDA0002846467880000048
Let
Figure BDA0002846467880000051
This gives
Figure BDA0002846467880000052
The last step follows from the relation:
Figure BDA0002846467880000053
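A minimal numerical sketch of this evaluation, in Python with NumPy, is given below. It relies on assumptions that the text does not spell out outside its formula images: that the quantity being evaluated is the quadratic form $y^T L y$ with $L = (K + \lambda I)^{-1}$, that the kernel is an RBF kernel, and that the unknown labels $y_u$ are relaxed to unconstrained real values so that the inner minimum has the closed form $\hat{y}_u = -L_{u,u}^{-1}(L_{u,l} y_l + L_{u,s} y_s)$. The sketch takes the maximum over $y_s$ of that minimum, which only approximates the min-max order. The names `quire_scores`, `lam`, and `gamma` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix of the rows of X with themselves."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def quire_scores(X, y, labeled_idx, lam=1.0, gamma=1.0):
    """Score every unlabeled candidate s; the smallest score is queried next.

    y holds +1/-1 at labeled_idx; its other entries are ignored.
    """
    n = X.shape[0]
    L = np.linalg.inv(rbf_kernel(X, gamma) + lam * np.eye(n))
    lab = list(labeled_idx)
    lab_set = set(lab)
    unl = [i for i in range(n) if i not in lab_set]
    y_l = y[lab]
    scores = {}
    for s in unl:
        u = [i for i in unl if i != s]          # remaining unlabeled instances
        worst = -np.inf
        for y_s in (-1.0, 1.0):                 # adversarial label of the candidate
            val = (y_l @ L[np.ix_(lab, lab)] @ y_l
                   + L[s, s]
                   + 2.0 * y_s * (L[s, lab] @ y_l))
            if u:
                b = L[np.ix_(u, lab)] @ y_l + L[u, s] * y_s
                # closed-form minimum over the relaxed labels y_u
                val -= b @ np.linalg.solve(L[np.ix_(u, u)], b)
            worst = max(worst, val)
        scores[s] = worst
    return scores

X = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y = np.array([-1.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # only indices 0 and 5 are labeled
scores = quire_scores(X, y, labeled_idx=[0, 5])
query = min(scores, key=scores.get)              # candidate to send for labeling
```

Recomputing `L[np.ix_(u, u)]` for every candidate is the costly part; the block-matrix relation mentioned in the text is what avoids this repeated inversion, and the sketch deliberately omits that optimization.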
Step 2: combine the selected first-type samples with the existing labeled samples to form a training set.
The training set formed from the selected samples and the existing labeled samples comprises the initially labeled samples of the whole project and the representative samples selected by Active Learning:
$Data_{train} = Data_{labeled} \cup Data_{selected}$
where $Data_{train}$ denotes the training set, $Data_{selected}$ denotes the representative samples selected by Active Learning, and $Data_{labeled}$ denotes the initially labeled samples of the project.
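The union above can be sketched in a few lines of Python. The data layout is an assumption made for illustration: samples are identified by hypothetical string ids, the labeled set is a dictionary from id to label, and `oracle` stands in for the human annotator who labels the selected samples.

```python
def build_training_set(labeled, selected, oracle):
    """Union of the initially labeled samples and the newly labeled selected ones.

    labeled  : dict mapping sample id -> class label
    selected : iterable of sample ids chosen by Active Learning
    oracle   : callable returning the manual label for a sample id (hypothetical)
    """
    train = dict(labeled)                 # copy, so the input is left unchanged
    for sample_id in selected:
        if sample_id not in train:        # union semantics: no duplicates
            train[sample_id] = oracle(sample_id)
    return train

labeled = {"file_a.c": 0, "file_b.c": 1}
selected = ["file_c.c", "file_a.c"]
train = build_training_set(labeled, selected, oracle=lambda sid: 0)
```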
Step 3: handle the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learn the classification features.
Software aging defect prediction faces a very severe class imbalance problem. For example, in the Linux data set commonly used for aging defect prediction, aging defects account for only 0.59% of the samples, so the imbalance must be addressed. In this step, SMOTE first oversamples the minority class, and ENN then undersamples the majority class.
The basic idea of the SMOTE algorithm is to analyze the minority-class samples and synthesize new samples from them to add to the data set. The algorithm proceeds as follows.
(1) For each sample $x$ in the minority class, compute the Euclidean distance from $x$ to every sample in the minority-class sample set and obtain the $k$ nearest neighbors of $x$.
(2) Set a sampling ratio according to the sample imbalance ratio to determine the sampling rate $N$. For each minority-class sample $x$, randomly select several samples from its $k$ nearest neighbors; denote a selected neighbor by $x_n$.
(3) For each randomly selected neighbor $x_n$, construct a new sample from it and the original sample $x$ according to the following formula:
Figure BDA0002846467880000061
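The three SMOTE steps can be sketched in plain Python as follows. The interpolation used here, $x_{new} = x + \mathrm{rand}(0,1) \cdot (x_n - x)$, is the standard SMOTE construction and is assumed to correspond to the formula shown above; the function name `smote` and its parameters are hypothetical, and the sketch requires at least two minority samples.

```python
import math
import random

def smote(minority, k=5, n_new_per_sample=1, seed=0):
    """Synthesize new minority samples by interpolating towards nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step (1): k nearest minority neighbors of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_sample):
            # Step (2): pick one neighbor x_n at random.
            x_n = rng.choice(neighbors)
            # Step (3): place the new sample on the segment between x and x_n.
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, k=2, n_new_per_sample=3)
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by the minority class.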
The main idea of the ENN algorithm is as follows. For each sample in the training set, find its three nearest neighbors. If the sample belongs to the majority class and two or more of its three nearest neighbors belong to the minority class, delete the sample. Conversely, if the sample belongs to the minority class and two or more of its three nearest neighbors belong to the majority class, remove the majority-class samples among those neighbors.
The two methods above are applied in combination to handle the class imbalance problem.
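The ENN cleaning rule described above can be sketched in plain Python. The sketch follows the rule as worded in the description and reads the threshold as two or more of the three neighbors, which is an interpretation; the function name `enn_clean` and the label encoding (0 for the majority class, 1 for the minority class) are hypothetical.

```python
import math

def enn_clean(X, y, majority=0, minority=1, k=3):
    """Remove majority samples that sit among minority samples."""
    n = len(X)
    remove = set()
    for i in range(n):
        # k nearest neighbors of sample i (excluding i itself).
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))
        nn = order[:k]
        n_min = sum(1 for j in nn if y[j] == minority)
        n_maj = sum(1 for j in nn if y[j] == majority)
        if y[i] == majority and n_min >= 2:
            remove.add(i)                                   # drop the sample itself
        elif y[i] == minority and n_maj >= 2:
            remove.update(j for j in nn if y[j] == majority)  # drop majority neighbors
    keep = [i for i in range(n) if i not in remove]
    return [X[i] for i in keep], [y[i] for i in keep]

# Four majority samples far away, three minority samples, and one majority
# sample lying in the middle of the minority cluster.
X = [[10, 10], [10, 11], [11, 10], [11, 11],
     [0, 0], [0, 1], [1, 0],
     [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 0]
X_clean, y_clean = enn_clean(X, y)
```

In this toy data only the majority sample at (0.5, 0.5) is removed, since all three of its nearest neighbors are minority samples.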
Step 4: using the data processed in step 3, train a prediction model with a machine learning method and predict the aging defects on the test set.
In this step, a machine learning algorithm is used to predict the target project, such as Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), or Support Vector Machine (SVM). The classifier parameters are determined by ten-fold cross-validation. All six machine learning classifiers achieve good results, with the best results obtained when NB and KNN are used as the classifier.
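Parameter selection by ten-fold cross-validation can be sketched in plain Python with a simple K-nearest-neighbors classifier standing in for the six classifiers named above. Everything specific here is an assumption made for illustration: the fold assignment by index modulo ten (not stratified), accuracy as the selection criterion, the candidate values of `k`, and the function names.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (use odd k for two classes)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = [train_y[i] for i in order]
    return max(set(votes), key=votes.count)

def cross_val_accuracy(X, y, k, n_folds=10):
    """Mean accuracy of KNN over n_folds folds; fold of sample i is i % n_folds."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        tr_X = [X[i] for i in train_idx]
        tr_y = [y[i] for i in train_idx]
        for i in test_idx:
            correct += int(knn_predict(tr_X, tr_y, X[i], k) == y[i])
    return correct / n

def select_k(X, y, candidates=(1, 3, 5)):
    """Pick the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_val_accuracy(X, y, k))

# Two well-separated clusters as toy data.
X = [[i * 0.1, 0.0] for i in range(20)] + [[5.0 + i * 0.1, 0.0] for i in range(20)]
y = [0] * 20 + [1] * 20
best_k = select_k(X, y)
acc = cross_val_accuracy(X, y, best_k)
```

On strongly imbalanced data such as aging defect data sets, accuracy can be misleading, so a different selection criterion may be preferable in practice.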
In the invention, static code metrics are collected from the software, Active Learning selects the samples to be labeled as the training set, and the remaining samples without class labels are predicted. Active Learning selects representative and informative samples according to a given strategy; these samples are manually labeled and form the training set used here. The severe class imbalance in software aging is then relieved by combining oversampling with undersampling, and prediction is finally performed with a machine learning classifier (naive Bayes, logistic regression, and so on). The invention takes into account that software aging defect data sets contain few samples and are time-consuming and labor-intensive to collect, and therefore combines undersampling with oversampling to relieve the extremely severe class imbalance. The method addresses the small amount of software aging data, the difficulty of collecting it, and the accuracy of in-project aging defect prediction; it helps developers find and remove software aging-related defects during development and testing, and avoids the losses caused by software aging. The feasibility of the method has been verified on real software, and it can be extended to other software to predict software aging-related defects.
Although the invention has been described with reference to the embodiments shown in the drawings, it is not limited to those embodiments, which are illustrative rather than restrictive. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. An Active Learning-based intra-project software aging defect prediction method, characterized by comprising the following steps:
step 1, from the samples without class labels in the project, selecting first-type samples that are representative and informative by using Active Learning;
step 2, combining the selected first-type samples with the existing labeled samples to form a training set;
step 3, handling the class imbalance problem in the training set by combining the oversampling method SMOTE with the undersampling method ENN, and learning the classification features;
step 4, using the data processed in step 3, training a prediction model with a machine learning method and predicting the aging defects on the test set.
2. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in the step of selecting representative and informative samples by Active Learning, the QUIRE method (Active Learning by Querying Informative and Representative Examples) proposed in the Active Learning field is used; informative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the labeled samples, and representative samples are selected according to the uncertainty, on the candidate samples, of a classifier trained on the unlabeled samples.
3. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in step 2 the selected samples and the existing labeled samples form the training set, which comprises the initially labeled samples of the whole project and the representative samples selected by Active Learning.
4. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein the SMOTE + ENN sampling scheme is adopted, the minority class being oversampled and the majority class being undersampled to form the final training set.
5. The Active Learning-based intra-project software aging defect prediction method according to claim 1, wherein in the step of training the prediction model, the cross-project prediction task is performed with a machine learning method; six classifiers are used, namely decision tree, K-nearest neighbors, support vector machine, logistic regression, random forest, and naive Bayes; and the optimal classifier parameters are selected by ten-fold cross-validation.
CN202011511241.4A 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning Active CN112527670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511241.4A CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011511241.4A CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Publications (2)

Publication Number Publication Date
CN112527670A true CN112527670A (en) 2021-03-19
CN112527670B CN112527670B (en) 2022-06-03

Family

ID=75001726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011511241.4A Active CN112527670B (en) 2020-12-18 2020-12-18 Method for predicting software aging defects in project based on Active Learning

Country Status (1)

Country Link
CN (1) CN112527670B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant