CN109656808B - Software defect prediction method based on hybrid active learning strategy - Google Patents


Publication number
CN109656808B
CN109656808B (application CN201811319619.3A)
Authority
CN
China
Prior art keywords
sample
data
entropy
data set
information entropy
Prior art date
Legal status
Active
Application number
CN201811319619.3A
Other languages
Chinese (zh)
Other versions
CN109656808A (en
Inventor
曲豫宾 (Qu Yubin)
李芳 (Li Fang)
Current Assignee
Nantong Textile Vocational Technology College
Original Assignee
Nantong Textile Vocational Technology College
Priority date
Filing date
Publication date
Application filed by Nantong Textile Vocational Technology College filed Critical Nantong Textile Vocational Technology College
Priority to CN201811319619.3A priority Critical patent/CN109656808B/en
Publication of CN109656808A publication Critical patent/CN109656808A/en
Application granted granted Critical
Publication of CN109656808B publication Critical patent/CN109656808B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Abstract

The invention discloses a software defect prediction method based on a hybrid active learning strategy. The method combines cost-sensitive information entropy with relative entropy in a collaborative active learning scheme: ordinary information entropy serves as the quality measure for candidate samples, samples with high information entropy are labeled manually, and samples with low information entropy are further screened using relative entropy, so that the labeled data set is expanded more effectively. Experiments show that the method improves software defect prediction performance while reducing manual labeling cost.

Description

Software defect prediction method based on hybrid active learning strategy
Technical Field
The invention relates to the technical field of active learning, in particular to a software defect prediction method based on a hybrid active learning strategy.
Background
Defective software modules can cause operational failures in production, inflicting heavy losses on an enterprise and reducing customer satisfaction. Software defect prediction models aim to discover defective modules as early as possible during development; common models include supervised and unsupervised ones.
If a software project has abundant historical labeled data, a supervised machine learning model can be built for within-project defect prediction (WPDP), estimating the probability that a module is defective or the number of defects in a module. In practice, however, if a project is brand new or has little training data, an enterprise must invest considerable time in labeling defective modules. Labeling is also highly specialized work that requires experienced personnel, so building a defect prediction model consumes substantial time and labor and raises development cost.
Active learning addresses the labeling problem with a variety of query strategies: faced with a large pool of unlabeled modules, an enterprise can actively select certain samples for labeling, add them to the labeled data set once manual labeling is complete, and quickly build a software defect prediction model. A selection strategy picks high-quality samples from the defect prediction data set; after manual labeling these samples expand the training set, and other machine learning techniques such as dimensionality reduction and feature selection can be combined to further improve prediction performance.
Commonly used selection strategies include uncertainty-based information entropy. Existing work, however, pays little attention to low-entropy samples, i.e., samples the model is already confident about: they are usually discarded within a single active learning query round, and their information is rarely exploited.
Patent CN201710271035.2 discloses a multi-label active learning method based on a condition-dependent label set, which screens high-information samples as active learning targets by combining sample information entropy and relative entropy. Although that method also lets information entropy and relative entropy work together, it computes relative entropy inside the information entropy processing stage, which hurts the efficiency and effectiveness of the system; moreover, low-entropy samples are still not well exploited.
Disclosure of Invention
In order to solve the problems of high manual labeling cost and low prediction performance, the invention provides a software defect prediction method based on a hybrid active learning strategy.
A software defect prediction method based on a hybrid active learning strategy, characterized in that it adopts a collaborative active learning method combining cost-sensitive information entropy with relative entropy, abbreviated as the UNCERTAINTYKL model. The UNCERTAINTYKL model uses information entropy as the quality measure for samples: samples with high information entropy are selected from the unlabeled sample data for manual labeling, while samples with low information entropy are further analyzed using relative entropy, thereby further expanding the labeled data set.
Preferably, the UNCERTAINTYKL model comprises the following steps:
Step 1: calculate the information entropy of each unlabeled sample via the information entropy formula;
Step 2: select the data sample with the highest information entropy from the unlabeled data via formula (1), submit it to a domain expert for manual labeling, and add it to the labeled data set after labeling;
Step 3: from the samples remaining after step 2, screen the unlabeled sample with the lowest information entropy and evaluate it via relative entropy;
Step 4: preset a relative entropy threshold; if the relative entropy is below the threshold, add the sample to the labeled data set, using the predicted label as its pseudo-label; if the relative entropy is above the threshold, discard the sample.
Preferably, the calculation method of the data sample with the highest information entropy in step 2 is as follows:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
where i indexes the unlabeled examples (i = 1, 2, …, u), y_i denotes the label value of the example to be classified, x_{u,max} denotes the data sample with maximum information entropy in the unlabeled data set obtained from formula (1), and P_θ(y_i|x) denotes the predicted probability that x belongs to class y_i under the distribution of the labeled data set.
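As a minimal illustration (not part of the patent's implementation), the selection rule of formula (1) can be sketched in Python with numpy; the function name and the toy probability values below are invented for the example:

```python
import numpy as np

def entropy_query(proba):
    """Pick the index of the unlabeled sample with maximum predictive
    entropy, per formula (1). proba has shape (n_samples, n_classes),
    with row i holding P_theta(y|x) for unlabeled sample i."""
    eps = 1e-12  # avoid log(0) for fully confident predictions
    H = -np.sum(proba * np.log(proba + eps), axis=1)  # per-sample entropy
    return int(np.argmax(H)), H

# toy predictions: the first sample is maximally uncertain
probs = np.array([[0.5, 0.5],
                  [0.9, 0.1],
                  [0.99, 0.01]])
idx, H = entropy_query(probs)  # idx == 0: the 50/50 sample is queried
```

The entropy is highest for the uniform prediction and shrinks as the classifier becomes more certain, which is exactly the ordering the query strategy relies on.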
Preferably, the relative entropy calculation method in step 3 includes the following formula:
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
where KLD denotes the mean of the relative entropies computed over all classification models, x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from formula (1), and C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l; the classification committee C = {θ_1, …, θ_m}, whose classifier members embody different classification strategies, each able to predict the current label for unlabeled data; P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i, and D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus of the other models.
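The committee disagreement measure of formulas (2)-(4) can be sketched as follows (an illustrative mockup, not the patent's code; the function name and the toy committee outputs are invented):

```python
import numpy as np

def mean_kld(committee_proba):
    """Mean relative entropy of each committee member's prediction from
    the consensus distribution for one sample, per formulas (2)-(4).
    committee_proba: shape (C, n_classes), one row per classifier."""
    eps = 1e-12
    consensus = committee_proba.mean(axis=0)  # formula (2): P_C(y_i|x)
    # formula (3): D(P_theta_c || P_C) for each committee member
    kld = np.sum(committee_proba * np.log((committee_proba + eps) /
                                          (consensus + eps)), axis=1)
    return float(kld.mean())  # formula (4): mean KLD over the committee

THRESHOLD = 0.1  # empirical threshold from step 4
agree = np.array([[0.8, 0.2], [0.8, 0.2]])     # members agree -> KLD near 0
disagree = np.array([[0.9, 0.1], [0.2, 0.8]])  # members disagree -> large KLD
should_pseudo_label = mean_kld(agree) < THRESHOLD  # True
```

When the committee agrees, the mean KLD falls below the 0.1 threshold and the low-entropy sample would be pseudo-labeled; strong disagreement pushes the KLD above the threshold and the sample is discarded.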
Preferably, the threshold in step 4 is set to the empirical value 0.1; if the mean KLD value satisfies the threshold range, θ_i is used to pseudo-label the sample x_{u,min}.
Preferably, to solve the model, the following staged optimization strategy is adopted; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn at random from the sample pool and handed to domain experts for manual labeling; this set is denoted D_l; the D_l data set is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: use the classification model θ_1 to predict each unlabeled sample, compute each sample's information entropy via the formula, sort the samples, take out the sample x_{u,max} with maximum entropy, have a domain expert label it manually, and add x_{u,max} to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: take out the sample x_{u,min} with minimum entropy, compute its relative entropy, i.e. KLD, via the formulas, compare the KLD with the threshold, and if the threshold is satisfied, pseudo-label x_{u,min} and add it to the labeled data set D_l;
D. Classification model update: retrain the classification model θ_1 with the labeled data set D_l, and loop until the termination condition is satisfied.
Advantageous effects:
1. Information entropy and relative entropy are fused to work collaboratively in a hybrid active learning query strategy: low-entropy samples are further analyzed using relative entropy, so the information contained in the samples is exploited more fully, improving software defect prediction performance and discovering defective modules faster.
2. With the hybrid active learning query strategy, an enterprise only needs to invest relatively little labeling effort up front to obtain good software defect prediction capability, so cost can be controlled while meeting the enterprise's requirements, and labor is saved.
Drawings
FIG. 1 is a schematic diagram of the algorithm flow of the software defect prediction method with the hybrid active learning strategy according to the present invention
FIG. 2 is a schematic diagram of AUC indexes of the Equinox data set under different active learning query strategies
FIG. 3 is a schematic diagram of AUC indexes of the Eclipse JDT Core data set under different active learning query strategies
FIG. 4 is a schematic diagram of AUC indexes of the Apache Lucene data set under different active learning query strategies
FIG. 5 is a schematic diagram of AUC indexes of the Mylyn data set under different active learning query strategies
FIG. 6 is a schematic diagram of AUC indexes of the Eclipse PDE UI data set under different active learning query strategies
Detailed Description
In this section we describe in detail the cost-sensitive collaborative active learning strategy based on information entropy and relative entropy, abbreviated as the UNCERTAINTYKL model. The learning strategy is applied to the AEEEM data set, a common benchmark in software defect prediction; with the same amount of labeled data, the classification model trained incrementally with the improved selection strategy achieves better classification indexes.
UNCERTAINTYKL active learning strategy
The UNCERTAINTYKL model integrates the idea of co-training into the active learning query strategy and builds the model while minimizing the number of labels required. Following the uncertainty (information entropy) strategy, the sample with the highest information entropy is selected from the unlabeled data and labeled by a domain expert, while the sample with the lowest information entropy is voted on by the query committee according to KLD: if the relative entropy computed by the committee is low (the current empirical threshold is 0.1), the sample is added to the labeled data set with the predicted label as its pseudo-label. Specifically:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
Define the data set containing l labeled examples as D_l = {x_1, …, x_l}, and the data set containing u unlabeled examples as D_u = {x_{l+1}, …, x_{l+u}}. Here i denotes the i-th unlabeled example (i = 1, 2, …, u); x_{u,max} denotes the data sample with maximum entropy in the unlabeled data set obtained from (1); P_θ(y_i|x) denotes the predicted probability that x_{u,max} belongs to class y_i under the distribution of the labeled data set; and argmax(−Σ_i P_θ(y_i|x) log P_θ(y_i|x)) selects the sample with maximum entropy from the unlabeled data.
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from (1); C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l. The classification committee C = {θ_1, …, θ_m}; its classifier members embody different classification strategies, each able to predict the current label for unlabeled data. P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i. D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus, and KLD denotes the mean of the relative entropies computed over all classification models. If the relative entropy is less than the threshold, θ_i is used to pseudo-label the example to be labeled.
The above algorithm is specified as follows. A staged optimization strategy is used to solve the model; the optimization process is as follows:
A. System initialization
Before the system starts running, a portion of samples is drawn at random from the sample pool and handed to domain experts for manual labeling; this set is denoted D_l. The D_l data set is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data.
B. Active selection of unlabeled samples
Use the classification model θ_1 to predict each unlabeled sample, compute each sample's information entropy via formula (1), sort the samples, take out the sample x_{u,max} with maximum entropy, have a domain expert label it manually, and add x_{u,max} to the labeled data set D_l.
C. Pseudo-labeling of the most certain sample
Take out the sample x_{u,min} with minimum entropy, compute its relative entropy, i.e. KLD, via formulas (2), (3), and (4), and compare the KLD with the threshold; if the threshold is satisfied, pseudo-label x_{u,min} and add it to the labeled data set D_l.
D. Classification model update
Retrain the classification model θ_1 with the labeled data set D_l, and loop until the termination condition is satisfied.
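The A-D loop above can be sketched as follows. This is an illustrative Python/scikit-learn mockup under stated assumptions, not the patent's implementation: logistic regression stands in for the patent's classifiers, a second differently regularized model stands in for the query committee, and the true label plays the role of the domain expert.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
y_train = y.astype(int).copy()  # training labels (may later hold pseudo-labels)

# A. initialization: random initial labeled pool, train theta_1
labeled = rng.choice(len(X), size=60, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y_train[labeled])

THRESHOLD, eps = 0.1, 1e-12

for _ in range(20):                                  # fixed query budget
    if len(unlabeled) < 2:
        break
    proba = clf.predict_proba(X[unlabeled])
    H = -np.sum(proba * np.log(proba + eps), axis=1)

    # B. most uncertain sample -> "expert" labeling (true label stands in)
    q = unlabeled[np.argmax(H)]
    labeled = np.append(labeled, q)
    unlabeled = unlabeled[unlabeled != q]

    # C. most certain sample -> pseudo-label if committee KLD < threshold
    clf2 = LogisticRegression(C=0.1, max_iter=1000).fit(X[labeled], y_train[labeled])
    proba = clf.predict_proba(X[unlabeled])
    H = -np.sum(proba * np.log(proba + eps), axis=1)
    c = unlabeled[np.argmin(H)]
    p1, p2 = clf.predict_proba(X[[c]])[0], clf2.predict_proba(X[[c]])[0]
    consensus = (p1 + p2) / 2
    kld = np.mean([np.sum(p * np.log((p + eps) / (consensus + eps)))
                   for p in (p1, p2)])
    if kld < THRESHOLD:
        y_train[c] = clf.predict(X[[c]])[0]          # pseudo-label, not the true label
        labeled = np.append(labeled, c)
        unlabeled = unlabeled[unlabeled != c]

    # D. retrain theta_1 on the enlarged labeled set
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y_train[labeled])
```

Each pass through the loop grows the labeled pool by one expert-labeled sample and, when the committee agrees, one pseudo-labeled sample, mirroring steps B-D.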
The whole procedure can be summarized as Algorithm 1.
Algorithm 1: solving the UNCERTAINTYKL model with the staged strategy
1: Input: initial labeled data set D_l = {x_1, …, x_l}; unlabeled data set D_u = {x_{l+1}, …, x_{l+u}}; labels y_1, …, y_l; maximum number of iterations U_max; KLD threshold threshold
2: Output: set of classification committee members {θ_1, …, θ_m}; set of classification performance values
3: initialize the query committee's classification models with the labeled data
4: while current iteration count < maximum iteration count || not converged do
5:   for i <- 1 to U_max do
6:     take out committee member θ_i and train the model on the labeled data set
7:     for the current x^(i), compute the classification probability P and the corresponding information entropy −Σ_i P_θ(y_i|x) log P_θ(y_i|x)
8:     have a domain expert manually label the maximum entropy sample x^(i) and add it to the labeled data set
9:     for the minimum entropy sample x^(t), compute the relative entropy, i.e. the mean KLD value, via formulas (2), (3), and (4)
10:    D_l = D_l ∪ x^(i); D_u = D_u \ x^(i)
11:    if KLD < threshold: D_l = D_l ∪ x^(t); D_u = D_u \ x^(t)
12:    i = i + 1
13:  end for
14: end while
15: return the committee classifier set and its classification performance
Design of experiments
Evaluation object
The influence of the UNCERTAINTYKL query strategy on active learning in the field of software defect prediction is analyzed and evaluated on the public AEEEM data set, which is widely used as a benchmark for performance comparison in this field. The data set provides 61 metrics, including software development process metrics; all 61 metrics were used for classifier modeling in this experiment. Summary information for the AEEEM data set is shown in Table 1.
Table 1 summary of AEEEM data set
Experimental setup
The experiments use 5×2 cross-validation: in each run the data are randomly stratified-sampled, with half used as training data and half as test data, preventing overlap between training and test data from making the evaluation results non-independent. A fixed proportion of the training data is labeled manually; in the experiments the initial labeled proportion is 30%. A classification model is trained on the initial labeled set, and the remaining 70% serve as unlabeled data selected according to the active learning strategy. The classifier in the experiments is a support vector machine (SVM) trained with an RBF kernel as implemented by libsvm, using default parameters. The UNCERTAINTYKL query strategy uses a random forest classifier (sklearn's RandomForestClassifier) trained with default parameters. In the UNCERTAINTYKL strategy, the unlabeled data are iterated over; in each iteration the sample with the highest uncertainty is selected for manual labeling, while the sample with the lowest uncertainty is further judged using KLD, with the judgment threshold set to 0.1.
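The 5×2 cross-validation protocol described above can be reproduced in outline with scikit-learn. This is a sketch under stated assumptions: synthetic data stands in for an AEEEM project, and the patent's libsvm setup is approximated by sklearn's SVC.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

# synthetic stand-in for an AEEEM project: 61 metrics per module
X, y = make_classification(n_samples=300, n_features=61, random_state=0)

aucs = []
for rep in range(5):                  # 5 x 2 cross-validation
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
    for train, test in skf.split(X, y):   # each half serves once as train, once as test
        clf = SVC(kernel="rbf").fit(X[train], y[train])  # RBF-kernel SVM, defaults
        aucs.append(roc_auc_score(y[test], clf.decision_function(X[test])))
```

Five repetitions of a stratified two-fold split yield ten AUC estimates whose mean and standard deviation can be reported as in Table 3.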
Evaluation index
The AEEEM data set suffers from class imbalance, so the AUC (area under the ROC curve) index better reflects the performance of an active learning query strategy; it is one of the most widely used indexes in software defect prediction. The index is based on the ROC (receiver operating characteristic) curve. The confusion matrix of the binary classification model in software defect prediction is shown in Table 2.
TABLE 2 confusion matrix
                          predicted defective    predicted non-defective
actually defective               TP                        FN
actually non-defective           FP                        TN
The ROC curve takes the false positive rate (FPR) as the X-axis and the true positive rate (TPR) as the Y-axis. TPR: the proportion of modules correctly judged defective among all samples that are actually defective. FPR: the proportion of modules wrongly judged defective among all samples that are actually non-defective.
TPR=TP/(TP+FN)
FPR=FP/(TN+FP)
The AUC value equals the area under the ROC curve; it ranges from 0 to 1, and the larger the value, the better the model performance.
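For concreteness, the two rates can be computed directly from the confusion-matrix cells (the counts below are invented for the example):

```python
def rates(tp, fn, fp, tn):
    """TPR and FPR from the Table 2 confusion matrix counts."""
    tpr = tp / (tp + fn)   # defective modules correctly flagged
    fpr = fp / (tn + fp)   # clean modules wrongly flagged
    return tpr, fpr

# hypothetical counts for illustration
tpr, fpr = rates(tp=30, fn=10, fp=5, tn=55)   # tpr = 0.75, fpr = 5/60
```

Sweeping the classifier's decision threshold traces out (FPR, TPR) pairs, and the area under that curve is the AUC.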
Reference method
The invention uses the following three active learning query strategies as baselines for comparison with the UNCERTAINTYKL strategy:
(1) random sampling strategy (random): randomly select a query instance from the unlabeled data, submit it to a domain expert for labeling, and add the labeled instance to the training data set;
(2) uncertainty information entropy sampling strategy (uncertainty): train an SVM classifier on the training data set, predict each unlabeled instance, compute its information entropy, and select the instance with the highest uncertainty for labeling;
(3) query-committee-based active learning strategy (committee): an SVM and a random forest form a query committee that selects the samples to label. In addition, a model trained on all the training data and tested on the test data is included for comparison with the other active learning strategies; its performance can be regarded approximately as the best achievable on the training set.
Results and analysis of the experiments
The results plotted in Figures 2 to 6 are summarized in Tables 3 and 4 below:
Table 3: AUC value comparison (mean ± standard deviation); the best performance under paired t-tests at 95% confidence is marked in bold
Table 4: win/tie/loss comparative analysis of uncertaintiykl model and other models under different labeling proportions
As shown in Tables 3 and 4, Figures 2 to 6 illustrate how the AUC values of the different active learning query strategies change with the number of labeled instances.
Table 3 shows the AUC values at labeled sample proportions of 10%, 20%, 30%, 40%, and 50%. Once the proportion of labeled samples exceeds 50%, the UNCERTAINTYKL strategy has already labeled all samples via pseudo-labeling, so no unlabeled samples remain. Statistical analysis was performed with paired t-tests at 95% confidence, and the best performing model is marked. Table 4 gives a win/tie/loss analysis of the different learning strategies at different labeling proportions, counting comparisons between the UNCERTAINTYKL strategy and the committee, random, and uncertainty strategies.
First, as the unlabeled samples are gradually consumed and labeled samples are added to the labeled data set, the evaluation index maintains a basically rising trend, demonstrating the effectiveness of the active learning strategy. The uncertainty sampling strategy performs well, confirming its role as a baseline in the active learning field. The query-committee-based strategy shows great instability on the AEEEM data set. In most cases UNCERTAINTYKL performs best, improving markedly over the other active learning strategies, with performance gains of up to 13%.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.

Claims (4)

1. A software defect prediction method based on a hybrid active learning strategy, characterized in that the method adopts a collaborative active learning approach combining cost-sensitive information entropy with relative entropy, abbreviated as the UNCERTAINTYKL model; the UNCERTAINTYKL model uses information entropy as the quality measure for samples, selects samples with high information entropy from the unlabeled sample data for manual labeling, and further analyzes samples with low information entropy using relative entropy, thereby further expanding the labeled data set;
the UNCERTAINTYKL model comprises the following steps:
step 1: calculating the information entropy of each unlabeled sample via the information entropy formula;
step 2: selecting the data sample with the highest information entropy from the unlabeled data, having a domain expert label it manually, and adding it to the labeled data set after labeling;
step 3: from the samples remaining after step 2, screening the unlabeled sample with the lowest information entropy and evaluating it via relative entropy;
step 4: presetting a relative entropy threshold; if the relative entropy is below the threshold, adding the sample to the labeled data set and using the predicted label as its pseudo-label; if the relative entropy is above the threshold, discarding the sample;
to solve the UNCERTAINTYKL model, the following staged optimization strategy is adopted; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn at random from the sample pool and labeled manually by domain experts; this set is denoted the labeled data set D_l; the labeled data set D_l is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: the classification model θ_1 predicts each unlabeled sample; the information entropy of each unlabeled sample is computed via the formula, the samples are sorted, and the sample x_{u,max} with maximum entropy is taken out, handed to a domain expert for manual labeling, and added to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: the sample x_{u,min} with minimum entropy is taken out, its relative entropy, i.e. KLD, is computed via the formulas, and the KLD is compared with the threshold; if the threshold is satisfied, x_{u,min} is pseudo-labeled and added to the labeled data set D_l;
D. Classification model update: the classification model θ_1 is retrained with the labeled data set D_l, and the procedure loops until the termination condition is satisfied.
2. The method of claim 1, wherein the data sample with the highest entropy in step 2 is calculated as follows:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
where i indexes the unlabeled examples (i = 1, 2, …, u), y_i denotes the label value of the example to be classified, x_{u,max} denotes the data sample with maximum information entropy in the unlabeled data set obtained from formula (1), and P_θ(y_i|x) denotes the predicted probability that x belongs to class y_i under the distribution of the labeled data set.
3. The method of claim 1, wherein the relative entropy calculation in step 3 comprises the following formula:
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
where KLD denotes the mean of the relative entropies computed over all classification models, x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from formula (1), and C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l; the classification committee C = {θ_1, …, θ_m}, whose classifier members embody different classification strategies, each able to predict the current label for unlabeled data; P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i, and D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus of the other models.
4. The method of claim 3, wherein the threshold in step 4 is set to the empirical value 0.1; if the mean KLD value satisfies the threshold range, θ_i is used to pseudo-label the sample x_{u,min}.
CN201811319619.3A 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy Active CN109656808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811319619.3A CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811319619.3A CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Publications (2)

Publication Number Publication Date
CN109656808A CN109656808A (en) 2019-04-19
CN109656808B true CN109656808B (en) 2022-03-11

Family

ID=66110556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811319619.3A Active CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Country Status (1)

Country Link
CN (1) CN109656808B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353291B (en) * 2019-12-27 2023-08-01 北京合力亿捷科技股份有限公司 Method and system for calculating optimal annotation set based on complaint work order training text
CN111506504B (en) * 2020-04-13 2023-04-07 扬州大学 Software development process measurement-based software security defect prediction method and device
CN111400617B (en) * 2020-06-02 2020-09-08 四川大学 Social robot detection data set extension method and system based on active learning
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning

Citations (2)

Publication number Priority date Publication date Assignee Title
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN104899135A (en) * 2015-05-14 2015-09-09 工业和信息化部电子第五研究所 Software defect prediction method and system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN104899135A (en) * 2015-05-14 2015-09-09 工业和信息化部电子第五研究所 Software defect prediction method and system

Non-Patent Citations (2)

Title
Large Margin Distribution Learning; Yu-Hang Zhou; IEEE; 2016-07-31; pp. 1749-1762 *
Research on a Failure Prediction Model for Combat Systems Based on Small Samples and Its Application; Yang Jie; China Master's Theses Full-text Database, Information Science and Technology; 2018-03-15; pp. 5-56 *

Also Published As

Publication number Publication date
CN109656808A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109656808B (en) Software defect prediction method based on hybrid active learning strategy
CN112069310B (en) Text classification method and system based on active learning strategy
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN105306296A (en) Data filter processing method based on LTE (Long Term Evolution) signaling
CN108596204B (en) Improved SCDAE-based semi-supervised modulation mode classification model method
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
NL2029214A (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN110796260B (en) Neural network model optimization method based on class expansion learning
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN106156107B (en) Method for discovering news hotspots
CN110955811B (en) Power data classification method and system based on naive Bayes algorithm
CN112306730B (en) Defect report severity prediction method based on historical item pseudo label generation
CN112306731B (en) Two-stage defect-distinguishing report severity prediction method based on space word vector
CN112199287B (en) Cross-project software defect prediction method based on enhanced hybrid expert model
CN114896402A (en) Text relation extraction method, device, equipment and computer storage medium
CN111382191A (en) Machine learning identification method based on deep learning
Singh et al. Feature selection using classifier in high dimensional data
CN108573059A (en) A kind of time series classification method and device of feature based sampling
CN104063591B (en) One-dimensional range profile identification method for non-library target based unified model
CN117611957B (en) Unsupervised visual representation learning method and system based on unified positive and negative pseudo labels
Luo et al. Class-specific attribute reduction based on neighborhood conditional entropy
CN114781554B (en) Open set identification method and system based on small sample condition
CN116738551B (en) Intelligent processing method for acquired data of BIM model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant