CN111782807A - Self-admitted technical debt detection and classification method based on multi-method ensemble learning - Google Patents

Self-admitted technical debt detection and classification method based on multi-method ensemble learning Download PDF

Info

Publication number
CN111782807A
CN111782807A (application CN202010568813.6A)
Authority
CN
China
Prior art keywords
annotation
self
feature
classifier
acceptance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010568813.6A
Other languages
Chinese (zh)
Other versions
CN111782807B (en)
Inventor
殷茗
徐悦然
田嘉毅
朱奎宇
马怀宇
张小港
薛禹坤
吴瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010568813.6A priority Critical patent/CN111782807B/en
Publication of CN111782807A publication Critical patent/CN111782807A/en
Application granted granted Critical
Publication of CN111782807B publication Critical patent/CN111782807B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning, which comprises five steps: preprocessing the feature words; selecting the top k most useful features to train the classifier; training the corresponding sub-classifiers with the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods; integrating the sub-classifiers' predictions through a voting rule to obtain precision and recall, and combining the two into an F1 value as the subsequent evaluation criterion; and finally clustering the features that appear frequently during the experiments and have high information gain values, thereby further classifying the detected technical debt.

Description

Self-admitted technical debt detection and classification method based on multi-method ensemble learning
Technical Field
The invention belongs to the technical field of software development, and particularly relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning.
Background
The document "Huang Q, Shihab E, Xia X, et al.identifying self-adapting technical ebt in open source project using minimum [ J]Empirical software engineering,2017 discloses a method for automatically detecting self-acceptance technical debts using an integrated classifier. The method uses source code annotations from different software items to analyze the annotations to be detected. The method comprises preprocessing a source file, selecting and screening features by using the features, and using a naive Bayes polynomial (A)
Figure BDA0002548744720000011
Bayesian multinomial) trains each classifier, and finally an integrated classifier composed of a plurality of classifiers predicts according to voting rules to determine whether the statement has self-acceptance technical debt. The method is verified to have a result which is greatly improved compared with a text mode-based method and an NLP classifier method, and the method has excellent operation performance. However, the training method of the classifier is single, and the accuracy of the individual classifier is low, so that the final result is not accurate.
Self-admitted technical debt (SATD) is a term proposed to describe debt intentionally introduced during software development, usually so that a project can be developed more quickly in the short term at the expense of maintenance needed in the future; the focus of the present invention is to detect SATD accurately in order to help mitigate the cost of maintaining it. However, some existing methods still depend heavily on manual detection, and many advanced methods identify SATD automatically with a single natural-language detection model. The drawbacks are obvious: manual detection is inefficient, and a single natural-language pattern classifier has low performance and poor flexibility. Although previous detection work has achieved good results, SATD in real projects is diverse and semantically variable, which makes detecting it a significant challenge. Therefore, in order to improve the accuracy and flexibility of SATD detection, the invention proposes a multi-method ensemble-learning SATD detection method. Taking 8 open-source projects as the data set, it first preprocesses the comment text, extracts features with a feature selection method, then trains the sub-classifiers with the Naïve Bayes Multinomial and Simple Logistic methods, and finally integrates the sub-classifiers into an ensemble classifier that assigns each comment a classification label according to a voting rule, thereby accurately identifying self-admitted technical debt. The experimental results are compared with three baselines (pattern-based detection, single-method ensemble learning, and NLP-based detection); the comparison shows that the proposed SATD detection method is highly accurate, improves recall markedly, achieves a better detection effect, and clearly outperforms previous detection methods.
Disclosure of Invention
Technical problem to be solved
The key technical difficulty of detecting self-admitted technical debt during software development is as follows: the debt lives at the source-code level, but it generally has to be analyzed by examining source code comments and decomposing sentences into feature words, and when the classifier is trained with a single model the resulting error is large. The invention detects self-admitted technical debt through multi-method ensemble learning. First, the feature words are preprocessed: stop words and punctuation marks are removed so that only meaningful words enter feature selection, and invalid features are filtered out to reduce noise. Word similarity is also considered; for example, words sharing a stem, such as "happy" and "happiness", are unified to their stem with the Porter stemmer. Then the top k most useful features are selected to train the classifier. Next, the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods are used to train the corresponding sub-classifiers so that the F1 value of each prediction is as high as possible, improving precision while detecting as much self-admitted technical debt as possible. Finally, the predictions of the trained sub-classifiers are integrated through the sub-classifier voting rule, which determines the final prediction of the ensemble classifier in each round.
Technical scheme
A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
Step 1: preprocessing the feature words
Process the raw comment data using heuristic rules:
(1) delete the fixed-format license description comments generated automatically by the compiler;
(2) merge multi-line comments into one sentence;
(3) delete code present in comment statements;
(4) delete Javadoc that contains no reserved words, and keep the comment sentences that contain reserved words;
Step 2: selecting the top k most useful features to train the classifier
After text preprocessing of the source-project comments, the invention uses a vector space model (VSM) to process the words that have been divided into features. In this model each comment is represented by a word vector, the divided word features can be regarded as dimensions, and each comment can be regarded as a data point in a high-dimensional space. The invention uses a HashMap as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized.
Information gain, a widely used feature selection method, is employed to select useful features. Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. Between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$).
Based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
After the information gain value corresponding to each feature has been calculated, the features are sorted from largest to smallest information gain value; the higher the score, the more important the feature is for predicting the classification label. The method keeps the features whose information gain values are in the top k% and discards the other features.
Step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
The invention sets six classifiers, namely classifiers No. 2, 3, 4, 5, 6, and 8, as polynomial naive Bayes classifiers (NBM) and trains them with the NBM method. Let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
Applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
The classification label of the comment is identified by equation (4).
(2) Simple Logistic
In the experiments, two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic). Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. By the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
Substituting $z$ into the sigmoid function, which is expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
According to the final value of the sigmoid function, the comments to be detected are divided into two categories; a label value of 1 marks a comment statement containing self-admitted technical debt.
Step 4: sub-classifier voting rule
A voting rule is adopted: the classification label predicted by the majority of the sub-classifiers is taken as the final prediction of the ensemble classifier.
Step 5: clustering for self-admitted technical debt classification
Based on the features selected by information gain in the preceding steps, the invention re-screens the original data according to how often each feature appears, where it appears, and the habits of the developers, and finally classifies the feature words with a clustering method.
The k% in step 2 is 10%.
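The five steps above can be prototyped end to end with standard tooling. The following is a minimal sketch, assuming scikit-learn is available; CountVectorizer plays the role of the HashMap-based VSM of step 2, mutual_info_classif (mutual information) stands in for the information-gain ranking, and LogisticRegression stands in for the Simple Logistic method (whose LogitBoost training is sketched later in the detailed description). All function names and parameters here are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of the multi-method ensemble (illustrative assumptions throughout).
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_sub_classifiers(projects):
    """projects: list of 8 (comments, labels) pairs, one per source project.
    Sub-classifiers No. 1 and No. 7 use logistic regression (a stand-in for
    Simple Logistic); the other six use Naive Bayes Multinomial (step 3)."""
    subs = []
    for i, (comments, labels) in enumerate(projects, start=1):
        model = LogisticRegression(max_iter=1000) if i in (1, 7) else MultinomialNB()
        pipe = make_pipeline(
            CountVectorizer(lowercase=True),                       # step 2: term-frequency VSM
            SelectPercentile(mutual_info_classif, percentile=10),  # step 2: keep top 10% of features
            model,                                                 # step 3: NBM or logistic regression
        )
        subs.append(pipe.fit(comments, labels))
    return subs

def predict_by_vote(subs, comment):
    """Step 4: the label predicted by the majority of the sub-classifiers wins."""
    votes = [clf.predict([comment])[0] for clf in subs]
    return Counter(votes).most_common(1)[0][0]
```

Each sub-classifier here is fitted on its own source project, matching the leave-one-project-out setup described in the detailed description below.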
Advantageous effects
The self-admitted technical debt detection and classification method based on multi-method ensemble learning provided by the invention solves the optimization problem of self-admitted technical debt detection in the software development process. The invention fully considers the characteristics of self-admitted technical debt detection: after text preprocessing, it innovatively selects features by evaluating their information gain values, improves training performance by applying a different training method to different classifiers, and integrates the sub-classifiers into an ensemble classifier. The final detection prediction is thereby optimized, detection precision and coverage are improved, the classification indicators become balanced, and the detection metrics improve markedly. Finally, a clustering method classifies the features, and the clustering result is used to analyze the type of self-admitted technical debt to which each feature belongs, achieving both detection and classification of self-admitted technical debt.
As a self-admitted technical debt detection and classification technique based on multi-method ensemble learning, the invention fully considers the attributes of self-admitted technical debt in the software development process, with the aim of improving software quality as much as possible and reducing hidden risks during development. It analyzes the characteristics of self-admitted technical debt carefully, quantifies the influence of features with information gain, and refines the classification and detection process. During feature training it innovatively combines Naïve Bayes Multinomial and linear logistic regression for self-admitted technical debt classification and detection, and finally classifies the features with a clustering method to achieve both detection and classification. The experimental results are compared with four other self-admitted technical debt detection methods: the pattern-based method of Potdar and Shihab, single-method ensemble learning, the best single sub-classifier, and a natural-language (NLP) maximum-entropy classifier. Precision, recall, and F1 all improve to different degrees over the four methods; in particular, the overall F1 value improves by 51.87%, 16.22%, 28.76%, and 32.12%, respectively. Finally, the detected features carrying self-admitted technical debt are classified to obtain the final result.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is the curve of information entropy versus probability;
FIG. 3 is the sigmoid function curve.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention discloses a self-supporting technology detection classification method based on multi-method ensemble learning. The method mainly comprises five core steps: preprocessing the characteristic words; selecting the first k most useful features to train a classifier; using a naive Bayesian polynomial (
Figure BDA0002548744720000061
Bayes Multinomial) and linear Logistic regression (Simple Logistic) to train corresponding sub-classifiers; and integrating and predicting the prediction result through the sub-classifier voting rules to obtain precision (precision) and recall rate (call), and finally calculating an F1 value (F1-score) as a subsequent evaluation standard by integrating the precision and the recall rate. Finally, clustering the characteristics which frequently appear in the experimental process and have high information gain values by a clustering method, and further classifying the detected technical debts.
Step 1: preprocessing the characteristic words
The invention considers that some characteristic words are invalid, such as stop words, punctuation marks and the like, and also considers the similarity of words, for example: happy, happenses, happier, etc. have similar contents with the same stem, so the words are unified into the stem using the Porter stem algorithm.
The invention processes raw annotation data using heuristic rules:
(1) and deleting license description class comments with a fixed format automatically generated by a compiler, such as comments of functions such as automatically generated constructors and the like and catch code comment blocks automatically generated. The comments before the class declaration also do not usually contain self-acceptance technical debt, so the comments before the class declaration are also deleted.
(2) Developers sometimes write a long annotation in multiple lines rather than directly in the form of an annotation block. This annotation writing can cause a one-sentence-length annotation to be mistaken for a multi-sentence annotation. So the multiple line annotations in this case are merged into one sentence.
(3) In a software project, there is a lot of source code in the form of annotations. These codes are annotated out may be due to the code not being used on the one hand and to the code being used only for debug on the other hand. These codes present in the annotation statement typically do not contain self-acceptance technical liability and can therefore be deleted.
(4) Javadoc annotations generally do not have a self-acceptance technical debt, while some Javadoc that contain a self-acceptance technical debt generally contain reserved words such as "todo", "fixme" or "XXX". Therefore, the invention deletes the Javadoc which does not contain the reserved words and reserves the comment sentences containing the reserved words.
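As a rough illustration of rules (1)-(4), the sketch below filters one raw Java comment; the regular expression and keyword tests are illustrative assumptions, not the patent's exact heuristics.

```python
# Sketch of the heuristic comment preprocessing; the rules shown are assumptions.
import re

RESERVED = ("todo", "fixme", "xxx")        # reserved words that keep a Javadoc (rule 4)
CODE_LIKE = re.compile(r"[;{}()=]\s*$")    # crude signature of commented-out code

def preprocess_comment(raw, is_javadoc=False):
    """Apply rules (1)-(4) to one raw comment; return cleaned text or None."""
    lines = [ln.strip(" \t/*") for ln in raw.splitlines()]
    lines = [ln for ln in lines if ln and not CODE_LIKE.search(ln)]  # rule 3: drop code lines
    text = " ".join(lines)                                           # rule 2: merge into one sentence
    low = text.lower()
    if "license" in low or "copyright" in low:                       # rule 1: license descriptions
        return None
    if is_javadoc and not any(w in low for w in RESERVED):           # rule 4: Javadoc w/o reserved words
        return None
    return text
```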
Step 2: selecting the top k most useful features to train the classifier
After text preprocessing of the source-project comments, the invention uses a vector space model (VSM) to process the words that have been divided into features. In this model each comment is represented by a word vector: the divided word features can be regarded as dimensions, and each comment as a data point in a high-dimensional space. The invention uses a HashMap as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized.
Reading the source-project comments shows that text preprocessing leaves each source project with a large number of features; the ArgoUML project, for example, has 3661 features. With a vector space model such high dimensionality degrades experimental performance and can even affect the final result. Simple manual analysis also shows that comments with self-admitted technical debt are a small minority, i.e. the classes are imbalanced, which adds to the difficulty of detecting self-admitted technical debt.
To address these issues, the invention uses feature selection to extract the subset of features most useful for classification when detecting self-admitted technical debt. Previous research and practice show that feature selection can significantly improve the classification performance of a classifier. The invention employs information gain, a widely used feature selection method, to select useful features.
Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. Between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$).
Based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
The information gain measures how much knowing whether a feature occurs in the current comment to be tested contributes to predicting the classification label. After the information gain value of each feature is calculated, the features are sorted from largest to smallest; the higher the score, the more important the feature is for predicting the classification label. The invention keeps the features whose information gain values lie in the top k% and discards the rest. This reduces the number of features both in the model-construction stage and in the prediction stage, which greatly improves the efficiency of the experiments. By default, the invention empirically selects the top 10% of the total number of features, which makes the experimental results nearly optimal.
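For concreteness, this selection step can be sketched as below: the information gain of equation (1) is computed from presence counts and the features are ranked. The helper names and the binary (presence/absence) simplification are illustrative assumptions.

```python
# Sketch: rank features by the information gain of equation (1), keep the top k%.
import math
from collections import Counter

def information_gain(comments, labels, feature):
    """comments: list of token sets; labels: list of 0/1 (1 = SATD present)."""
    n = len(comments)
    joint = Counter((feature in c, bool(l)) for c, l in zip(comments, labels))
    p_t = sum(labels) / n                          # p(t): probability a comment is SATD
    ig = 0.0
    for w in (True, False):
        p_w = (joint[(w, True)] + joint[(w, False)]) / n
        for t in (True, False):
            p_wt = joint[(w, t)] / n
            pt = p_t if t else 1 - p_t
            if p_wt > 0:
                ig += p_wt * math.log2(p_wt / (p_w * pt))
    return ig

def select_top_features(comments, labels, k_percent=10):
    vocab = set().union(*comments)
    ranked = sorted(vocab, key=lambda f: information_gain(comments, labels, f), reverse=True)
    return ranked[: max(1, len(ranked) * k_percent // 100)]
```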
Step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
In the experiments, classifiers No. 2, 3, 4, 5, 6, and 8 are set as polynomial naive Bayes classifiers (Naïve Bayes Multinomial, NBM) and trained with the NBM method. NBM adds the assumption of a multinomial distribution on top of Naive Bayes (NB); its principle is similar to NB, and both belong to the Bayesian family. The main advantages of training a classifier with a Bayesian method are its short computation time and high training performance, since it assumes that the given labels and features are conditionally independent. Thus, let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
Applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
The classification label of the comment can be identified by equation (4). Note that NB considers only whether a feature is present in the current comment to be tested; NBM is similar, but determines the classification label from the number of times each feature appears in the comment. Analysis and experiments show that NBM performs better than NB when certain specific features occur many times in the comment set.
(2) Simple Logistic
In this experiment, two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic). The Simple Logistic method builds on simple logistic regression and iterates with the LogitBoost algorithm; each iteration optimizes the parameters of the basic weak classifier, finally forming a high-precision model. By default LogitBoost runs 10 iterations; if the iterations work poorly, the optimal number of iterations can be obtained with K-fold cross-validation.
Let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present. Let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment. By the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
Substituting $z$ into the sigmoid function, which is expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
According to the final value of the sigmoid function, the comments to be detected are divided into two categories; a label value of 1 marks a comment statement containing self-admitted technical debt.
Step 4: sub-classifier voting rule
Because the data-training process is divided among several sub-classifiers that each predict a classification label, the final accuracy of the ensemble classifier improves markedly provided each sub-classifier is itself accurate. The invention takes the classification label predicted by the majority of the sub-classifiers as the final prediction of the ensemble classifier.
Step 5: clustering for self-admitted technical debt classification
Based on the features selected by information gain in the preceding steps, the invention re-counts and processes them according to how often each feature appears, where it appears, and the habits of the developers, and finally classifies the feature words with a clustering method.
First, the feature frequencies in the source code are counted and the high-frequency feature words are selected. Then, considering personal preference factors, some features that are merely referential and do not indicate technical debt are deleted. Words with emotional color, which rarely appear in comments without self-admitted technical debt, are kept as needed. Finally, some modal verbs are also deleted. The selection standard is to keep, for the final classification and detection, the features with large influence weights, and to screen out the features whose influence the analysis shows to be small.
The research framework of the invention is divided into two stages: a model-building stage and a prediction stage. In the model-building stage, source projects whose comments have known classification labels are input as the training data set, and a sub-classifier is built for each individual source project. In the prediction stage, all the sub-classifiers are integrated to jointly predict whether a comment in the target project contains self-admitted technical debt. To make the result as accurate as possible, only one project at a time is selected as the target project for prediction, and the other n-1 projects are input into the model as source projects used to train the sub-classifiers.
The method comprises the following steps:
Step one: text preprocessing
Before the feature values are selected, the original project comments are text-preprocessed. This is done because the desired features are core words, while the source-project comments contain a large number of punctuation marks, stop words, and the like. Moreover, many words share the same stem; in a classification problem these can be simplified to one representative word without affecting the classification result, while improving classification efficiency. Text processing is therefore divided into 3 steps (a code sketch follows the third step):
(1) Tokenization: the source-project comment text is divided into words, phrases, symbols, or other meaningful elements. The experiments keep only the features consisting of English letters, i.e. all punctuation marks are deleted first; in addition, some word features come with punctuation or digits attached, and these characters must also be deleted so that only the word remains, for example: "TODO:" is tokenized as "TODO". Finally, all word features are converted to lower case.
(2) Stop-word removal: stop words are words used frequently when writing comments but of little use for the self-admitted technical debt detection problem the invention addresses, because they carry no practical signal for identifying such debt. Common stop words include "I", "should", "to", "the", and so on. Although much text-mining work provides a standard stop-word list, some standard stop words are actually useful for classification in the self-admitted technical debt detection problem: in a comment with self-admitted technical debt that begins with "TODO", for instance, some of the accompanying words carry information useful for classification even though they are treated as stop words by default. Therefore the invention builds a stop-word list specific to the self-admitted technical debt detection problem, containing only a few prepositions useless for classification (such as "the", "to", and "is"). Words no longer than 2 characters or longer than 20 characters are also treated as stop words.
(3) Word stemming: stemming is a process that unifies words (sometimes derivatives) into their stem, root, or base form. For example, the words "stems" and "stemmed" are both reduced to "stem". The invention uses the well-known Porter stemmer to implement stemming, thereby reducing redundant synonyms.
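The three text-processing steps can be sketched as follows, assuming the NLTK library for the Porter stemmer; the small stop-word list and the length thresholds follow the description above, and all names are illustrative.

```python
# Sketch of tokenization, stop-word removal, and Porter stemming (steps (1)-(3)).
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

STOP_WORDS = {"the", "to", "is"}     # small task-specific list, as described above
stemmer = PorterStemmer()

def text_preprocess(comment):
    tokens = re.findall(r"[A-Za-z]+", comment.lower())       # (1) keep letter runs only, lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS
              and 2 < len(t) <= 20]                           # (2) stop words and length limits
    return [stemmer.stem(t) for t in tokens]                  # (3) Porter stemming

print(text_preprocess("TODO: the stemmed features should be implemented"))
# -> ['todo', 'stem', 'featur', 'should', 'implement']
```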
Step two: feature value selection
Not all of the preprocessed features are used to train the classifier: with too many features the classifier's efficiency is too low and the noise increases, so the top k most useful features are selected with a feature selection method.
Shannon first formalized the amount of information; the invention selects useful features with information gain, a widely used feature selection method built on that foundation.
The amount of information is defined with a logarithmic function: for an event $x$ with probability of occurrence $p(x)$, the amount of information of the event $x$ is defined as:
$$I(x) = -\log p(x) \qquad (7)$$
As equation (7) shows, the amount of information reflects the uncertainty of the event's occurrence: the smaller the uncertainty, the smaller the amount of information; conversely, the greater the uncertainty, the greater the amount of information.
Shannon later proposed the definition of information entropy. Information entropy is a measure of the amount of information needed to remove uncertainty; it is the expectation of the amount of information an event may produce, taken over all possible outcomes of the event. In general, information entropy is a measure of the amount of information: for a variable $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, its information entropy is defined as:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \qquad (8)$$
where $p(x_i)$ is the probability that the random variable $X$ takes the value $x_i$. The information entropy depends only on the distribution of the random variable $X$, not on its values; its magnitude indicates the uncertainty of the random variable taking a given value. The curve of information entropy versus probability is shown in FIG. 2.
As FIG. 2 shows, when $p(x_i) = 0$ or $1$, $H = 0$, i.e. the uncertainty is 0; when $p(x_i) = 0.5$ the entropy reaches its maximum, i.e. the uncertainty about whether the random variable $X$ takes $x_i$ is greatest.
If a precondition is attached to the occurrence of the event, the conditional entropy $H(X \mid Y)$ is obtained; it is the entropy of the random variable $X$ averaged over all possible values of the random variable $Y$:
$$H(X \mid Y) = \sum_{y} p(y)\, H(X \mid Y = y) \qquad (9)$$
$$H(X \mid Y = y) = -\sum_{i=1}^{n} p(x_i \mid y) \log p(x_i \mid y) \qquad (10)$$
Substituting equation (10) into equation (9) gives the final form of the conditional entropy:
$$H(X \mid Y) = -\sum_{y} p(y) \sum_{i=1}^{n} p(x_i \mid y) \log p(x_i \mid y) \qquad (11)$$
With information entropy and conditional entropy in hand, information gain can be defined from the two. The information gain is defined as the difference between the entropy of the information set to be classified and its conditional entropy after a certain feature has been selected:
$$IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (12)$$
The formula above and the related concepts make the information gain easy to understand. For a certain feature set $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, $H(X \mid Y)$ is definite given the condition $Y$; the smaller the remaining uncertainty of the feature set $X$, the larger the information gain, which indicates that the feature performs better. Feature selection is performed by calculating the information gain of the features; usually the top k features with high IG values are selected, or a threshold is set for screening.
In the data set, the data are represented as features and comments, and analysis shows that four relationships may exist: the comment contains the feature and self-admitted technical debt exists; it contains the feature and no self-admitted technical debt exists; it does not contain the feature but self-admitted technical debt exists; and it contains neither the feature nor self-admitted technical debt.
Based on the above four possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (13)$$
where $p(w', t')$ is the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ is the probability that the feature $w'$ appears in a comment, and $p(t')$ is the probability of a comment having label $t'$.
After the information gain corresponding to each feature has been calculated with this method, the top k% of features, ordered from largest to smallest information gain, are selected and the other features are discarded. In the experiments of the invention, the top 10% of the total number of features is selected empirically.
Step three: training the sub-classifiers
On the basis of the selected feature values, the Naïve Bayes Multinomial and linear logistic regression (Simple Logistic) methods are used to train the corresponding sub-classifiers. The sub-classifiers are finally combined into an ensemble classifier that predicts the data to be detected.
Some sub-classifiers are trained with the naive Bayes method: on a comment set $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$, the labels and features can be assumed conditionally independent, and Bayes' theorem is then used on that conditional basis to obtain the classification label representing the comment.
The polynomial naive Bayes classifier (Naïve Bayes Multinomial) is a specific instance of the naive Bayes classifier. The naive Bayes classifier emphasizes the independence of events under given conditions, while the polynomial naive Bayes classifier additionally assumes that events follow a multinomial distribution; the two are similar in principle.
The naive Bayes classification algorithm is based on Bayes' rule and has the following precondition: given $Y$, the components of the event $X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ are mutually independent. This assumption greatly simplifies the representation of $P(X \mid Y)$ and simplifies the problems encountered in evaluating the data set. For the event $X = \{x_1, x_2, \ldots, x_n\}$ with the $x_i$ independent given the condition $Y$, it follows that:
$$P(X \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y) \qquad (14)$$
It is generally assumed that $Y$ is an arbitrary discrete variable and that the event $X = \{x_1, x_2, \ldots, x_n\}$ consists of arbitrary discrete or real-valued variables. In training the classifier, the goal is to output, for each instance that needs to be classified, the probability distribution over the possible values of $Y$. According to Bayes' rule, the probability that $Y$ takes its $k$-th possible value is:
$$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k)\, P(x_1, \ldots, x_n \mid Y = y_k)}{\sum_{j} P(Y = y_j)\, P(x_1, \ldots, x_n \mid Y = y_j)} \qquad (15)$$
Assuming now that the $x_i$ are independent given the condition $Y$, equation (15) can be rewritten as:
$$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k) \prod_i P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_i P(x_i \mid Y = y_j)} \qquad (16)$$
Equation (16) is the basic equation of the polynomial naive Bayes classifier; $P(Y = y_k \mid x_1, x_2, \ldots, x_n)$ is called the posterior probability.
Now, given a new instance $X^{new} = \{x_1, x_2, \ldots, x_n\}$ from the data set, the prior probabilities $P(Y)$, and the conditional probabilities $P(x_i \mid Y)$, the most likely value of $Y$ (i.e. the classification label) is given by the naive Bayes classification rule:
$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_i P(x_i \mid Y = y_j)} \qquad (17)$$
Equation (17) is generally simplified as:
$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(x_i \mid Y = y_k) \qquad (18)$$
That is, the $y_k$ that maximizes equation (18) is the final classification result.
The other sub-classifiers are trained with linear logistic regression: on the basis of simple logistic, the LogitBoost algorithm is added for iteration, and the optimal number of iterations is obtained with K-fold cross-validation. In the prediction process, the comment data set is put into linear logistic regression form and substituted into the sigmoid function, and the final result of the sigmoid function is given a 0/1 label to predict whether the comment is a comment sentence containing self-admitted technical debt.
First, for a given data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, $(x_i, y_i)$ denotes the $i$-th sample, where $x_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$ holds the $n$ features of each datum and the classification label is $y_i \in \{0, 1\}$. Assume the $n$ features of $x_i$ combine linearly, i.e.:
$$z = \theta x_i + b = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_n x_{in} + b \qquad (19)$$
For simplicity, writing $b$ in equation (19) as $\theta_0$ gives:
$$z = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_n x_{in} + \theta_0 = \theta^T X \qquad (20)$$
The invention aims at a binary classification problem and wants the final function to display the classification result intuitively, so the sigmoid function is adopted, expressed as:
$$g(z) = \frac{1}{1 + e^{-z}} \qquad (21)$$
Substituting equation (20) into equation (21) yields:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}} \qquad (22)$$
The function curve is shown in FIG. 3.
As the figure shows, the range of the sigmoid function is [0, 1]. It can be stipulated that the final result is judged to be 1 when $y$ is greater than 0.5 and to be 0 when $y$ is less than 0.5, which realizes the binary classification clearly. If we further let:
$$P(y = 1 \mid x; \theta) = h_\theta(x) \qquad (23)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \qquad (24)$$
then the loss function of logistic regression can be obtained:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \bigl[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \bigr] \qquad (25)$$
Minimizing the result of equation (25) over $\theta$, the parameter in equation (20), gives the functional representation of the final logistic regression classifier.
The Simple Logistic classifier adopted by the invention applies the LogitBoost algorithm on top of weak classifiers trained with logistic regression. LogitBoost is one of the boosting algorithms developed in recent years, following the boosting line of Schapire and Singer. Boosting was originally designed to combine several weak classifiers to improve classification performance; later, Freund and Schapire proposed the more practical boosting algorithm AdaBoost, but that algorithm suffers from overfitting when processing noisy data. For this case, Friedman et al. proposed the LogitBoost algorithm to reduce the training error linearly.
The LogitBoost algorithm proceeds as follows:
◆ Input the data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in X$ and $y_i \in Y = \{-1, 1\}$ (with $y_i^* = (y_i + 1)/2 \in \{0, 1\}$), and input the number of iterations $T$.
◆ Initialize the weights $w_i = 1/N$ $(i = 1, \ldots, N)$, the additive model $F(x) = 0$, and the probabilities $p(x_i) = 1/2$.
◆ Repeat for iterations $t = 1, \ldots, T$:
a. Calculate the weights and working responses:
$$w_i = p(x_i)\,[1 - p(x_i)] \qquad (26)$$
$$z_i = \frac{y_i^* - p(x_i)}{p(x_i)\,[1 - p(x_i)]} \qquad (27)$$
b. With the $w_i$ as weights, fit the weak classifier function $f_t(x)$ to the working responses $z_i$ by weighted least squares, where $f_t(x_i)$ denotes the output of the weak classifier; the invention trains the weak classifiers with logistic regression functions:
$$f_t = \arg\min_{f} \sum_{i} w_i \bigl(z_i - f(x_i)\bigr)^2 \qquad (28)$$
c. Update $F(x)$ and $p(x)$ for this round of iteration:
$$F(x) \leftarrow F(x) + \frac{1}{2} f_t(x) \qquad (29)$$
$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} \qquad (30)$$
◆ Output the final classification $LF(x) = \operatorname{sign}[F(x)]$.
Step four: sub-classifier voting rule
In the prediction stage, the classifiers trained on the source projects must predict the classification labels of the comments to be predicted in the target project. Since each project has comments of a different style, with different feature distributions, the experiment builds the ensemble classifier from the individual sub-classifiers. Each sub-classifier is trained, according to the characteristics of its own source project, with the method suited to its own data, and the sub-classifiers are independent: they cannot interfere with each other's prediction process. Therefore, provided each sub-classifier is accurate, the final accuracy of the ensemble classifier also improves markedly. The invention takes the classification label predicted by the majority of the sub-classifiers as the prediction of the final ensemble classifier. The prediction process thus resembles an election: each sub-classifier "votes" to determine the final "winner" (i.e. the comment's classification label).
Table 1 gives the voting process of the sub-classifiers used to predict the classification label. The columns correspond to the set of sub-classifiers and the prediction of each sub-classifier, and the last row aggregates them into the final output of the ensemble classifier. In the example there are 7 sub-classifiers in total; the data used to train these 7 sub-classifiers are assumed to come from 7 different source projects. The prediction of 3 sub-classifiers is "no self-admitted technical debt" (Negative) and the predictions of the other four are "self-admitted technical debt" (Positive), so the final output of the ensemble classifier is that the comment contains self-admitted technical debt.
TABLE 1 Example of sub-classifier voting
[Table 1: each of the seven sub-classifiers casts one vote on the comment's label (three Negative, four Positive), and the aggregated output of the ensemble classifier is Positive.]
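In code, the aggregation of Table 1 is a one-line majority count over the sub-classifier outputs; the vote order below is an invented illustration consistent with the table's three Negative and four Positive votes.

```python
# Sketch: majority vote over the seven sub-classifier predictions of Table 1.
from collections import Counter

votes = ["Negative", "Positive", "Negative", "Positive",
         "Positive", "Negative", "Positive"]       # illustrative order
final = Counter(votes).most_common(1)[0][0]
print(final)  # -> Positive: four of seven sub-classifiers predict SATD
```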
Step five: clustering for self-admitted technical debt classification
The features obtained from the detection are analyzed according to the information gain values observed in the detection process. The analysis shows that different projects may share some common features representing self-admitted technical debt, such as "todo", "fixme", "workaround", "implement", and "hack", but the frequency with which these features appear may vary from project to project. For example, some developers prefer to use the word "hack" for temporarily fixed problems, while others prefer the word "workaround". Further, these words may also appear in comments where no self-admitted technical debt exists. For example, when the word "implement" appears in a comment without self-admitted technical debt, it means the developer has written code that implements some functionality (e.g. "implements backspace function"), for indicative purposes. However, in comments with self-admitted technical debt, the word usually means the developer needs to implement some functionality but has not yet finished (e.g. "Bunch of methods still not implemented").
Some developers prefer to use words with emotional color (e.g. yuck, ugly, stupid, ill) when writing self-admitted technical debt comments. These words rarely appear in comments without self-admitted technical debt, but sometimes developers want to remind themselves to avoid writing low-quality code, and in such cases these words occasionally do appear. In addition, developers prefer to use modal verbs, question words, and comparatives when addressing self-admitted technical debt: the modal verbs include "should", "need", "can", "would", etc.; the question words include "what", "how", "where", etc.; and the comparatives include "better", "most", "fast", etc.
Reading the comments containing self-admitted technical debt leads to the conclusion that in many cases developers had to repair a program or implement a function quickly, in a short time. That is, much self-admitted technical debt is incurred while the developer is under time pressure or emotional tension.
The invention processes and counts the original data set again, combines the result with the 5 types of self-admitted technical debt proposed by Maldonado and Shihab during their research, and finally classifies the counted feature values with a clustering method; the results obtained are shown in Table 2:
TABLE 2 Typical feature classification of self-admitted technical debt
[Table 2: typical feature words grouped by the type of self-admitted technical debt they indicate; the original table image is not reproduced here.]
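As an illustration of this final step, frequent high-information-gain feature words can be embedded by simple usage statistics and grouped with k-means; the statistics, word list, and cluster count below are invented for illustration and are not the patent's measured data.

```python
# Sketch: cluster candidate SATD feature words by usage statistics (step five).
import numpy as np
from sklearn.cluster import KMeans

# Invented per-word statistics: [count in SATD comments, count in non-SATD
# comments, mean relative position within the comment].
words = ["todo", "fixme", "hack", "workaround", "ugly", "should", "how", "better"]
stats = np.array([
    [120, 10, 0.1], [80, 5, 0.1], [40, 8, 0.4], [35, 6, 0.4],
    [20, 2, 0.5], [90, 60, 0.3], [25, 20, 0.2], [15, 12, 0.6],
], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(stats)
for word, cluster in zip(words, kmeans.labels_):
    print(f"{word}: cluster {cluster}")   # words sharing a cluster suggest one debt type
```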

Claims (2)

1. A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
step 1: preprocessing the feature words
process the raw comment data using heuristic rules:
(1) delete the fixed-format license description comments generated automatically by the compiler;
(2) merge multi-line comments into one sentence;
(3) delete code present in comment statements;
(4) delete Javadoc that contains no reserved words, and keep the comment sentences that contain reserved words;
step 2: selecting the top k most useful features to train the classifier
after text preprocessing of the source-project comments, a vector space model (VSM) is used to process the words that have been divided into features; in this model each comment is represented by a word vector, the divided word features can be regarded as dimensions, and each comment can be regarded as a data point in a high-dimensional space; a HashMap is used as the mapping of the VSM model, in which the string key is a divided feature and the double-precision value is the term frequency, i.e. the number of times the feature appears in the current comment, normalized;
information gain, a widely used feature selection method, is employed to select useful features: let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether ($t$) or not ($\bar{t}$) self-admitted technical debt is present; let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment; between a feature $w$ and a comment $C_i$ there are 4 possible relationships:
- $(w, t)$: comment $C_i$ contains the feature $w$, and self-admitted technical debt is present in the comment (i.e. $t$);
- $(w, \bar{t})$: comment $C_i$ contains the feature $w$, but no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
- $(\bar{w}, t)$: comment $C_i$ does not contain the feature $w$, but self-admitted technical debt is present in the comment (i.e. $t$);
- $(\bar{w}, \bar{t})$: comment $C_i$ does not contain the feature $w$, and no self-admitted technical debt is present in the comment (i.e. $\bar{t}$);
based on the above 4 possible relationships, the information gain of the feature $w$ with respect to the label $t$ is calculated as:
$$IG(w, t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\, p(t')} \qquad (1)$$
where $p(w', t')$ represents the probability that the feature $w'$ appears in a comment with label $t'$, $p(w')$ the probability that the feature $w'$ appears in a comment, and $p(t')$ the probability of a comment having label $t'$;
after the information gain value corresponding to each feature has been calculated, the features are sorted from largest to smallest information gain value; the higher the score, the more important the feature is for predicting the classification label; the method keeps the features whose information gain values are in the top k% and discards the other features;
step 3: training sub-classifiers using Naïve Bayes Multinomial and linear logistic regression
(1) Naïve Bayes Multinomial
six classifiers, namely classifiers No. 2, 3, 4, 5, 6, and 8, are set as polynomial naive Bayes classifiers (NBM) and trained with the NBM method; let a comment be $C_i = \{w_1, w_2, \ldots, w_n\}$ with classification label $L_i$; then:
$$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i) \qquad (3)$$
applying Bayes' theorem to equation (3) yields:
$$p(L_i \mid C_i) = \frac{p(L_i)\, p(C_i \mid L_i)}{p(C_i)} = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)} \qquad (4)$$
the classification label of the comment is identified by equation (4);
(2) Simple Logistic
two classifiers, namely classifier No. 1 and classifier No. 7, are set as linear logistic regression classifiers (Simple Logistic); let the comment data set be denoted $C = \{(C_1, L_1), (C_2, L_2), \ldots, (C_N, L_N)\}$, where $C_i$ is the $i$-th comment and $L_i$ is its classification label, i.e. whether self-admitted technical debt is present; let $C_i = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of features in comment $C_i$ and $w_i$ is the $i$-th feature in the comment; by the linear logistic regression theorem:
$$z = \theta_1 w_1 + \theta_2 w_2 + \cdots + \theta_n w_n + \theta_0 = \theta^T C_i \qquad (5)$$
substituting $z$ into the sigmoid function, expressed as:
$$h(z) = \frac{1}{1 + e^{-z}} \qquad (6)$$
according to the final value of the sigmoid function, the comments to be detected are divided into two categories, a label value of 1 marking a comment statement containing self-admitted technical debt;
step 4: sub-classifier voting rule
a voting rule is adopted in which the classification label predicted by the majority of the sub-classifiers is taken as the final prediction of the ensemble classifier;
step 5: clustering for self-admitted technical debt classification
based on the features selected by information gain in the preceding steps, the original data are re-screened according to how often each feature appears, where it appears, and the habits of the developers, and the feature words are finally classified with a clustering method.
2. The self-admitted technical debt detection and classification method based on multi-method ensemble learning of claim 1, wherein k% in step 2 is 10%.
CN202010568813.6A 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning Active CN111782807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Publications (2)

Publication Number Publication Date
CN111782807A true CN111782807A (en) 2020-10-16
CN111782807B CN111782807B (en) 2024-05-24

Family

ID=72756715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568813.6A Active CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Country Status (1)

Country Link
CN (1) CN111782807B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112748951A (en) * 2021-01-21 2021-05-04 杭州电子科技大学 XGBoost-based self-admitted technical debt multi-classification method
CN112800232A (en) * 2021-04-01 2021-05-14 南京视察者智能科技有限公司 Big data based case automatic classification and optimization method and training set correction method
CN113313184A (en) * 2021-06-07 2021-08-27 西北工业大学 Heterogeneous-ensemble-based automatic self-admitted technical debt detection method
CN113377422A (en) * 2021-06-09 2021-09-10 大连海事大学 Deep-learning-based self-admitted technical debt identification method
CN113407439A (en) * 2021-05-24 2021-09-17 西北工业大学 Detection method for software self-recognition type technical debt
US11971804B1 (en) 2021-06-15 2024-04-30 Allstate Insurance Company Methods and systems for an intelligent technical debt helper bot

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171871A1 (en) * 2007-03-23 2009-07-02 Three Palm Software Combination machine learning algorithms for computer-aided detection, review and diagnosis
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
US20090171871A1 (en) * 2007-03-23 2009-07-02 Three Palm Software Combination machine learning algorithms for computer-aided detection, review and diagnosis
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GABRIELE BAVOTA ET AL.: "A large scale empirical study on self-admitted technical debt", 2016 IEEE/ACM 13th Working Conference, 30 November 2016 (2016-11-30) *
POTDAR A. ET AL.: "An exploratory study on self-admitted technical debt", IEEE International Conference on Software Maintenance and Evolution *
LIU YAJUN ET AL.: "Research on Technical Debt Management in Software Integrated Development Environments", Computer Science, vol. 44, no. 11 *
CHEN SONGFENG; FAN MING: "Building a Bayes-based Combined Classifier Using PCA and AdaBoost", Computer Science, no. 08, 15 August 2010 (2010-08-15) *
HAN SUQING; CHENG HUIWEN; WANG BAOLI: "Research on a Three-way Decision Naive Bayes Incremental Learning Algorithm", Computer Engineering and Applications, no. 18 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112748951A (en) * 2021-01-21 2021-05-04 杭州电子科技大学 XGboost-based self-acceptance technology debt multi-classification method
CN112748951B (en) * 2021-01-21 2022-04-22 杭州电子科技大学 XGboost-based self-acceptance technology debt multi-classification method
CN112800232A (en) * 2021-04-01 2021-05-14 南京视察者智能科技有限公司 Big data based case automatic classification and optimization method and training set correction method
CN113407439A (en) * 2021-05-24 2021-09-17 西北工业大学 Detection method for software self-recognition type technical debt
CN113407439B (en) * 2021-05-24 2024-02-27 Detection method for software self-recognition type technical debt
CN113313184A (en) * 2021-06-07 2021-08-27 西北工业大学 Heterogeneous integrated self-acceptance technology debt automatic detection method
CN113313184B (en) * 2021-06-07 2024-05-24 Heterogeneous integrated self-acceptance technology debt automatic detection method
CN113377422A (en) * 2021-06-09 2021-09-10 大连海事大学 Method for identifying self-recognition technology debt based on deep learning
CN113377422B (en) * 2021-06-09 2024-04-05 Method for identifying self-recognition technology debt based on deep learning
US11971804B1 (en) 2021-06-15 2024-04-30 Allstate Insurance Company Methods and systems for an intelligent technical debt helper bot

Also Published As

Publication number Publication date
CN111782807B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN111782807A (en) Self-acceptance technology debt detection and classification method based on multi-method ensemble learning
CN110209806B (en) Text classification method, text classification device and computer readable storage medium
CN107992597B (en) Text structuring method for power grid fault case
Weiss et al. Structured prediction cascades
US7606784B2 (en) Uncertainty management in a decision-making system
CN112364638B (en) Personality identification method based on social text
CN107193804A (en) A text feature selection method for spam messages oriented to words and compound words
CN112966068A (en) Resume identification method and device based on webpage information
CN111522953B (en) Marginal attack method and device for naive Bayes classifier and storage medium
Ababneh Investigating the relevance of Arabic text classification datasets based on supervised learning
dos Reis et al. One-class quantification
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
Jeyakarthic et al. Optimal bidirectional long short term memory based sentiment analysis with sarcasm detection and classification on twitter data
Heid et al. Reliable part-of-speech tagging of historical corpora through set-valued prediction
CN115841105B (en) Event extraction method, system and medium based on event type hierarchical relationship
Sheng et al. A paper quality and comment consistency detection model based on feature dimensionality reduction
CN115496630A (en) Patent writing quality checking method and system based on natural language algorithm
Sherrod Predictive modelling software
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
Klinger et al. Feature subset selection in conditional random fields for named entity recognition
CN114443840A (en) Text classification method, device and equipment
Do Van et al. Classification and variable selection using the mining of positive and negative association rules
Hamdy et al. Deep embedding of open source software bug repositories for severity prediction
MacNamara et al. Neural networks for language identification: a comparative study
Dang et al. Unsupervised threshold autoencoder to analyze and understand sentence elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant