CN112748951B - XGboost-based self-acceptance technology debt multi-classification method - Google Patents

XGboost-based self-acceptance technology debt multi-classification method

Info

Publication number
CN112748951B
Authority
CN
China
Prior art keywords
code
code annotation
annotation
text information
technical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110081268.2A
Other languages
Chinese (zh)
Other versions
CN112748951A (en)
Inventor
陈信
俞东进
范旭麟
王琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110081268.2A priority Critical patent/CN112748951B/en
Publication of CN112748951A publication Critical patent/CN112748951A/en
Application granted granted Critical
Publication of CN112748951B publication Critical patent/CN112748951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/73Program documentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-acceptance technical debt multi-classification method based on XGboost. By constructing an XGboost-based self-acceptance technical debt classifier, the method can classify self-acceptance technical debt effectively. The method also augments the data with the random swap and random shuffle strategies of the EDA method and uses a class-spacing metric to evaluate the quality of the generated data, which effectively mitigates the sample imbalance problem. In addition, the method extracts features with CHI and selects the s highest-scoring words (s is 10% of the total number of distinct features), which accelerates model training and improves model performance. The method can effectively classify the technical debt in software, reduce the cost of software maintenance, and is of great significance to software maintenance.

Description

XGboost-based self-acceptance technology debt multi-classification method
Technical Field
The invention relates to the field of software maintenance, in particular to a self-acceptance technology debt multi-classification method based on XGboost.
Background
Technical Debt (TD) is a metaphor for the irregular code produced when software developers adopt compromise development solutions to meet immediate business requirements or budget and time constraints. Research has shown that technical debt can significantly degrade software quality and poses many challenges to software maintenance. Its influence on software falls mainly into three aspects: maintainability, evolvability and visibility. First, code carrying technical debt is hard to read and to understand by others and may contain code smells, which hinders the extension and enhancement of the software and makes maintenance difficult. Second, technical debt reduces a software system's ability to adapt to change, making it hard for the system to support rapid iteration and evolution of functionality, so that its usability, scalability and flexibility struggle to meet actual requirements. Third, for end users, technical debt causes defects in functionality, design and user experience that prevent them from completing established business processes, turning an invisible code problem into a visible quality problem; for developers, a bloated technical architecture and scattered business logic prevent the product from responding quickly to changing requirements and delay delivery, turning a hard-to-understand invisible architecture problem into a visible software delivery risk.
Because technical debt is invisible and can persist in a project for a long time, detecting and eliminating it has become an urgent problem and has drawn considerable attention from researchers. As research progressed, researchers found that developers sometimes intentionally introduce incomplete or defective code or solutions into a project, owing to factors such as tight development schedules, limited budgets or commercial interests, and record this in code annotations; this type of technical debt is known as self-acceptance (self-admitted) technical debt (SATD). Prior work classifies self-acceptance technical debt into 5 categories: design technical debt (design TD), requirement technical debt (requirement TD), defect technical debt (defect TD), test technical debt (test TD) and documentation technical debt (documentation TD). Different categories of technical debt may need to be handled by different personnel (for example, test technical debt is resolved by testers, while defect technical debt is resolved by developers), so correctly classifying technical debt helps development teams improve their work efficiency.
Currently, researchers focus mainly on detecting technical debt in software, i.e., analyzing source code or code annotations and devising automated or semi-automated methods to identify whether technical debt exists. Very little research, however, has addressed the multi-class classification of self-acceptance technical debt. Experience shows that different types of technical debt affect software development differently: design debt often means there are significant problems in the code and therefore a high maintenance cost, while defect debt indicates that the software may contain defects or crashes and needs to be removed promptly. Identifying the different categories of technical debt therefore helps developers better understand the technical debt in their software and improves the efficiency with which they repair it.
Disclosure of Invention
In order to effectively identify different types of technical debt, the invention provides an XGboost-based multi-classification method for self-acceptance technical debt that can classify technical debt effectively.
The technical scheme adopted by the invention is as follows:
Step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n; here id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, namely the type of technical debt.
Step (2): preprocess the comment field of each sample Si.
First, completely identical samples in the original dataset are filtered out using exact string matching and cosine similarity calculation;
then, the historical version records contained in the code annotations are deleted;
finally, noise information in the code annotations is deleted, including numbers, punctuation, URLs, source code and stop words, and all words are converted to lower case.
After preprocessing, each sample is Si = <id, preComment, LB>, where preComment denotes the text of the preprocessed code annotation.
Step (3): perform data augmentation on the text of the preprocessed code annotations. Because design technical debt annotations are the most numerous in the dataset while requirement technical debt and defect technical debt annotations are relatively scarce, the class imbalance harms the classifier model. The code annotation text of these two classes is therefore augmented with the random swap and random shuffle strategies of the EDA (Easy Data Augmentation) method.
Step (4): compute the weight of each feature in the samples with the chi-square statistic, sort the features by weight in descending order, and select the s features with the largest weights.
Step (5): use the CountVectorizer method to represent all code annotation texts as an n×s word frequency matrix FM, where element FM[i][j] indicates the number of occurrences of the j-th word in the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s.
Step (6): construct the XGboost-based classifier model.
First, according to the word frequency matrix FM, each sample Si in the code annotation set is represented as Si = (xi, yi), where xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label.
Then, the predicted values of all code annotations are calculated.
Finally, the classifier model is trained additively: at each step, the currently best tree model is added to the classifier.
Step (7): train the classifier model using leave-one-out cross-validation.
Assuming the dataset contains p projects, the code annotations of p−1 projects are selected as the training set and the code annotations of the remaining project are used as the test set; in the training set, design technical debt annotations are labeled 0, requirement technical debt annotations 1, and defect technical debt annotations 2. The trained classifier model is finally obtained through continuous iteration and optimization of the model.
Step (8): classification prediction.
For a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text of each code annotation as a vector over the selected features. Each vector is fed into the classifier model, which computes a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
Compared with traditional classification methods, the invention has the following beneficial effects:
1. An XGboost-based self-acceptance technical debt classifier is constructed, which classifies self-acceptance technical debt effectively and improves classification accuracy.
2. The invention augments the data with the random swap and random shuffle strategies and uses a class-spacing metric to evaluate the quality of the generated data, effectively overcoming the sample imbalance problem.
3. Feature extraction is carried out with CHI and the s highest-scoring words are selected, which accelerates model training and improves model performance.
Drawings
Fig. 1 is a flowchart of a self-acceptance technology debt multi-classification method based on XGBoost according to the present invention.
Detailed Description
Data source: the raw data used in this embodiment come from the public dataset compiled by Maldonado and Shihab. The data cover 10 open source projects: Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby, and Squirrel. When constructing the dataset, Maldonado and Shihab used JDeodorant to extract the code annotations of these ten projects and applied existing heuristics to remove irrelevant annotations (e.g., annotations automatically generated by tools, partial code segments, etc.). Each code annotation in the dataset is labeled with its debt type. Since the present invention primarily identifies the design, requirement and defect technical debt among self-acceptance technical debt annotations, only the data related to design technical debt, requirement technical debt and defect technical debt are used.
In order to make the purpose, technical scheme and advantages of the present invention clearer, the XGBoost-based self-acceptance technical debt multi-classification method provided by the invention is described in detail below with reference to Fig. 1; it includes the following steps:
Step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n; here id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, namely the type of technical debt.
Step (2): preprocess the comment field of each sample Si:
2-1, filter out completely identical samples in the original dataset using exact string matching and cosine similarity calculation (similarity equal to 1);
2-2, delete the historical version records contained in the code annotations (such a record usually takes the form "xx-xx-xx: text", where "xx-xx-xx" is a date and "text" is the history entry);
2-3, delete noise information such as numbers, punctuation, URLs (uniform resource locators) and source code contained in the code annotations, and convert all words to lower case;
2-4, build a stop-word list that not only contains words such as "the", "an", "for" and "a" but also treats words shorter than 3 characters or longer than 20 characters as stop words, mainly because single English words are usually shorter than 20 characters, while words of fewer than 3 characters are generally articles or similar function words that provide little useful information for classification;
and 2-5, delete the stop words contained in the code annotation text according to the stop-word list.
Each sample after processing is denoted Si = <id, preComment, LB>, where preComment denotes the text of the preprocessed code annotation.
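As an illustration only, a minimal Python sketch of this preprocessing is given below; the stop-word list, the regular expressions and the example comment are simplified assumptions, not the patented implementation.

import re

# Illustrative stop-word list; the patent's full list also covers other common English function words.
STOP_WORDS = {"the", "an", "a", "for", "and", "of", "to", "in", "is"}

def preprocess_comment(comment):
    """Roughly mirrors steps 2-2 to 2-5: drop date-like history markers, URLs,
    digits and punctuation; lower-case; remove stop words and words whose
    length is below 3 or above 20 characters."""
    text = comment.lower()
    text = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", " ", text)   # date part of "xx-xx-xx: text" records
    text = re.sub(r"https?://\S+", " ", text)              # URLs
    text = re.sub(r"[^a-z\s]", " ", text)                  # digits, punctuation, code symbols
    words = [w for w in text.split()
             if w not in STOP_WORDS and 3 <= len(w) <= 20]
    return " ".join(words)

print(preprocess_comment("TODO 2020-05-01: fix the URL parser, see https://example.com"))
# -> "todo fix url parser see"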
Step (3): perform data augmentation on the text of the preprocessed code annotations. Considering that design technical debt annotations in the dataset far outnumber requirement technical debt and defect technical debt annotations, data augmentation is applied to the latter two classes:
3-1, generate new requirement technical debt and defect technical debt samples using the random swap and random shuffle strategies of the EDA method:
random exchange: random selection of the same class of LB from the datasetrFor each sample, a random position is generated, and each sample is divided into two segments according to the generated random position. The code annotated pieces of text information of the two samples are then exchanged to form two new samples.
Random shuffle: a sample is randomly selected, and the word order of its code annotation text is randomly shuffled to form a new sample.
3-2, executing the random swap strategy 25 times and the random shuffle strategy 50 times generates 100 new samples. Because the generated samples may affect the classifier negatively, the class spacing is used to evaluate them, and the sample with the greatest average distance to all samples in all classes is selected:
d = (1/c) · Σ_{i=1..c} (1/n_i) · Σ_{j=1..n_i} dist(y, x_ij)
where c is the number of classes, n_i is the number of samples in the i-th class, y represents a generated sample, x_ij represents the j-th sample in the i-th class, and d represents the average distance from the generated sample to all samples in all classes.
This process is performed 1000 times for each minority class in the test set, yielding 1000 new samples for each minority class.
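A Python sketch of the two augmentation strategies and of the distance-based selection follows; the Euclidean distance over toy vectors and the nested averaging are illustrative assumptions, since the text above only fixes the quantities c, n_i, y and x_ij.

import random
import numpy as np

def random_swap(a, b):
    """Random swap (3-1): split two same-class annotation texts at random
    positions and exchange the tails to form two new samples."""
    wa, wb = a.split(), b.split()
    i = random.randint(1, max(1, len(wa) - 1))
    j = random.randint(1, max(1, len(wb) - 1))
    return " ".join(wa[:i] + wb[j:]), " ".join(wb[:j] + wa[i:])

def random_shuffle(a):
    """Random shuffle (3-1): randomly scramble the word order of one annotation."""
    words = a.split()
    random.shuffle(words)
    return " ".join(words)

def mean_class_distance(y_vec, class_vectors):
    """Average distance from a generated sample's vector y_vec to the samples
    of each class, averaged over the classes."""
    return float(np.mean([np.mean(np.linalg.norm(np.asarray(c) - y_vec, axis=1))
                          for c in class_vectors]))

print(random_swap("todo implement retry logic", "fixme missing null check"))
print(random_shuffle("todo implement retry logic"))
toy_classes = [np.array([[1.0, 0.0], [0.8, 0.1]]), np.array([[0.0, 1.0]])]
print(mean_class_distance(np.array([0.5, 0.5]), toy_classes))    # larger is better per step 3-2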
Step (4): use the chi-square statistic (CHI) to select the most representative features in the annotation text. CHI measures the dependency between a class LBr and a feature word wj and is calculated as follows:
CHI(wj, LBr) = N · (A·D − B·C)² / [ (A+B) · (C+D) · (A+C) · (B+D) ],  where N = A + B + C + D
where A is the number of code annotations that belong to class LBr and contain the word wj, B is the number that do not belong to class LBr but contain wj, C is the number that belong to class LBr but do not contain wj, and D is the number that neither belong to class LBr nor contain wj.
In this way the CHI score of each word is obtained, and the words are then ranked from high to low. Finally, the s words with the highest CHI scores are selected (s is 10% of the total number of distinct features), and the unselected feature words are removed from all annotations in turn.
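For illustration, the CHI score above can be computed directly from the four counts; the numbers in the example call are made-up toy values.

def chi_square(A, B, C, D):
    """CHI score of a word w_j for a class LB_r, from the counts defined above:
    A: in class and contains w_j;  B: not in class and contains w_j;
    C: in class and lacks w_j;     D: not in class and lacks w_j."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

print(chi_square(A=30, B=10, C=20, D=140))   # toy counts for one word against one debt class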
Step (5): use the CountVectorizer method to represent the text of all code annotations as an n×s word frequency matrix FM, where element FM[i][j] indicates the number of occurrences of the j-th word in the text of the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s.
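A minimal sketch of building the word frequency matrix with scikit-learn's CountVectorizer; scikit-learn is one possible implementation of the CountVectorizer method named above, and the comments are toy examples.

from sklearn.feature_extraction.text import CountVectorizer

comments = ["todo refactor this ugly workaround",      # toy, already preprocessed texts
            "fixme feature not implemented yet",
            "hack crashes when input is null"]

vectorizer = CountVectorizer()
fm = vectorizer.fit_transform(comments)        # sparse n x s word frequency matrix FM
print(vectorizer.get_feature_names_out())      # the feature words (columns of FM)
print(fm.toarray())                            # FM[i][j]: count of word j in annotation i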
Step (6): construct the XGboost-based classifier model. The main idea of XGboost is to combine many decision tree models of low individual classification accuracy into one high-accuracy classifier; the classifier model is built in a distributed manner and is continuously optimized along the direction of gradient descent during the iterations so that the final prediction result is optimal. XGboost is fast and robust. The specific steps are as follows:
6-1, according to the word frequency matrix FM, represent each sample Si in the code annotation set as Si = (xi, yi), where xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label.
6-2, calculating the predicted value of all code annotations, wherein the predicted value of the ith annotation can be calculated according to the following formula:
ŷ_i = Σ_{k=1..K} f_k(x_i),  f_k ∈ F
where F = {f(x) = ω_{q(x)}} (q: R^s → T, ω ∈ R^T) is the function space of regression trees, i.e., all possible regression trees; T represents the total number of leaf nodes of a regression tree; ω represents the weight of each leaf; q represents the structure of each tree, which maps each annotation x_i to its corresponding leaf node; and K represents the number of regression trees.
6-3, an additive training scheme is adopted when training the XGboost classifier model, i.e., the best tree model is added to the classifier at each step. The predicted value of the i-th sample at the t-th iteration is ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i).
The objective function at the t-th iteration is calculated as follows:
Obj^(t) = Σ_{i=1..n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)
where Ω represents the regularization term, Ω(f_t) = γT + (1/2) · λ · Σ_{j=1..T} ω_j², and l denotes the loss function; the squared loss l(y_i, ŷ_i) = (y_i − ŷ_i)² is used in the present invention.
and (7) training a classifier model, selecting the code annotations of 9 items as a training set, using the annotations of the remaining 1 item as a test set, and representing the label of the code annotation indicating the design technical liability in the training set by 0, the label of the code annotation indicating the demand technical liability by 1 and the label of the code annotation indicating the defect technical liability by 2. And finally obtaining a trained classifier model through continuous iteration and optimization of the model.
Step (8): classification prediction. For a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text as a vector over the selected features. Each vector is fed into the classifier model, which computes a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
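Continuing the sketch above, prediction for a brand-new annotation reuses the fitted vectorizer, selector and model; the class-name mapping is illustrative.

# Assumes vectorizer, selector and model from the previous sketch are in scope.
label_names = {0: "design TD", 1: "requirement TD", 2: "defect TD"}

new_comment = ["todo this whole module needs a redesign"]   # assumed already preprocessed
x_new = selector.transform(vectorizer.transform(new_comment))
scores = model.predict_proba(x_new)[0]                      # predicted score for every class
print(label_names[int(scores.argmax())], scores)            # the largest score wins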

Claims (3)

1. The self-acceptance technology debt multi-classification method based on XGboost is characterized by comprising the following steps:
step (1): acquire a code annotation set S = (S1, S2, …, Sn) from the dataset, where n is the number of code annotations over all classes and each sample is denoted Si = <id, comment, LB>, i = 1, 2, …, n, wherein id represents the number of the code annotation, comment represents the text of the code annotation, and LB represents the label of the code annotation, i.e. the type of technical debt;
step (2): preprocess the comment field of each sample Si;
firstly, completely same samples in an original data set are filtered by utilizing a character string full matching and cosine similarity calculation method;
then, deleting the historical version record contained in the code annotation;
finally, deleting noise information in the code annotation, wherein the noise information comprises numbers, punctuations, URLs, source codes and stop words; converting all words into a lower case letter form;
after preprocessing, each sample is Si = <id, preComment, LB>, wherein preComment represents the text of the preprocessed code annotation;
step (3) data enhancement is carried out on the text information of the preprocessed code annotation;
augmenting the code annotation text of requirement technical debt and defect technical debt by adopting the random swap and random shuffle strategies of the EDA method;
step (4) calculating the weight of each feature in the sample by using a chi-square statistical method, sequencing the features from large to small according to the weight values, and selecting s features with the largest weight;
step (5): use the CountVectorizer method to represent the text of all code annotations as an n×s word frequency matrix FM, wherein element FM[i][j] indicates the number of occurrences of the j-th word in the text of the i-th code annotation, i = 1, 2, …, n, j = 1, 2, …, s;
step (6): construct the XGboost-based classifier model;
firstly, according to the word frequency matrix FM, each sample Si in the code annotation set is represented as Si = (xi, yi), wherein xi = {FM[i][1], FM[i][2], …, FM[i][s]} and yi is the corresponding class label;
then, calculating the predicted values of all code annotations;
finally, training the classifier model additively, adding the best tree model to the classifier at each step;
step (7): train the classifier model by adopting leave-one-out cross-validation;
assuming the dataset contains p projects, the code annotations of p−1 projects are selected as the training set and the code annotations of the remaining project are used as the test set; in the training set, the label of a design technical debt annotation is represented by 0, the label of a requirement technical debt annotation by 1, and the label of a defect technical debt annotation by 2; through continuous iteration and optimization of the classifier model, the trained classifier model is finally obtained;
step (8): classification prediction:
for a new code annotation, first preprocess its text, then perform feature selection on the preprocessed text, and finally represent the text of each code annotation as a vector over the selected features; each vector is input into the classifier model for prediction, and the classifier model outputs a predicted score of the code annotation for every class; the class label with the largest score is the predicted label of the code annotation.
2. The XGboost-based self-acceptance technology debt multi-classification method according to claim 1, wherein: step (2) also comprises establishing a stop word list, wherein the list not only comprises words of 'the', 'an', 'for' and 'a', but also takes words with the length less than 3 or more than 20 as stop words, and the stop words contained in the text information of the code annotation are deleted according to the stop word list.
3. The XGboost-based self-acceptance technology debt multi-classification method according to claim 1, wherein: in step (3), considering that the generated samples may have negative influence on the classifier, the class interval is used to evaluate the generated samples, and the sample with the largest distance to all the samples in all the classes is selected.
CN202110081268.2A 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method Active CN112748951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110081268.2A CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110081268.2A CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Publications (2)

Publication Number Publication Date
CN112748951A CN112748951A (en) 2021-05-04
CN112748951B true CN112748951B (en) 2022-04-22

Family

ID=75652763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110081268.2A Active CN112748951B (en) 2021-01-21 2021-01-21 XGboost-based self-acceptance technology debt multi-classification method

Country Status (1)

Country Link
CN (1) CN112748951B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism
CN111782807A (en) * 2020-06-19 2020-10-16 西北工业大学 Self-acceptance technology debt detection and classification method based on multi-method ensemble learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Classification Model of Equipment Support Personnel Based on Collaborative Filtering and Xgboost Algorithm; Jianqiao Sun et al.; 2017 International Conference on Computer Systems, Electronics and Control (ICCSEC); 2017-12-27; full text *
A malicious HTTP request identification method based on XGBoost; Xu Di; 《电信工程技术与标准化》; 2018-12-15 (Issue 12); full text *
Research on technical debt management in software integrated development environments; Liu Ya et al.; 《计算机科学》; 2017-11-15 (Issue 11); full text *

Also Published As

Publication number Publication date
CN112748951A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN107729468B (en) answer extraction method and system based on deep learning
CN106776538A The information extracting method of enterprise's noncanonical format document
CN110188197B (en) Active learning method and device for labeling platform
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN111104526A (en) Financial label extraction method and system based on keyword semantics
JP2005222532A5 (en)
CN108550054B (en) Content quality evaluation method, device, equipment and medium
CN115062148B (en) Risk control method based on database
CN112070138A (en) Multi-label mixed classification model construction method, news classification method and system
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers
CN111462752A Client intention identification method based on attention mechanism, feature embedding and BI-LSTM
CN109710725A (en) A kind of Chinese table column label restoration methods and system based on text classification
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
CN111666748B (en) Construction method of automatic classifier and decision recognition method
CN112286799A (en) Software defect positioning method combining sentence embedding and particle swarm optimization algorithm
CN112748951B (en) XGboost-based self-acceptance technology debt multi-classification method
Tang et al. Enriching feature engineering for short text samples by language time series analysis
CN117235253A (en) Truck user implicit demand mining method based on natural language processing technology
WO2018220688A1 (en) Dictionary generator, dictionary generation method, and program
CN112069322B (en) Text multi-label analysis method and device, electronic equipment and storage medium
CN114239576A (en) Issue label classification method based on topic model and convolutional neural network
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN115481240A (en) Data asset quality detection method and detection device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant