CN113656279B - Code smell detection method based on residual network and metric attention mechanism

Info

Publication number: CN113656279B
Application number: CN202110732549.XA
Authority: CN
Inventors: 张杨; 东春浩
Original assignee: Hebei University of Science and Technology
Current assignee: Hebei University of Science and Technology
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-07-21
Anticipated expiration: 2041-06-30
Also published as: CN113656279A

Abstract

The invention relates to a code smell detection method based on a residual network and a metric attention mechanism. The iPlasma tool is used to analyze 20 applications, obtain the structural information and labels of the classes and methods in each program, and generate datasets for the Brain Class and Brain Method smells. A residual network extracts feature information at different levels, and a feature attention mechanism reassigns weights to the different features. The model is trained and evaluated on the two smell datasets; after training, it judges from the structural information of a class or method whether a code smell is present. The invention publishes two smell datasets and improves the accuracy of detecting the Brain Class and Brain Method smells, thereby helping developers locate design defects in programs more accurately.

Description

Code smell detection method based on residual network and metric attention mechanism

Technical Field

The invention relates to the field of computer software maintenance and evolution, and provides a code smell detection method based on a residual network and a metric attention mechanism.

Background

Software maintenance and evolution is a complex activity in which developers continually modify source code to accommodate new requirements or to repair design flaws found in the software. Such work is typically done under tight deadlines, and developers are often forced to set aside good programming practices and guidelines in order to deliver on time. This can lead to technical debt, that is, design problems that may harm the future maintainability of the system.

Refactoring optimizes software: it improves design quality without changing the external behavior of the software, and thereby improves maintainability and extensibility. A key step in software refactoring is deciding where to apply it. To make these refactoring opportunities easier to identify, researchers proposed the concept of the code smell to describe design flaws in software. Code smell detection has become an established way to discover problems in source code and correct them through refactoring. Research on code smell detection methods has also become a research hotspot and has greatly advanced the field of software refactoring.

Currently, most conventional code smell detection methods rely on manually designed heuristic rules to decide whether a smell is present. Identifying code smells by hand is tedious and laborious for programmers. In addition, heuristic rules must be formulated with the help of experienced researchers, so results are inconsistent across detection tools because of the subjectivity of their developers.

To address the problems of the conventional methods, various automatic or semi-automatic methods have been applied to code smell detection, such as support vector machines, J-48, and decision trees, which are used to build complex mappings from code metrics and lexical similarity to predictions. However, empirical studies indicate that these machine learning-based code smell detection methods have key limitations and deserve further investigation. In contrast to these machine learning algorithms, deep neural networks can automatically extract features useful for code smell detection from source code and build complex mappings between these features and the labels.

Although many promising techniques have been proposed, some problems remain. First, existing work has focused mainly on popular code smells such as Feature Envy, God Class, and Long Method, while little research has been done on Brain Class and Brain Method. Second, the accuracy of existing methods is not satisfactory and can be further improved. Third, there is no publicly available dataset that can be used to detect these two code smells. Therefore, how to construct Brain Class and Brain Method datasets for training deep learning models has become an increasingly urgent problem.

Disclosure of Invention

The invention aims to provide a code smell detection method based on a residual network and a metric attention mechanism that saves time and achieves high accuracy.

The invention adopts the following technical scheme:

A code smell detection method based on a residual network and a metric attention mechanism, comprising the steps of:

(1) Generating a code smell dataset;

(2) Data balancing;

(3) Constructing a MARS model, wherein the MARS model comprises a convolution layer, a normalization layer, a ReLU layer, a residual network, an average pooling layer and a fully connected layer, the residual network comprises a plurality of residual blocks, and a metric attention mechanism is introduced into each residual block;

(4) Training the MARS model;

(5) Using the trained MARS model to judge whether the input information exhibits a code smell.

In step (1), GitHub is used as the open source corpus, 20 programs from different domains are analyzed with the iPlasma tool, and code structure information is extracted for 13 method-level metrics and 9 class-level metrics. At the same time, instances of the two code smells, Brain Class and Brain Method, are obtained and marked to generate labels, where 0 indicates no code smell and 1 indicates a code smell. The code metric information and the labels are combined to generate the datasets for the two code smells.

The data balancing method in step (2) is specifically as follows: the SMOTE algorithm is applied to increase the number of samples that contain a code smell. For each code smell sample X, the distances to the other code smell samples are calculated using Euclidean distance, and the n nearest neighbors of the sample are selected. The sampling ratio is set to 5:2, and several samples are randomly selected from the n neighbors. Denoting a selected neighbor as K, a new sample is constructed from each selected neighbor and the original sample according to the formula X(new) = X + rand(0, 1) × (X - K).

In step (3), each residual block consists of two convolution layers and a skip connection. Features in the code structure information are extracted by the CNN, normalization is performed with Batch Normalization, and ReLU is used as the activation function. A metric attention mechanism is added at the end of each residual block.

The metric attention mechanism in step (3) is constructed as follows. First, average pooling compresses the features extracted from the multiple metrics into a C-dimensional channel vector, so that the global spatial feature of each channel serves as the representation of that channel. Second, a weight matrix is calculated: a fully connected layer produces a C/n-dimensional vector, tanh activation is applied, a second fully connected layer converts the C/n-dimensional vector back into a C-dimensional vector, and sigmoid activation maps the values to between 0 and 1, which gives the weight matrix. Finally, the weight matrix is multiplied by the feature information to recalculate how important each feature is for detecting code smells, and weights are reassigned to the features according to that importance, so that features important for code smell detection receive larger weights.

The MARS model is trained as follows: the dataset is divided into a training set and a test set; the parameters are updated continuously by calculating the error between the output value and the label, which finally yields a trained classifier; and, through cross-validation, one part of the training set is randomly selected as the validation set to check the performance of the model and prevent overfitting.

The beneficial effects of the invention are as follows: the invention supports detection of the two smells Brain Class and Brain Method, spares programmers from manually checking source code for these two code smells, and saves time.

For detecting the Brain Class and Brain Method smells, the average accuracy of the method of the invention is more than 2% higher than that of existing code smell detection methods.

Drawings

Fig. 1 is a general framework of the method of the invention.

Fig. 2 is a schematic diagram of a modified residual network.

Fig. 3 is a schematic diagram of the original and modified residual blocks.

Fig. 4 is a schematic diagram of the metric attention mechanism.

Fig. 5 is a schematic diagram of an application example of the detection model.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the embodiments and the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the application or its uses. All other embodiments obtained by a person of ordinary skill in the art from the present disclosure without creative effort fall within the scope of the present disclosure.

The code smell detection method based on the residual network and the metric attention mechanism detects the Brain Class and Brain Method smells through the following steps: code smell dataset generation, data balancing, improved residual network construction, metric attention construction, and model training and evaluation.

As shown in Fig. 1, because the model must be trained and tested, datasets for the two code smells, Brain Class and Brain Method, are generated first. 20 open source applications are analyzed with the iPlasma tool, the structural information and labels of all classes and methods in each program are extracted to generate the datasets, and the datasets are balanced with the SMOTE algorithm.

The model consists of a residual network and a metric attention mechanism; the residual network extracts deeper feature information at different levels. The constructed model is trained on the training set, the validation set is used to prevent overfitting, and the test set is used to evaluate the performance of the model. The combined residual network and metric attention model is then used to judge whether a given code smell exists in a program.

1. Code smell dataset generation

With GitHub as the open source corpus, 20 programs from different domains are analyzed with the iPlasma tool, and code structure information is extracted for 13 method-level metrics and 9 class-level metrics. At the same time, instances of the two code smells, Brain Class and Brain Method, are obtained and marked to generate labels, where 0 indicates no code smell and 1 indicates a code smell. The code metric information and the labels are combined to generate the datasets for the two code smells. The datasets are used to train, validate and evaluate the proposed method, to determine whether it performs better at detecting code smells.
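As a minimal sketch of how metric vectors and labels might be combined into one dataset row per class or method, the following Python fragment joins a table of metric values with a set of entities flagged as smelly. The names `Sample` and `build_dataset`, the entity names, and all metric values are hypothetical and chosen for illustration; the text does not describe the export format of the analysis tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Sample:
    """One dataset row: the metric values of a class or method plus a binary label."""
    metrics: List[float]
    label: int  # 0 = no code smell, 1 = code smell


def build_dataset(metric_rows: Dict[str, List[float]], smelly: Set[str]) -> List[Sample]:
    """Attach label 1 to every entity flagged as smelly and label 0 to the rest."""
    return [Sample(list(values), 1 if name in smelly else 0)
            for name, values in metric_rows.items()]


# Hypothetical metric values for three classes (not taken from the patent).
rows = {
    "ClassA": [310.0, 47.0, 12.0],
    "ClassB": [40.0, 5.0, 2.0],
    "ClassC": [12.0, 1.0, 0.0],
}
dataset = build_dataset(rows, smelly={"ClassA"})
labels = [sample.label for sample in dataset]
print(labels)
```

Keeping the label next to the metric vector in one record makes the later balancing and splitting steps operate on whole rows, so a metric vector cannot be separated from its label.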

2. Data balancing

The SMOTE algorithm is applied to increase the number of samples that contain a code smell. For each code smell sample X, the distances to the other code smell samples are calculated using Euclidean distance, and the n nearest neighbors of the sample are selected. The sampling ratio is set to 5:2, and several samples are randomly selected from the n neighbors. Denoting a selected neighbor as K, a new sample is constructed from each selected neighbor and the original sample according to the formula X(new) = X + rand(0, 1) × (X - K). This balances the number of smelly and non-smelly samples, which addresses both the low training efficiency and the drop in overall model performance caused by one class having far more data than the other.
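The oversampling step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the invention: it uses the standard SMOTE interpolation X + rand(0, 1) × (K - X), which places the new sample on the segment between X and its neighbor K, whereas the formula as printed in the text has the difference written the other way round. The function names, the parameter `n_new`, and the sample values are hypothetical, and the 5:2 sampling ratio is not modeled.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_neighbors, n_new, rng):
    """Create n_new synthetic samples from the minority (smelly) samples."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The n nearest smelly neighbors of x, excluding x itself.
        others = [s for s in minority if s is not x]
        neighbors = sorted(others, key=lambda s: euclidean(x, s))[:n_neighbors]
        k = rng.choice(neighbors)
        gap = rng.random()  # rand(0, 1)
        # Assumption: standard SMOTE form, interpolating from x towards k.
        synthetic.append([xi + gap * (ki - xi) for xi, ki in zip(x, k)])
    return synthetic


rng = random.Random(0)
smelly = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]  # hypothetical values
new_samples = smote(smelly, n_neighbors=2, n_new=6, rng=rng)
print(len(new_samples))
```

Because each synthetic sample lies between two existing smelly samples, it stays inside the region already occupied by the minority class rather than duplicating existing rows.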

3. Improved residual network construction

Each residual block consists of two convolution layers and a skip connection. Features in the code structure information are extracted by the CNN, normalization is performed with Batch Normalization, and ReLU is used as the activation function. The skip connection passes already-learned features through without retraining them, which reduces the number of parameters to train and allows a deeper model to be trained. Finally, a metric attention mechanism is added to each residual block: the feature information obtained at different levels by each residual block is used to recalculate the weights, and the weights are redistributed to improve the accuracy of code smell detection.

Fig. 2 shows the modified residual network, which comprises 17 convolutional layers and 1 dense layer. The residual blocks all have the same structure, except that the number of channels increases and the output size decreases from block to block. The invention introduces a metric attention mechanism into each residual block. An average pooling layer is used to reduce the computational cost of the network. The fully connected layer serves as the classifier for the whole convolutional neural network. The output layer has a single neuron, which judges whether a smell is present from the learned input structural information.
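One layout that would be consistent with the stated count of 17 convolutional layers is a ResNet-18-style arrangement: one stem convolution followed by eight residual blocks of two convolutions each. The sketch below only checks that arithmetic. The stage layout `BLOCKS_PER_STAGE` is an assumption; the text states the totals and the two convolutions per block, but not how the blocks are grouped.

```python
# Assumed ResNet-18-style layout. Only the totals (17 convolution layers,
# 1 dense layer) and "two convolution layers per block" come from the text.
STEM_CONVS = 1                   # the initial convolution layer
BLOCKS_PER_STAGE = [2, 2, 2, 2]  # assumption: four stages of two residual blocks
CONVS_PER_BLOCK = 2              # from the text
DENSE_LAYERS = 1                 # the fully connected classifier

total_blocks = sum(BLOCKS_PER_STAGE)
total_convs = STEM_CONVS + total_blocks * CONVS_PER_BLOCK
print(total_blocks, total_convs, DENSE_LAYERS)
```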

Fig. 3 depicts the residual block before and after modification. The block consists of two parts, the residual mapping and the direct (identity) mapping. In the residual part, structural features are extracted by two convolution layers, whose kernel weights are learned so that the desired features are extracted according to the objective function. The skip connection passes the already-learned features through directly, which reduces the number of parameters to train and allows a deeper model to be trained. The ReLU activation function mitigates vanishing gradients, adds nonlinearity to the network, and speeds up training. Normalization is applied to the feature information, which addresses internal covariate shift, alleviates gradient saturation, and accelerates convergence. After the second normalization layer of the original residual block, a metric attention mechanism is introduced: it scales the input features twice to obtain weight coefficients and then redistributes the weights, so that important feature information receives a larger weight.
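The order of operations in the modified block can be sketched in plain Python as follows. This is a simplified illustration, not the patented implementation: features are a list of channels, each a list of values; the convolution is one-dimensional with kernel size 3 and zero padding; `normalize` is a per-channel standardization standing in for batch normalization without learned scale or shift; and `attention` is a placeholder callable that defaults to the identity. All of these choices and names are assumptions.

```python
import math


def conv1d(x, weights):
    """x: [C_in][L]; weights: [C_out][C_in][3]; zero padding keeps length L."""
    length = len(x[0])
    out = []
    for w_out in weights:
        row = []
        for i in range(length):
            acc = 0.0
            for c, kernel in enumerate(w_out):
                for j, wj in enumerate(kernel):
                    pos = i + j - 1
                    if 0 <= pos < length:
                        acc += wj * x[c][pos]
            row.append(acc)
        out.append(row)
    return out


def normalize(x, eps=1e-5):
    """Per-channel standardization, a stand-in for batch normalization."""
    out = []
    for ch in x:
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        out.append([(v - mean) / math.sqrt(var + eps) for v in ch])
    return out


def relu(x):
    return [[max(0.0, v) for v in ch] for ch in x]


def residual_block(x, w1, w2, attention=lambda f: f):
    """conv -> norm -> ReLU -> conv -> norm -> attention, then add the input."""
    f = relu(normalize(conv1d(x, w1)))
    f = normalize(conv1d(f, w2))
    f = attention(f)  # metric attention sits after the second normalization
    summed = [[a + b for a, b in zip(fc, xc)] for fc, xc in zip(f, x)]
    return relu(summed)


x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, 1.5, 2.0]]
zero = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
identity_out = residual_block(x, zero, zero)
print(identity_out == x)
```

With all-zero kernels the residual branch contributes nothing and the block returns its non-negative input unchanged, which illustrates the direct mapping: the block only has to learn a correction on top of the features it receives.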

4. Metric attention construction

Fig. 4 illustrates the metric attention mechanism.

In the first step, average pooling compresses the features extracted from the multiple metrics into a C-dimensional channel vector, so that the global spatial feature of each channel serves as the representation of that channel.

In the second step, the weight matrix is calculated. A fully connected layer first produces a C/n-dimensional vector, with tanh as the activation function to accelerate the convergence of the model. A second fully connected layer then converts the C/n-dimensional vector back into a C-dimensional vector, and sigmoid activation maps the values to between 0 and 1, which gives the weight matrix.

Finally, the weight matrix is multiplied by the feature information to recalculate how important each feature is for detecting code smells. The feature information extracted at different levels is thus scaled twice, and the two scaling steps not only reduce the amount of computation in the network but also yield the weight of each feature. The final fully connected layer is intended to improve the adaptability of the network and to compensate for the limited nonlinearity available when there are few fully connected layers, which further improves the learning efficiency and nonlinear expressiveness of the model.
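The three steps above can be sketched in plain Python as follows. The sketch assumes that features are a list of C channels, that the two fully connected layers have no bias terms, and that the weights are random placeholders rather than trained values; the function name `metric_attention` and the parameter names are hypothetical.

```python
import math
import random


def metric_attention(features, w_down, w_up):
    """features: [C][L]; w_down: [C/n][C]; w_up: [C][C/n].

    Returns the reweighted features and the per-channel weights.
    """
    # Step 1: average pooling squeezes each channel to a single value.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Step 2: fully connected layer to C/n dimensions with tanh activation ...
    hidden = [math.tanh(sum(w * s for w, s in zip(row, squeezed)))
              for row in w_down]
    # ... then fully connected back to C dimensions with sigmoid, giving (0, 1).
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w_up]
    # Step 3: multiply every channel by its weight.
    reweighted = [[wt * v for v in ch] for wt, ch in zip(weights, features)]
    return reweighted, weights


rng = random.Random(0)
C, n = 4, 2  # hypothetical channel count and reduction factor
features = [[rng.random() for _ in range(5)] for _ in range(C)]
w_down = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // n)]
w_up = [[rng.uniform(-1, 1) for _ in range(C // n)] for _ in range(C)]
reweighted, weights = metric_attention(features, w_down, w_up)
print([round(w, 3) for w in weights])
```

The reduction from C to C/n dimensions keeps the two fully connected layers small, and the sigmoid output means each channel is scaled down by a factor between 0 and 1 rather than replaced.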

5. Model training and evaluation

The dataset is divided into a training set and a test set in a 7:3 ratio. The training set is used as the input of the model, with 50 iterations and a batch size of 100, and the labels are used as the expected output of the model. The parameters are updated continuously by calculating the error between the output value and the label, which finally yields the trained classifier. Through cross-validation, one part of the training set is randomly selected as the validation set to check the performance of the model and prevent overfitting. The trained model is then tested on the test set, and its performance is measured by three indicators: accuracy, precision, and F1 score.
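The 7:3 split and the three evaluation indicators can be sketched in plain Python as follows. The function names and the example labels are hypothetical; the training loop itself is omitted because the text does not specify the optimizer or loss function.

```python
import random


def split_7_3(samples, rng):
    """Shuffle the samples and split them into 70% training and 30% test data."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred):
    """Return accuracy, precision and F1 for binary labels (1 = code smell)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, f1


train, test = split_7_3(list(range(10)), random.Random(0))
# Hypothetical labels and predictions for five test samples.
accuracy, precision, f1 = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(len(train), len(test), accuracy, precision, f1)
```

F1 combines precision with recall, so it penalizes a model that reaches high accuracy on a dataset dominated by non-smelly samples simply by predicting label 0 most of the time.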

6. Application instance

Fig. 5 shows an application example of the trained model. For the Player class in the open source program redox, 13 metrics of the class are obtained as the input of the model; among them, the number of lines of code is 231, the cyclomatic complexity of the class is 59, and the coupling between object classes is 14. The model analyzes this input and judges that the Player class has the Brain Class smell.

Claims

1. A code smell detection method based on a residual network and a metric attention mechanism, characterized in that it comprises the following steps:

(1) Generating a code smell dataset;

(2) Data balancing;

(3) Constructing a MARS model, wherein the MARS model comprises a convolution layer, a normalization layer, a ReLU layer, a residual network, an average pooling layer and a fully connected layer, the residual network comprises a plurality of residual blocks, and a metric attention mechanism is introduced into each residual block;

the metric attention mechanism is constructed as follows: first, average pooling compresses the features extracted from the multiple metrics into a C-dimensional channel vector, so that the global spatial feature of each channel serves as the representation of that channel; second, a weight matrix is calculated, in which a fully connected layer produces a C/n-dimensional vector, tanh activation is applied, a second fully connected layer converts the C/n-dimensional vector back into a C-dimensional vector, and sigmoid activation maps the values to between 0 and 1 to give the weight matrix; finally, the weight matrix is multiplied by the feature information to recalculate how important each feature is for detecting code smells, and weights are reassigned to the features according to that importance, so that features important for code smell detection receive larger weights;

(4) Training the MARS model;

(5) Using the trained MARS model to judge whether the input information exhibits a code smell.

2. The code smell detection method based on a residual network and a metric attention mechanism according to claim 1, wherein in step (1) GitHub is used as the open source corpus, 20 programs from different domains are analyzed with the iPlasma tool, and code structure information is extracted for 13 method-level metrics and 9 class-level metrics; at the same time, instances of the two code smells, Brain Class and Brain Method, are obtained and marked to generate labels, where 0 indicates no code smell and 1 indicates a code smell, and the code metric information and the labels are combined to generate the datasets for the two code smells.

3. The code smell detection method based on a residual network and a metric attention mechanism according to claim 1, wherein the data balancing method of step (2) is specifically as follows: the SMOTE algorithm is applied to increase the number of samples that contain a code smell; for each code smell sample X, the distances to the other code smell samples are calculated using Euclidean distance, and the n nearest neighbors of the sample are selected; the sampling ratio is set to 5:2, and several samples are randomly selected from the n neighbors; denoting a selected neighbor as K, a new sample is constructed from each selected neighbor and the original sample according to the formula X(new) = X + rand(0, 1) × (X - K).

4. The code smell detection method based on a residual network and a metric attention mechanism according to claim 1, wherein in step (3) each residual block consists of two convolution layers and a skip connection, features in the code structure information are extracted by the CNN, normalization is performed with Batch Normalization, and ReLU is used as the activation function; a metric attention mechanism is added at the end of each residual block.

5. The code smell detection method based on a residual network and a metric attention mechanism according to claim 1, wherein the MARS model is trained as follows: the dataset is divided into a training set and a test set; the parameters are updated continuously by calculating the error between the output value and the label, which finally yields a trained classifier; and, through cross-validation, one part of the training set is randomly selected as the validation set to check the performance of the model and prevent overfitting.