CN114398076A - Method-naming smell detection method for object-oriented programs based on deep learning - Google Patents

Info

Publication number
CN114398076A
Authority
CN
China
Prior art keywords
node
name
abstract syntax
training
syntax tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210059016.4A
Other languages
Chinese (zh)
Inventor
吴小囡
苏航
高红雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210059016.4A priority Critical patent/CN114398076A/en
Publication of CN114398076A publication Critical patent/CN114398076A/en
Pending legal-status Critical Current

Classifications

    • G06F 8/72 Code refactoring (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 8/00 Arrangements for software engineering › G06F 8/70 Software maintenance or management)
    • G06F 8/42 Syntactic analysis (G06F 8/00 › G06F 8/40 Transformation of program code › G06F 8/41 Compilation)
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks (G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/045 Combinations of networks (G06N 3/04)
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (G06N 3/04)
    • G06N 3/084 Backpropagation, e.g. using gradient descent (G06N 3/02 › G06N 3/08 Learning methods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep-learning-based method-naming smell detection method for object-oriented programs. It treats the method name as a translation of the method body and the method's context information: a suitable method name is generated ("translated") from the input method body and method-context sequence, and the generated name is then used to detect whether the actual method name is a naming smell. The method comprises five steps: initialization, information extraction, pre-training, preprocessing, and method-name prediction model construction. It jointly considers the context information and the method-body information on which a method name depends, and applies a different processing mode to each according to its structural characteristics. The model design likewise takes the structure of the input information into account.

Description

Method-naming smell detection method for object-oriented programs based on deep learning
Technical Field
The invention relates to the field of code-quality analysis for computer software, and in particular to a deep-learning-based method-naming smell detection method.
Background
In software development, developers usually focus on whether the code runs correctly and often ignore code characteristics that may cause software-quality problems. Any symptom in code that may indicate a deeper problem is called a code smell. Among the code smells that remain poorly addressed are confusing names, i.e., naming smells of identifiers. A naming smell refers mainly to a confusing or ambiguous name of an identifier in code, such as a class name, method name, or variable name; such a name fails to summarize clearly what the named entity does. Related research has found that about 70% of the source code of a software system consists of identifiers, and in practice most developers tend to skip writing documentation, so readers must rely on the source code and comments to understand it. Good identifiers reduce the burden of software maintenance on programmers and lower its later cost. Proper naming of identifiers therefore benefits later software upgrading and maintenance.
Researchers have proposed several approaches to existing code-naming smells. Traditional approaches, such as logic/rule-based and context/feedback-based methods, can only judge from the syntactic structure of a name or identify naming smells semi-automatically. With deep learning applied to this field, automatically detecting the semantics of names has become possible. Deep-learning-based methods differ mainly in their input information and model design. Almost all of them take the method body as input; some add extra information such as the class name and class member-variable names and achieve better results. Overall, however, the added information is fragmented, and information outside the method body is not used fully and effectively. Regarding the structure of the input method body, most methods keep the original token sequence of the method body and ignore the structural information in the code. On the model side, there remains considerable room to improve the final results of existing deep learning models. In summary, although existing deep learning methods improve on earlier work in several respects, their effectiveness still has substantial room for improvement. To detect method-naming smells more accurately, the invention proposes method-naming smell detection for object-oriented programs based on a TBCNN + bidirectional LSTM neural network model.
Disclosure of Invention
To improve the accuracy of method-naming smell detection, the invention provides naming-smell detection for object-oriented program methods based on a TBCNN + bidirectional LSTM neural network model. By predicting method names, the method improves the accuracy of identifying naming smells.
To achieve the purpose of the invention, the adopted technical scheme is summarized as follows:
A method-naming smell detection method for object-oriented programs based on a TBCNN + bidirectional LSTM neural network model, whose input is a code data set and whose output is whether the method names contain a smell. The method mainly comprises the following steps:
Step 1: Initialization
The parameters required by the method are initialized as follows:
Ml: maximum method-context length, 50 by default; contexts longer than Ml are truncated;
Ms: maximum number of subtrees in the abstract syntax tree of a method, 50 by default;
Mn: maximum number of nodes in a subtree, 70 by default;
Mc: maximum number of child nodes of a node in a subtree, 10 by default;
Bs: amount of data read at a time, i.e. the batch size, 128 by default;
Nf: dimension of the feature vectors, 100 by default;
Es: number of rounds to continue after training performance stops improving before stopping, 5 by default;
Em: maximum number of training rounds, 100 by default.
(These parameters are summarized in the sketch below.)
The data structures required by the method are initialized as follows:
(1) Construct the method data set D, D = {d_i | 1 ≤ i ≤ n}, where n is the total number of data samples and d_i is the triple (T_i, M_i, A_i) of the i-th method. T_i is the preprocessed method-context word set of d_i, written T_i = {t_ij | 1 ≤ j ≤ t_n}, where t_ij is the j-th word of the i-th method's context T_i and t_n is the number of words in T_i. M_i is the preprocessed method-body information set of d_i, written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the abstract syntax tree of the i-th method, with maximum value Ms. Each m_j is a quadruple (Npre_j, Cpre_j, Nlev_j, Clev_j), whose items are, respectively, the preorder-traversal node set, the preorder-traversal child-node set, the level-order-traversal node set, and the level-order-traversal child-node set of the j-th abstract-syntax subtree of the i-th method. Npre_j can be written Npre_j = {nr_k | 1 ≤ k ≤ n_p}, where nr_k is the name of the k-th node of Npre_j and n_p, the number of nodes in Npre_j, has maximum value Mn. Cpre_j can be written Cpre_j = {NC_k | 1 ≤ k ≤ n_p}, where NC_k is the child-node set of the k-th node of the j-th preorder subtree of the i-th method, written NC_k = {c_m | 1 ≤ m ≤ n_c}, where n_c is the number of child nodes in NC_k, c_m is its m-th child node, and n_c has maximum value Mc. Nlev_j and Clev_j are defined analogously to Npre_j and Cpre_j. A_i is the preprocessed method-name word set of d_i, written A_i = {a_j | 1 ≤ j ≤ t_a}, where a_j is the j-th word of the i-th method's name and t_a is the number of words in the method name.
(2) Construct the abstract-syntax-tree node-type pre-training set P, P = {p_i | 1 ≤ i ≤ n_p}, where n_p is the total number of pre-training samples and p_i is the abstract-syntax-tree node set of the i-th method, written p_i = {pn_j | 1 ≤ j ≤ n_n}, with n_n the number of abstract-syntax-tree nodes. pn_j, the j-th node of the i-th method's abstract syntax tree, is the triple (pr_j, pp_j, PC_j), where pr_j is the node's name, pp_j is its parent node's name, and PC_j is its set of child nodes.
(3) Construct the word-vector set E, E = {e_i | 1 ≤ i ≤ n}, where e_i is the pair (w_i, v_i): w_i is a word and v_i is its pre-trained word vector. Construct the vocabulary set V = {v_i | 1 ≤ i ≤ n}, where n is the number of words in the pre-training vocabulary file, which equals the number of words in the word-embedding file. (The sketch below restates the data set D in code.)
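Read as Python types, the data set D of (1) can be sketched as follows; the class names SubtreeQuadruple and MethodSample are assumptions introduced here for readability, and children may equally be stored as node indices.

    # Illustrative container types for d_i = (T_i, M_i, A_i); names are assumed.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SubtreeQuadruple:            # one m_j
        pre_nodes: List[str]           # Npre_j: preorder node names (at most Mn)
        pre_children: List[List[str]]  # Cpre_j: children per preorder node (at most Mc)
        lev_nodes: List[str]           # Nlev_j: level-order node names
        lev_children: List[List[str]]  # Clev_j: children per level-order node

    @dataclass
    class MethodSample:                # one d_i
        context: List[str]             # T_i: context words (at most Ml)
        body: List[SubtreeQuadruple]   # M_i: one quadruple per AST subtree (at most Ms)
        name_words: List[str]          # A_i: words of the method name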
Step 2: information extraction
Step 2.1: data set information extraction
First process the data set: traverse each class in the data set, keep only classes that contain at least one method, and collect each class's name, class comment, and member-variable names as the context information of its methods. Then further process the context information: remove punctuation and special characters, and split identifiers written in a special format (such as camel case) into single words. Finally, pass the class-level context information to each method in the class.
Traverse each method in the class, obtain the method's name and comment, remove punctuation and special characters from them, and split them by format. Add the method comment to the current method's context information T_i and record the processed method-name word set as A_i. Then parse the abstract syntax tree of the method body. First traverse the tree in preorder: if the current node is a statement node, add it as the root of a subtree to the preorder subtree list; otherwise continue to the next node. This eventually yields the full subtree list representing the preorder traversal of the method. The level-order traversal follows the same logic, except that the subtrees in its subtree list are arranged in level order. Finally, store the method context, the method name, and the preorder and level-order subtree lists of the method body obtained so far. (A sketch of the subtree splitting follows.)
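The following Python sketch illustrates the splitting, assuming a generic AST node with type and children attributes; the set of statement-node types is an illustrative assumption, not the patent's list.

    # Sketch of step 2.1's subtree splitting; Node is a generic AST node.
    from collections import deque
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        type: str
        children: List["Node"] = field(default_factory=list)

    STATEMENT_TYPES = {"ExpressionStmt", "IfStmt", "ForStmt", "WhileStmt", "ReturnStmt"}  # assumed

    def preorder_subtrees(root: Node):
        """Preorder walk; every statement node becomes the root of one subtree."""
        subtrees, stack = [], [root]
        while stack:
            node = stack.pop()
            if node.type in STATEMENT_TYPES:
                subtrees.append(node)
            stack.extend(reversed(node.children))  # keep left-to-right preorder
        return subtrees

    def levelorder_subtrees(root: Node):
        """Same splitting rule, with statement nodes collected in level order."""
        subtrees, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            if node.type in STATEMENT_TYPES:
                subtrees.append(node)
            queue.extend(node.children)
        return subtrees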
Step 2.2: abstract syntax tree node pre-training data set preprocessing
Construct the abstract-syntax-tree node-type pre-training set P: traverse each method in the data set and parse its abstract syntax tree. Traverse the tree in level order, and for the j-th node of the i-th method store the result pn_j, recording the node's name pr_j, its parent's name pp_j, and the name set PC_j of its child nodes. The resulting node set of the method is p_i = {pn_j | 1 ≤ j ≤ n_n}, which is then added to P: P ← P ∪ p_i. (A sketch of this collection step follows.)
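The triple collection can be sketched as a level-order walk, reusing the generic Node of the previous sketch:

    # Level-order collection of (pr_j, pp_j, PC_j) triples for one method (step 2.2).
    from collections import deque

    def collect_node_triples(root):
        triples, queue = [], deque([(root, None)])
        while queue:
            node, parent_name = queue.popleft()
            triples.append((node.type, parent_name, [c.type for c in node.children]))
            queue.extend((child, node.type) for child in node.children)
        return triples   # p_i = {pn_j | 1 <= j <= n_n}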
Step 3: Pre-training
Step 3.1: routine vocabulary pre-training
The conventional vocabulary pre-training data set input includes method context and method volume information, which is first pre-processed before input: removing punctuation marks and special characters in the characters, and splitting the single character with a special format (such as a hump type). And constructing a word2vec preprocessing model, inputting a conventional vocabulary pre-training data set, and acquiring a word vector table E and a vocabulary table V after training. Wherein the word vector table E ═ { E ═ E }iI is more than or equal to 1 and less than or equal to n, wherein n represents the number of words of the pre-training words embedded in the file, eiCan be represented as a doublet (w)i,vi)。wiRepresenting a word, viRepresents wiA corresponding pre-training word vector. Glossary V ═ ViI is more than or equal to 1 and less than or equal to n, wherein n represents the number of words in the pre-training vocabulary file and is the same as the number of words in the word embedded file.
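A plausible realization uses gensim's word2vec (an assumption: the patent does not name a library):

    # Sketch of step 3.1 with gensim's word2vec; the corpus here is a toy stand-in
    # for the preprocessed method contexts and method bodies.
    from gensim.models import Word2Vec

    corpus = [["check", "classpath", "for", "conflict"],
              ["method", "declaration", "parameter", "boolean"]]
    model = Word2Vec(sentences=corpus, vector_size=100,  # Nf = 100
                     window=5, min_count=1, workers=4)

    E = {w: model.wv[w] for w in model.wv.index_to_key}  # word-vector table E
    V = list(model.wv.index_to_key)                      # vocabulary V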
Step 3.2: abstract syntax tree node pretraining
The pre-training steps for the nodes in the abstract syntax tree are as follows.
Step 3.2.1: the abstract syntax tree node types present in the data set are counted and all types are put in a node type list NL.
Step 3.2.2: reading B in data set PSAnd (4) batch data. For each node p thereiniAnd acquiring subscript values of the node type and the father node type corresponding to the NL, and then taking the subscript values as input values of the pre-training model. In the pre-training model, the vector of node x is represented as vec (x), where
Figure BDA0003475018700000051
Wherein N isfRepresenting the dimensions of the feature vector, c for each non-leaf node l and its immediate children1,c2,…,cnCharacterized by being respectively vec (o), vec (c)1),vec(c2),…,vec(cn),WiIs ciN of (A)f×NfA weight matrix of ciLower leaf number weighting, denoted as WiWeight value of liB is an offset value, num (x) denotes the number of x, where vec (o) and liThe calculation formula (2) is shown in formulas (1) to (2):
Figure BDA0003475018700000052
Figure BDA0003475018700000053
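In code, formulas (1)-(2) amount to the following numpy sketch; the parameters W_i and b are the learned quantities defined above, and the leaf counts are taken from the tree:

    # Numpy reading of formulas (1)-(2): approximate a parent vector from its children.
    import numpy as np

    def approx_parent_vec(child_vecs, child_leaf_counts, weights, b):
        """child_vecs: list of (Nf,) arrays; weights: list of (Nf, Nf) matrices W_i."""
        total_leaves = sum(child_leaf_counts)
        acc = np.zeros_like(b)
        for v, leaves, W_i in zip(child_vecs, child_leaf_counts, weights):
            l_i = leaves / total_leaves        # formula (2): leaf-count weighting
            acc += l_i * (W_i @ v)
        return np.tanh(acc + b)                # formula (1)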
Step 3.2.3: Build a hidden layer and a fully connected layer for node-type prediction. The hidden-layer weight W_h is an Nf × h matrix, where Nf is the feature dimension of the node types and h is the number of hidden units; the fully-connected-layer weight W_s is an h × l matrix, where l is the number of node types in NL. b_h and b_s are the biases of the hidden layer and the fully connected layer; W_h and W_s are initialized with normally distributed random numbers, and b_h and b_s are randomly initialized parameters. tanh denotes the hyperbolic-tangent activation function and f an activation function; E1 denotes the embedded representation of the input. The hidden layer is computed as in formula (3) and the fully connected layer as in formula (4):

hidden = tanh(W_h · E1 + b_h)   (3)
logits = f(hidden · W_s + b_s)   (4)
Step 3.2.4: Evaluate the training effect with a cross-entropy loss over the predicted types logits from the previous step and the true types Er, as in formulas (5)-(6). After the softmax layer, let j denote a node type in NL; logits_ij is the output of the i-th node for the j-th node type, and j ranges over the number of node types in NL, i.e. the number of classes. Er_ij is the value of Er_i at the j-th node type. When the loss no longer decreases within Es rounds during training, or the number of training rounds reaches the upper limit Em, the model terminates; otherwise read the next batch of data and repeat step 3.2.2.

softmax(logits)_ij = exp(logits_ij) / Σ_j exp(logits_ij)   (5)
loss = - Σ_i Σ_j Er_ij · log( softmax(logits)_ij )   (6)
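Formulas (5)-(6) are the usual softmax cross-entropy, e.g.:

    # Numpy reading of formulas (5)-(6): softmax over node types, then cross-entropy.
    import numpy as np

    def cross_entropy_loss(logits, er):
        """logits, er: (batch, |NL|) arrays; er holds the one-hot true types."""
        z = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
        probs = z / z.sum(axis=1, keepdims=True)                 # formula (5)
        return float(-(er * np.log(probs + 1e-12)).sum(axis=1).mean())  # formula (6)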
Step 3.3: concatenation word embedding table and word table
And (3) splicing the embedded table and the word list obtained in the steps (3.1) and (3.2) to obtain a final embedded table E and a final word list V.
Step 4: Preprocessing
The data set of step 2.1 is processed further. For each method, truncate the method context T_i of the i-th method to its first Ml words. For the method body, traverse the preorder subtree list and the level-order subtree list separately. First, traverse each subtree of the preorder subtree list in level order to obtain Npre_j and Cpre_j; then traverse each subtree of the level-order subtree list in level order to obtain Nlev_j and Clev_j; then represent them as the quadruple m_j. Thus for the i-th method the method body can be written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the method's abstract syntax tree. Finally the i-th method is written as the triple d_i = (T_i, M_i, A_i) and added to the method data set D: D ← D ∪ {d_i}.
And 5: method name prediction model construction
The model is based on the Encoder-Decoder architecture. Firstly, reading batch data in a data set, and then, aiming at each row in the batch data, performing the following operations:
step 5.1: building an embedded matrix
Input data set D for which method context set TiMethod body information set MiAnd method List word set AiEach word in (a) is characterized. For each word, the word position number is obtained from the dictionary V, and the word vector E is obtained from the word embedding table E according to the numberiE is to beiReplace the original word to obtain the final Ti,MiAnd AiIs represented by a vector of (a). In addition, because the size of the word embedding matrix is fixed, if the number of the characteristic vectors is less than the number of rows of the matrix, all the subsequent rows are filled with 0; if the number of eigenvectors is excessive, the eigenvectors exceeding the number of rows of the matrix are directly discarded.
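The lookup, padding, and truncation rule can be sketched as follows, assuming V is a word list and E a row-per-word matrix as defined in step 3:

    # Sketch of step 5.1: fixed-size embedding matrix with zero padding / truncation.
    import numpy as np

    def embed_sequence(words, V, E, max_rows):
        """V: vocabulary list; E: (|V|, Nf) embedding matrix; returns (max_rows, Nf)."""
        rows = [E[V.index(w)] for w in words if w in V][:max_rows]  # drop overflow
        out = np.zeros((max_rows, E.shape[1]))                      # zero-filled
        for i, row in enumerate(rows):
            out[i] = row
        return out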
Step 5.2: Build the abstract-syntax-tree embedding model (TBCNN)
Step 5.2.1: Define the convolution layer. A feature detector of fixed depth and size n is used, where the window contains the word vectors x_1, x_2, …, x_n of n abstract-syntax-tree nodes; the layer's weight matrix W_conv is a k × n matrix, where k is the word-embedding dimension. To handle varying numbers of child nodes, three weight matrices W_conv_t, W_conv_l and W_conv_r are used, and each W_conv,i is a linear combination of the three. With the activation function tanh and the bias term b_conv, the computation is given by formula (7):

y = tanh( Σ_{i=1..n} W_conv,i · x_i + b_conv )   (7)
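The weight mixing and summation of formula (7) can be sketched as follows; how the mixing coefficients are derived from node positions is omitted here and left to the caller:

    # Sketch of formula (7): each node's matrix is a mix of three learned matrices.
    import numpy as np

    def tbcnn_conv(window_vecs, coeffs, Wt, Wl, Wr, b_conv):
        """window_vecs: list of (k,) node vectors; coeffs: (eta_t, eta_l, eta_r) per node."""
        acc = np.zeros_like(b_conv)
        for x, (eta_t, eta_l, eta_r) in zip(window_vecs, coeffs):
            W_i = eta_t * Wt + eta_l * Wl + eta_r * Wr   # linear combination of the three
            acc += W_i @ x
        return np.tanh(acc + b_conv)                     # formula (7)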
Step 5.2.2: Define the attention layer. The convolution-layer outputs y_k are fused by weighting to produce the subtree representation vector c_i, computed as in formulas (8)-(10), where W_a, v_u and b_a are randomly initialized parameters, tanh is the activation function, and α_k are the weighting factors.

c_i = Σ_k α_k · y_k   (8)
α_k = exp(e_k) / Σ_j exp(e_j)   (9)
e_k = a(y_k, v_u) = v_u · tanh(W_a · y_k + b_a)   (10)

Step 5.2.3: Define a hidden layer and a softmax layer with weight W_h and bias b_h; the activation function is tanh, computed as in formula (11). The softmax layer follows formula (5). The result is the feature vector s_h representing one subtree.

s_h = tanh(c_i · W_h + b_h)   (11)
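In sketch form, the attention pooling of formulas (8)-(10) is:

    # Sketch of formulas (8)-(10): score, softmax, and weighted-sum the outputs.
    import numpy as np

    def attention_pool(ys, Wa, vu, ba):
        """ys: list of (h,) convolution outputs; Wa: (h, h); vu, ba: (h,)."""
        scores = np.array([vu @ np.tanh(Wa @ y + ba) for y in ys])  # formula (10)
        alphas = np.exp(scores - scores.max())
        alphas = alphas / alphas.sum()                              # formula (9)
        return sum(a * y for a, y in zip(alphas, ys))               # formula (8)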
Step 5.3: constructing bidirectional LSTM layers
Step 5.3.1: traversing list vector of subtree predecessor of ith method obtained after TBCNN
Figure BDA0003475018700000075
And subtree hierarchical traversal list vector
Figure BDA0003475018700000076
Treated as a concatenation of subtree vectors, in which
Figure BDA0003475018700000077
Respectively represent
Figure BDA0003475018700000078
The last subtree row vector. Context vector vec (T)i) Consider the concatenation of each word vector, where vec (t)n) Representing the last feature word vector. The above description is shown in formulas (12) to (14).
Figure BDA0003475018700000079
Figure BDA0003475018700000081
Figure BDA0003475018700000082
Step 5.3.2: Feed the row vectors of vec(Mi_pre), vec(Mi_lev) and vec(T_i) in order into a recurrent unit to obtain, for each row vector x_t, the corresponding output vector h_t. The computation is abstracted as formulas (15)-(20), where c_t is the state of the recurrent unit at the current time step, c_{t-1} the state at the previous step, and σ the Sigmoid activation function. The remaining W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c are randomly initialized parameters. (A bidirectional-LSTM sketch follows the formulas.)

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (15)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (16)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (17)
q_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (18)
c_t = f_t * c_{t-1} + i_t * q_t   (19)
h_t = o_t * tanh(c_t)   (20)
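Formulas (15)-(20) are the standard LSTM cell equations; in practice steps 5.3.2-5.3.3 can be realized with one bidirectional LSTM layer, for example in PyTorch (an implementation assumption, as the patent does not name a framework):

    # Sketch of steps 5.3.2-5.3.3 as one bidirectional LSTM.
    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=100, hidden_size=128,   # input dimension Nf = 100
                   batch_first=True, bidirectional=True)
    x = torch.randn(1, 50, 100)   # (batch, number of row vectors, Nf)
    H, _ = lstm(x)                # H[:, t] = [h_t ; h'_t], forward/backward concatenated
    print(H.shape)                # torch.Size([1, 50, 256])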
Step 5.3.3: Feed the row vectors of vec(Mi_pre), vec(Mi_lev) and vec(T_i) into the recurrent unit in reverse row order to obtain, for each reversed row vector x_t, the corresponding output vector h_t'; h_t' is computed exactly as in step 5.3.2.
Step 5.3.4: Concatenate the output vectors h_t and h_t' of steps 5.3.2 and 5.3.3 at each time step into the vector H_t, and fuse the H_t of all time steps by weighting into representation vectors: the preorder-traversal method-body vector r_pre_i, the level-order-traversal method-body vector r_lev_i, and the method-context vector r_ctx_i.
Step 5.4: constructing layers of attention
Vector obtained in step 5.3.4
Figure BDA0003475018700000088
And
Figure BDA0003475018700000089
forming a representation vector F of a final method using weighted fusioniAs shown in equations (8) to (10).
Step 5.5: building a circulating layer
Step 5.5.1: the matrix F formed in step 5.4iConsidered as a series of row vectors xiIs shown in formula (21), xtRepresenting the last row vector.
Figure BDA0003475018700000091
Step 5.5.2: Feed each row vector x_t of F_i in order into a recurrent unit to obtain the corresponding output vector h_t, computed as in formulas (15)-(20).
Step 5.6: iteration end condition
The model uses a cross entropy loss function to evaluate the training effect of the model, wherein the method nameiFor the real method name, p is the prediction vector, and the equations are shown in (5) to (6). E when Loss value Loss is in training processsWithin a round no longer decreases or the number of model execution rounds has reached the upper limit EmThe model terminates and step 5.7 is performed, otherwise the next batch of data is read, proceeding from step 5.1 again.
Step 5.6: model evaluation
And (3) enabling q to represent a model number, evaluating the model q on a verification set, acquiring the accuracy, recall rate and F1 value of the model on the prediction of the method name, and writing the accuracy, recall rate and F1 value into a log file.
Step 5.8: Method-name smell detection
Use the fully trained model to detect method-naming smells in a given code file or folder: if the original method name is semantically similar to the predicted method name, the current method name has no code smell; otherwise the predicted name is recommended as the method's name. (One plausible realization of the similarity decision is sketched below.)
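One plausible realization of this similarity decision (an assumption: the patent fixes neither the similarity measure nor a threshold) is a word-level F1 comparison:

    # Hypothetical decision rule for step 5.8: word-set F1 between names.
    def has_naming_smell(original_words, predicted_words, threshold=0.5):
        orig, pred = set(original_words), set(predicted_words)
        if not orig or not pred:
            return True
        overlap = len(orig & pred)
        f1 = 2 * overlap / (len(orig) + len(pred))
        return f1 < threshold    # dissimilar names are reported as naming smells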
Compared with the prior art, the invention has the following characteristics:
(1) Based on deep learning, the method can automatically detect the semantics of names and recommend a suitable method-name word sequence, whereas traditional methods can only judge from the syntactic structure or identify naming smells semi-automatically.
(2) The invention jointly considers the information a method name depends on, namely the method body and the method context, and applies a different processing mode to each kind of information. The method context is processed as a sequence, while the method body is processed as an abstract syntax tree; the invention newly proposes an abstract-syntax-tree processing mode that splits the tree and combines preorder-traversal and level-order-traversal subtrees. Existing deep-learning method-name prediction approaches consider a single kind of information, and even when they consider several factors they do not process them differently according to their characteristics.
(3) The invention provides a deep-learning-based method-naming smell detection model whose input is the different representations of a method and whose output is the predicted method-name word sequence. The model combines an abstract-syntax-tree embedding (TBCNN), a bidirectional LSTM, an attention mechanism, and other components, and comprehensively accounts for the characteristics of the input information.
Drawings
FIG. 1 is the overall flow diagram of the method of the invention;
FIG. 2 illustrates the splitting and traversal of an abstract syntax tree in the invention;
FIG. 3 shows the deep-learning-based method-naming smell detection model of the invention.
Detailed Description
The invention provides a deep-learning-based method-naming smell detection method that infers a method name from the input method body and method-context information and thereby detects whether the original method name is a code smell.
The method is described in detail below with a concrete implementation. For convenience of illustration, a data sample is simulated as shown in Table 1 (since the method body can occupy considerable space, it is omitted from the table):
[Table 1: simulated data sample with method context and method name; method-body information omitted]
Step 1: Initialization
Initialize the parameters and data structures, e.g. the maximum number of subtrees in an abstract syntax tree Ms = 50, the maximum number of nodes in a subtree Mn = 70, the maximum number of child nodes of a node in a subtree Mc = 10, the batch size Bs = 128, the feature-vector dimension Nf = 100, the early-stopping patience Es = 5, and so on, together with data structures such as the method data set D and the node pre-training set P.
Step 2: information extraction
Step 2.1: data set information extraction
The data set information extraction is mainly divided into two parts:
a) for the method-context and method-name information, remove special characters, convert everything to lower case, and so on;
b) for the method-body information, represent it as an abstract syntax tree, split the tree at statement nodes, arrange the subtrees by combining preorder traversal and level-order traversal, and preprocess the abstract-syntax-tree node pre-training data set.
After a), T_i = {check classpath for conflict predicate file collection classpath a task for checking the classpath for conflictig}, and the method-name word set A_i is obtained analogously. After b), the method-body information is expressed as {"pre-tree": "MethodDeclaration", "children": [{"node": "Parameter", "children": [{"node": "ClassOrInterfaceType", "children": []}]}, {"node": "Parameter", "children": [{"node": "ClassOrInterfaceType", "children": [{"node": "ClassOrInterfaceType", "children": [{"node": "ClassOrInterfaceType", "children": []}]}]}]}, {"node": "PrimitiveType", "children": [{"node": "boolean", "children": []}]}], ……}.
Step 2.2: Abstract-syntax-tree node pre-training data set preprocessing
Construct the abstract-syntax-tree node-type pre-training set P: traverse each method in the data set and parse its abstract syntax tree. Traverse each node of the tree in level order, recording the node's name, its parent's name, and the name set of its children, finally obtaining the method's node set p_i. For the example method, after these steps p_i = {{"node": "MethodDeclaration", "parent": null, "children": "Modifier,SimpleName,Parameter,VoidType,BlockStmt"}, {"node": "Modifier", "parent": "MethodDeclaration", "children": ""}, ……}
Step 3: Pre-training
Step 3.1: routine vocabulary pre-training
Inputting context information and a method body of a method into a word2vec model, and finally outputting a word vector table and a word table, wherein a word embedding table E { (final-6.27418761.51890612.5367842-0.6484467-1.84728428.301965 … …), (class-3.816071-0.146470381.5672442-3.8296738-0.08557652-0.8994526 … …), … … }, and a word table V { (final, class, … … }
Step 3.2: abstract syntax tree node pretraining
Inputting a pre-training set P, and counting the node types appearing, which are expressed as NL { 'MethodDeclaration', 'null', 'Parameter', 'VoidType', 'BlockStmt', 'Class OrfaceType', 'Expression Stmt', 'AssignExpr', 'FieldAccessExpr', 'NameExpr', 'ThisExpr', 'MarkenrAnotrationExpr', … }, then obtaining the data in the pre-training set P in batch, obtaining the node types and the parent node types at the corresponding subscript values NL, and using the obtained data as the input of the pre-training model, namely for the node name MethodDeclaration, the mapping value in NL is 0, if the data in the training set is: for { "node": Modifier "," parent ": method", "children": and "}, the inputs are 49(Modifier map value) and 0(method classification map value). And then carrying out distributed embedded representation on 49 and 0, and inputting the representation into a pre-training model for training. After the model training is completed, the embedded representation E of the node type is finally obtained, such as { (method classification 0.610464930.941424130.72419140.19248510.82615220.471950050.2165761 …), (Parameter 0.700508240.76230190.442177770.94154560.632501360.040512920.507389660.81111240.807251450.16725480.70254636 …) … }. Word list V, e.g., { MethodDeclaration, Parameter, VoidType, … … }
Step 3.3: abstract syntax tree node pretraining
And splicing the obtained word vector table and the word table.
Step 4: Preprocessing
The method-body information is processed again as follows: traverse each subtree in level order, store the traversed nodes in a node list, and record under it the child-node list corresponding to each node, which is stored in the child-node list.
For example, the finally output child-node list is [[[1,2,3], [4,0,0], [5,0,0], [6,0,0], [7,0,0], [8,0,0], ……]].
And 5: method name prediction model construction
Step 5.1: building an embedded matrix
Firstly, an embedding matrix is constructed, for each word in a method context { check passage for confluent file collection passage a task for packing the passage for conflecting } the position of the word is obtained in a dictionary, then a corresponding word vector replacement is found in a word vector table, for example, the position of the check in the dictionary is 287, a vector of the word is found in the word vector table [ -1.82929340.41434312.0705438-0.69411916-1.37561072.1083343-3.0396674-2.699084-2.02403711.22814252.3237095 … … ], the vector is added into an embedding sequence of a current context, and the final context sequence is expressed as { (-1.82929340.41434312.0705438-0.69411916-1.37561072.1083343-3.0396674-2.699084-2.02403711.22814252.3237095 … …), (0.75084424-0.7415062-2.54815320.85623363.75560572.0467503-0.7671489-1.61961321.853755-2.0131462-1.0526531), … … }.
Needle to subject body lists [ [ 'MethodDeclaration', 'Parameter', 'Primitivetype', 'ClassOrnterfaceType', 'ClassOrfaceType', 'Boolean', 'ClassOrfaceType']……]Finding the corresponding word vector according to the above logic, and finally expressing the word vector as subtonodi_pre:[[0.61046493 0.94142413 0.7241914 0.1924851 0.8261522 0.47195005 0.2165761 0.011403203 0.6280153 0.6820903 0.58157563……],[0.70050824 0.7623019 0.44217777 0.9415456 0.63250136 0.04051292 0.50738966 0.8111124 0.80725145 0.1672548 0.70254636……],……]。
Step 5.2: constructing an abstract syntax tree embedding model
The model is used for representing a sub-tree as a vector, and inputting a list of traversal nodes in the presequence of the sub-tree of the abstract syntax tree
Figure BDA0003475018700000131
And child node list
Figure BDA0003475018700000132
Outputs the sub-tree vector information, e.g. as input [ 0.610464930.941424130.72419140.19248510.82615220.471950050.21657610.0114032030.62801530.68209030.58157563 … … ]]And [ [1,2,3]],[4,0,0],[5,0,0],[6,0,0],[0,0,0],[7,0,0],[0,0,0],[8,0,0],[0,0,0]]Finally, a vector representing the sub-tree is output: [ 0.80362740.073523760.678338050.113623020.088570360.608826760.68970930.327088950.677857640.0435330870.5546485 … …]. As a preamble passWhen n subtrees exist, n vectors representing each subtree are output through the model and are recorded as subtree preorder traversal vectors. Also, node lists for hierarchical traversal
Figure BDA0003475018700000141
And child node list
Figure BDA0003475018700000142
The embedded matrix of (a) is input into the model. Finally, inputting n hierarchical traversal subtrees, outputting n vectors representing each subtree, and recording the vectors as subtree hierarchical traversal vectors.
Step 5.3: constructing bidirectional LSTM layers
The presequent traversal vector, the hierarchical traversal vector and the method context vector are respectively processed, because the processing logics are the same, the presequent traversal subtree matrix is taken as a series of subtree row vectors and is sequentially input into the circulation layer to obtain an output vector corresponding to each row vector, then the output vectors are input in a reverse direction to obtain another output vector, and finally the output vectors are weighted and fused to form a representation vector.
Step 5.4: constructing layers of attention
And (4) performing weighted fusion on the three vectors obtained in the step 5.3 to form a representation vector of the final method.
Step 5.5: building a circulating layer
And 5.4, regarding the matrix formed in the step 5.4 as the concatenation of a series of row vectors, inputting the row vectors into a loop layer to generate a word vector sequence, finding out corresponding words according to the word vectors, and finally outputting a predicted word sequence. As an example method, the sequence of the final generated method names may be can conflict, is conflict, whheter conflict, etc. And comparing the method name with the real method name, judging the similarity of the method name and the real method name, and giving a result whether the final method name contains peculiar smell.
In summary, the invention provides a deep-learning-based method-naming smell detection method. Based on the characteristics of deep learning, it analyzes the information a method name depends on and applies a different processing mode to each structure: the method context is used as a sequential input, while the method body is represented as an abstract syntax tree, which is split and whose subtrees are traversed both in preorder and in level order to obtain the final preprocessed input structure. A deep-learning method-name prediction model is then built, trained and tuned to perform the prediction of the method-name word sequence. The preprocessing considers the input information comprehensively, and the model design likewise accounts for the characteristics of each input.
The detailed description above is merely a concrete illustration of possible embodiments of the invention; it is not intended to limit the scope of the invention, and any equivalent embodiment or modification that does not depart from the technical spirit of the invention falls within its scope.

Claims (6)

1. A deep-learning-based method-naming smell detection method for object-oriented programs, characterized in that: the input is a code data set and the output is the smell detection result;
the method comprises the following steps:
(1) initialization: initialize the required parameters and construct the required data structures;
(2) information extraction: input the code data set and obtain the context information of each method, comprising the class name, the class comment, the member-variable names of the class, and the method comment; extract the abstract syntax tree of each method, then split and traverse it, and preprocess the abstract-syntax-tree node pre-training data set;
(3) pre-training: pre-train the words in the method contexts and method bodies and the node types in the abstract syntax trees, respectively;
(4) preprocessing: preprocess the method-context information and the method-body information, respectively;
(5) method-name prediction model construction: construct an Encoder-Decoder model and output the final detection result from the input data.
2. The deep-learning-based method-naming smell detection method for object-oriented programs according to claim 1, characterized in that step (1) initializes the parameters required by the method and constructs the required data structures:
(2-1) initialize the required parameters: Ml, the maximum method-context length; Ms, the maximum number of subtrees in the abstract syntax tree of a method; Mn, the maximum number of nodes in a subtree; Mc, the maximum number of child nodes of a node in a subtree; Bs, the amount of data read at a time, i.e. the batch size; Nf, the dimension of the feature vectors; Es, the number of rounds to continue after training performance stops improving before stopping;
(2-2) construct the required data structures:
construct the method data set D, D = {d_i | 1 ≤ i ≤ n}, where n is the total number of data samples and d_i is the triple (T_i, M_i, A_i) of the i-th method; T_i is the preprocessed method-context word set of d_i, written T_i = {t_ij | 1 ≤ j ≤ t_n}, where t_ij is the j-th word of the i-th method's context T_i and t_n is the number of words in T_i; M_i is the preprocessed method-body information set of d_i, written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the abstract syntax tree of the i-th method, with maximum value Ms; m_j is the quadruple (Npre_j, Cpre_j, Nlev_j, Clev_j), whose items are, respectively, the preorder-traversal node set, the preorder-traversal child-node set, the level-order-traversal node set, and the level-order-traversal child-node set of the j-th abstract-syntax subtree of the i-th method; Npre_j can be written Npre_j = {nr_k | 1 ≤ k ≤ n_p}, where nr_k is the name of the k-th node of Npre_j and n_p, the number of nodes in Npre_j, has maximum value Mn; Cpre_j can be written Cpre_j = {NC_k | 1 ≤ k ≤ n_p}, where NC_k is the child-node set of the k-th node of the j-th preorder subtree of the i-th method, written NC_k = {c_m | 1 ≤ m ≤ n_c}, where n_c is the number of child nodes in NC_k, c_m is its m-th child node, and n_c has maximum value Mc; Nlev_j and Clev_j are defined analogously to Npre_j and Cpre_j; A_i is the preprocessed method-name word set of d_i, written A_i = {a_j | 1 ≤ j ≤ t_a}, where a_j is the j-th word of the method name of the i-th method and t_a is the number of words in the method name;
construct the abstract-syntax-tree node-type pre-training set P, P = {p_i | 1 ≤ i ≤ n_p}, where n_p is the total number of pre-training samples and p_i is the abstract-syntax-tree node set of the i-th method, written p_i = {pn_j | 1 ≤ j ≤ n_n}, with n_n the number of abstract-syntax-tree nodes; pn_j, the j-th node of the i-th method's abstract syntax tree, is the triple (pr_j, pp_j, PC_j), where pr_j is the name of the node, pp_j the name of its parent, and PC_j its set of child nodes.
3. The deep-learning-based method-naming smell detection method for object-oriented programs according to claim 1, characterized in that step (2) performs information extraction: input the code data set and obtain the context information of each method, comprising the class name, the class comment, the member-variable names of the class, and the method comment; extract the abstract syntax tree of each method, then split and traverse it; and preprocess the abstract-syntax-tree node pre-training data set:
(2-1) for the method-context information, remove punctuation and special characters and split identifiers in a special format into single words; for the method-body information, represent it as an abstract syntax tree, split the tree at statement nodes, and arrange the subtrees by combining preorder-traversal and level-order-traversal subtrees;
(2-2) preprocess the abstract-syntax-tree node pre-training data set: construct the node-type pre-training set P, traverse each method in the data set, and parse its abstract syntax tree; traverse each node of the tree in level order, record the node's name, its parent's name, and the name set of its children, obtain the node-set representation of the method, and add the method's node set to the pre-training set.
4. The deep-learning-based method-naming smell detection method for object-oriented programs according to claim 1, characterized in that step (3) pre-trains the words in the method contexts and method bodies and the node types in the abstract syntax trees, respectively:
(3-1) regular vocabulary pre-training: the input comprises the method contexts and method-body information, which are first preprocessed: remove punctuation and special characters and split identifiers in a special format (such as camel case) into single words; build a word2vec model, input the regular vocabulary pre-training data set, and obtain the word-vector table and vocabulary after training;
(3-2) abstract-syntax-tree node pre-training: first count the node types NL occurring in the data set P, then obtain the index values of each node type and its parent's node type in NL; after input to the model, represent them as distributed embedding vectors, build a hidden layer and a fully connected layer for node-type prediction, and evaluate training with a cross-entropy loss; the model terminates when the loss no longer decreases during training or the number of training rounds reaches the upper limit, and the pre-trained word-vector table and vocabulary are obtained;
(3-3) concatenate the word-vector tables and vocabularies obtained in the previous two steps.
5. The deep-learning-based method-naming smell detection method for object-oriented programs according to claim 1, characterized in that step (4) preprocesses the method-context information and the method-body information, respectively:
process the data set of step (2) further; for each method, truncate the method context T_i of the i-th method to its first Ml words; for the method body, traverse the preorder subtree list and the level-order subtree list separately: first traverse each subtree of the preorder subtree list in level order to obtain Npre_j and Cpre_j, then traverse each subtree of the level-order subtree list in level order to obtain Nlev_j and Clev_j, and represent them as the quadruple m_j; thus for the i-th method the method body can be written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the method's abstract syntax tree; finally the i-th method is represented as the triple d_i = (T_i, M_i, A_i) and added to the method data set D: D ← D ∪ {d_i}.
6. The deep-learning-based method-naming smell detection method for object-oriented programs according to claim 1, characterized in that in step (5), the method-name prediction model construction builds an Encoder-Decoder model and outputs the finally predicted method-name sequence from the input data:
(5-1) build the embedding matrix: for each word in the method-context, method-body, and method-name vectors, obtain the word's position number in the dictionary and then the word vector from the word-embedding table according to that number; because the size of the word-embedding matrix is fixed, if there are fewer feature vectors than matrix rows the remaining rows are filled with 0, and if there are too many, the feature vectors beyond the number of rows are discarded;
(5-2) build the abstract-syntax-tree embedding model TBCNN, defining a convolution layer, an attention layer, a hidden layer, and a softmax layer to represent each abstract-syntax subtree as a vector;
(5-3) build the bidirectional LSTM: treat the preorder subtree-list vector vec(Mi_pre) and the level-order subtree-list vector vec(Mi_lev) of the i-th method obtained from the TBCNN as concatenations of subtree vectors, and the context vector vec(T_i) as the concatenation of the word vectors; then feed the row vectors of vec(Mi_pre), vec(Mi_lev) and vec(T_i) into a recurrent unit in order and in reverse order, and concatenate the forward and backward output vectors to obtain three fused vectors: the preorder-traversal method-body vector r_pre_i, the level-order-traversal method-body vector r_lev_i, and the method-context vector r_ctx_i;
(5-4) build the attention layer: fuse the vectors r_pre_i, r_lev_i and r_ctx_i by weighting into the final representation vector F_i of the method;
(5-5) build the recurrent layer: treat the vector F_i as a series of row vectors x_t and feed each row vector into the recurrent layer to obtain the corresponding output vector h_t;
(5-6) iteration termination: evaluate training with a cross-entropy loss; terminate training when the loss no longer decreases during training or the number of training rounds reaches the upper limit, otherwise read the next batch of data and start again from step (5-1);
(5-7) model evaluation: evaluate on the validation set, obtain the precision, recall and F1 score of the model for method-name prediction, and write them to a log file;
(5-8) method-name smell detection: use the fully trained model to detect method-naming smells in a given code file or folder; if the original method name is semantically similar to the predicted method name, the current method name has no code smell; otherwise recommend the predicted method name as the method's name.
CN202210059016.4A 2022-01-18 2022-01-18 Method-naming smell detection method for object-oriented programs based on deep learning Pending CN114398076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210059016.4A CN114398076A (en) Method-naming smell detection method for object-oriented programs based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210059016.4A CN114398076A (en) Method-naming smell detection method for object-oriented programs based on deep learning

Publications (1)

Publication Number Publication Date
CN114398076A true CN114398076A (en) 2022-04-26

Family

ID=81231963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210059016.4A Pending CN114398076A (en) Method-naming smell detection method for object-oriented programs based on deep learning

Country Status (1)

Country Link
CN (1) CN114398076A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268994A (en) * 2022-07-26 2022-11-01 中国海洋大学 Code feature extraction method based on TBCNN and multi-head self-attention mechanism


Similar Documents

Publication Publication Date Title
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111651974B (en) Implicit discourse relation analysis method and system
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN110196906A (en) Towards financial industry based on deep learning text similarity detection method
EP3864565A1 (en) Method of searching patent documents
CN113196277A (en) System for retrieving natural language documents
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN112269874A (en) Text classification method and system
CN113190219A (en) Code annotation generation method based on recurrent neural network model
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN114742069A (en) Code similarity detection method and device
CN111400449B (en) Regular expression extraction method and device
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Yu et al. Learning the relation between code features and code transforms with structured prediction
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
CN116861269A (en) Multi-source heterogeneous data fusion and analysis method in engineering field
CN114398076A (en) Method-naming smell detection method for object-oriented programs based on deep learning
CN112698831B (en) Code automatic generation quality evaluation method
CN116523402B (en) Multi-mode data-based network learning resource quality assessment method and system
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN113468875A (en) MNet method for semantic analysis of natural language interaction interface of SCADA system
Mohan Automatic repair and type binding of undeclared variables using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination