CN114398076A - Method-naming smell detection for object-oriented programs based on deep learning - Google Patents
Method-naming smell detection for object-oriented programs based on deep learning
- Publication number
- CN114398076A (application CN202210059016.4A)
- Authority
- CN
- China
- Prior art keywords
- node
- name
- abstract syntax
- training
- syntax tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F8/00—Arrangements for software engineering › G06F8/70—Software maintenance or management › G06F8/72—Code refactoring
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F8/00—Arrangements for software engineering › G06F8/40—Transformation of program code › G06F8/41—Compilation › G06F8/42—Syntactic analysis
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/04—Architecture, e.g. interconnection topology › G06N3/044—Recurrent networks, e.g. Hopfield networks
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/04—Architecture, e.g. interconnection topology › G06N3/045—Combinations of networks
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/04—Architecture, e.g. interconnection topology › G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/08—Learning methods › G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a deep-learning-based method for detecting naming smells in object-oriented program methods. It treats the method name as a translation of the method body and the method's context information: a suitable method name is translated from the input method-body and method-context sequence, and the original name is then checked for a naming smell. The method comprises five steps: initialization, information extraction, pre-training, preprocessing, and method-name prediction model construction. It comprehensively considers the context information and method-body information on which a method name depends, and applies different processing modes according to their structural characteristics. The structure of the input information is likewise taken into account in the design of the model.
Description
Technical Field
The invention relates to the field of computer software code-quality analysis, and in particular to a deep-learning-based method for detecting method-naming smells.
Background
During software development, developers usually focus on whether the code runs successfully and often overlook code that may cause software-quality problems. Any symptom in code that may indicate a deeper problem is called a code smell. Among the code smells that remain poorly addressed are confusing names, i.e. naming smells concerning identifiers. A naming smell refers chiefly to a confusing or ambiguous name appearing in an identifier in the code, such as a class name, method name, or variable name; such a name fails to summarize clearly what the identifier does or what modifying it affects. Related research has found that 70% of the source code of a software system consists of identifiers, and in practice most developers tend to skip writing documentation, so that readers must study the source code or its comments to understand it. A good identifier reduces the programmer's burden during software maintenance and lowers later maintenance costs. Proper naming of identifiers in code therefore benefits later software upgrading and maintenance.
To date, researchers have proposed a number of techniques for handling naming smells in code. Traditional logic/rule-based and context/feedback-based methods can only judge from the syntactic structure of the name or recognize naming smells semi-automatically. With the application of deep learning to this field, automatically detecting the semantic information of names has become possible. Deep-learning-based methods differ mainly in their input information and models. Almost all of them take the method body into account in the input; some attempt to add new information, such as the class name or class member-variable names, and achieve better results. On the whole, however, the added information is fragmented, and the additional information outside the method body is not fully and effectively exploited. As to the structure of the input method body, most methods preserve its original token sequence and ignore the structural information in the code. As to model design, the final effect of existing deep-learning models still leaves considerable room for improvement. In summary, although existing deep-learning methods improve on earlier work in various respects, their effectiveness can still be raised substantially. To detect method-naming smells more accurately, the invention provides method-naming smell detection for object-oriented programs based on a TBCNN + bidirectional-LSTM neural network model.
Disclosure of Invention
To improve the accuracy of method-naming smell detection, the invention provides naming-smell detection for object-oriented program methods based on a TBCNN + bidirectional-LSTM neural network model. By predicting method names, the method improves recognition accuracy.
To realize the purpose of the invention, the adopted technical scheme is summarized as follows:
A method-naming smell detection method for object-oriented programs based on a TBCNN + bidirectional-LSTM neural network model, whose input is a code data set and whose output is whether a naming smell is present. The method mainly comprises the following steps:
step 1: initialization
The parameters required by the method are initialized as follows:
M_l: the maximum method-context length, 50 by default; contexts longer than M_l are truncated;
M_s: the maximum number of subtrees in a method's abstract syntax tree, 50 by default;
M_n: the maximum number of nodes in a subtree, 70 by default;
M_c: the maximum number of child nodes of a node in a subtree, 10 by default;
B_s: the amount of data read at a time, i.e. the batch size, 128 by default;
N_f: the dimension of the feature vectors, 100 by default;
E_s: the number of rounds to continue after training performance stops improving before training stops, 5 by default;
E_m: the maximum number of training epochs, 100 by default.
The data structures required by the method are initialized as follows:
(1) Construct the method data set D = {d_i | 1 ≤ i ≤ n}, where n is the total number of data samples and d_i is the triple (T_i, M_i, A_i) of the i-th method. T_i denotes the preprocessed method-context word set of d_i, written T_i = {t_ij | 1 ≤ j ≤ t_n}, where t_ij is the j-th word of the i-th method's context T_i and t_n is the number of words in T_i. M_i denotes the preprocessed method-body information set of d_i, written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the i-th method's abstract syntax tree, at most M_s. m_j is the quadruple (Np_j, Cp_j, Nl_j, Cl_j), whose items correspond respectively to the preorder-traversal node set, the preorder-traversal child-node set, the level-order-traversal node set and the level-order-traversal child-node set of the j-th abstract-syntax subtree of the i-th method. Np_j can be written Np_j = {np_jk | 1 ≤ k ≤ n_p}, where np_jk is the name of the k-th node in Np_j and n_p, the number of nodes in Np_j, is at most M_n. Cp_j can be written Cp_j = {cp_jk | 1 ≤ k ≤ n_p}, where cp_jk is the set of child nodes of the k-th node of the j-th preorder-traversal subtree of the i-th method, written cp_jk = {c_m | 1 ≤ m ≤ n_c}, where n_c is the number of children in cp_jk, c_m is its m-th child node, and n_c is at most M_c. Nl_j and Cl_j are defined analogously to Np_j and Cp_j. A_i denotes the preprocessed method-name word set of d_i, written A_i = {a_j | 1 ≤ j ≤ t_a}, where a_j is the j-th word of the i-th method's name and t_a is the number of words in the method name.
(2) Construct the abstract-syntax-tree node-type pre-training set P = {p_i | 1 ≤ i ≤ n_p}, where n_p is the total number of pre-training samples and p_i is the abstract-syntax-tree node set of the i-th method, written p_i = {pn_j | 1 ≤ j ≤ n_n}, where n_n is the number of abstract-syntax-tree nodes. pn_j, the j-th node of the i-th method's abstract syntax tree, is the triple (pr_j, pp_j, PC_j): pr_j is the name of the j-th node, pp_j the name of its parent node, and PC_j the set of its child nodes.
(3) Construct the word-vector set E = {e_i | 1 ≤ i ≤ n}, where e_i is the pair (w_i, v_i): w_i is a word and v_i its pre-trained word vector. Construct the vocabulary V = {v_i | 1 ≤ i ≤ n}, where n is the number of words in the pre-training vocabulary file, equal to the number of words in the word-embedding file.
Step 2: information extraction
Step 2.1: data set information extraction
The data set is first processed coarsely: each class in the data set is traversed, classes that contain methods and are not empty are selected, and the class name, class annotation and member-variable names of each such class are collected and taken as the context information of its methods. The context information is then processed further: punctuation marks and special characters are removed, and words with special formats (e.g., camelCase) are split. Finally, the method-context information is passed to every method in the class.
Each method in the class is traversed; the method's name and annotation are obtained, punctuation marks and special characters are removed, the words are split by format, and the method annotation is added to the current method's context information T_i; the processed method-name word set is recorded as A_i. The abstract syntax tree of the method body is then parsed. First a preorder traversal is performed: if the current node is a statement node, it is added, as the root of a subtree, to the preorder subtree list; otherwise traversal continues with the next node. Eventually all subtree lists representing the preorder traversal of the method are obtained. The logic of the level-order traversal is the same, except that the subtrees in its subtree list are arranged in level order. Finally, the method context, the method name, and the preorder and level-order subtree lists of the method body obtained so far are stored.
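The splitting just described can be sketched as follows. Everything here is an assumption for illustration — the Node class, the statement-label set, and the choice to keep nested statements inside their enclosing subtree — while the actual embodiment operates on parsed Java abstract syntax trees:

```python
from collections import deque

class Node:
    """Illustrative AST node: a label plus children (assumption, not the patent's types)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

# Assumed statement-node labels; the real set comes from the Java grammar.
STATEMENT_LABELS = {"ExpressionStmt", "IfStmt", "ForStmt", "WhileStmt", "ReturnStmt"}

def is_statement(node):
    return node.label in STATEMENT_LABELS

def split_preorder(root):
    """Preorder walk: statement nodes become subtree roots, collected in preorder."""
    subtrees, stack = [], [root]
    while stack:
        node = stack.pop()
        if is_statement(node):
            subtrees.append(node)                  # keep the whole subtree
        else:
            stack.extend(reversed(node.children))  # preserve preorder
    return subtrees

def split_levelorder(root):
    """Same splitting, but subtree roots collected in level (breadth-first) order."""
    subtrees, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if is_statement(node):
            subtrees.append(node)
        else:
            queue.extend(node.children)
    return subtrees
```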
Step 2.2: abstract syntax tree node pre-training data set preprocessing
The abstract-syntax-tree node-type pre-training set P is constructed: each method in the data set is traversed and its abstract syntax tree parsed. Every node of the abstract syntax tree is visited in level order; the record stored for the j-th node of the i-th method is pn_j, in which the node's name pr_j, its parent's name pp_j and the name set PC_j of its children are recorded. The node set finally obtained for the method is p_i = {pn_j | 1 ≤ j ≤ n_n}, and p_i is added to P: P ← P ∪ {p_i}.
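A sketch of this record collection, reusing the illustrative Node class above; the triples mirror (pr_j, pp_j, PC_j):

```python
from collections import deque

def collect_node_records(root):
    """Level-order walk yielding (pr_j, pp_j, PC_j): node name, parent name, child names."""
    records, queue = [], deque([(root, None)])
    while queue:
        node, parent_name = queue.popleft()
        records.append((node.label, parent_name, [c.label for c in node.children]))
        queue.extend((child, node.label) for child in node.children)
    return records
```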
Step 3: Pre-training
Step 3.1: Conventional vocabulary pre-training
The input of the conventional vocabulary pre-training data set comprises the method-context and method-body information, which is preprocessed before input: punctuation marks and special characters are removed, and words with special formats (e.g., camelCase) are split. A word2vec model is constructed, the conventional vocabulary pre-training data set is input, and after training the word-vector table E and vocabulary V are obtained. The word-vector table E = {e_i | 1 ≤ i ≤ n}, where n is the number of words in the pre-trained word-embedding file and e_i is the pair (w_i, v_i): w_i is a word and v_i its pre-trained word vector. The vocabulary V = {v_i | 1 ≤ i ≤ n}, where n is the number of words in the pre-training vocabulary file, equal to the number of words in the word-embedding file.
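Step 3.1 corresponds to standard word2vec training. A minimal sketch with gensim (assuming gensim 4.x; the toy corpus is invented, and only the dimension N_f = 100 comes from the text):

```python
from gensim.models import Word2Vec

# Each "sentence" is the token list of one method context or method body,
# already lower-cased, stripped of punctuation and split on camelCase.
corpus = [
    ["check", "classpath", "for", "conflict", "predicate", "file"],
    ["method", "declaration", "parameter", "boolean"],
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)
E = {w: model.wv[w] for w in model.wv.index_to_key}  # word-vector table E
V = list(model.wv.index_to_key)                      # vocabulary V
```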
Step 3.2: abstract syntax tree node pretraining
The pre-training steps for the nodes in the abstract syntax tree are as follows.
Step 3.2.1: Count the abstract-syntax-tree node types present in the data set and put all types in a node-type list NL.
Step 3.2.2: Read B_s samples of the data set P as one batch. For each node p_i in it, obtain the index values of its node type and parent-node type in NL, and use them as inputs to the pre-training model. In the pre-training model the vector of a node x is written vec(x), vec(x) ∈ R^{N_f}, where N_f is the dimension of the feature vectors. For a non-leaf node o and its direct children c_1, c_2, …, c_n, the vector representations are vec(o), vec(c_1), vec(c_2), …, vec(c_n); W_i is the N_f × N_f weight matrix of c_i, weighted by the number of leaves under c_i, the weight value of W_i being denoted l_i; b is a bias term; num(x) denotes the number of x. vec(o) and l_i are computed as in equations (1)–(2):

vec(o) ≈ tanh(Σ_{i=1..n} l_i · W_i · vec(c_i) + b)  (1)
l_i = num(leaves under c_i) / num(leaves under o)  (2)

Step 3.2.3: Construct a hidden layer and a fully-connected layer for node-type prediction. The hidden-layer weight W_h is an N_f × h two-dimensional matrix, with N_f the feature dimension of the node types and h the number of hidden units; the fully-connected-layer weight W_s is h × l, where l is the number of node types in NL; b_h and b_s are the biases of the hidden layer and the fully-connected layer. W_h and W_s are initialized with normally distributed random numbers, and b_h and b_s are randomly initialized parameters. tanh denotes the hyperbolic-tangent activation function and f an activation function. The hidden layer is computed as in equation (3) and the fully-connected layer as in equation (4):

hidden = tanh(W_h · E_1 + b_h)  (3)
logits = f(hidden · W_s + b_s)  (4)

Step 3.2.4: Evaluate the training effect of the model with a cross-entropy loss function, as in equations (5)–(6): the predicted types logits and the true types E_r are input; after the softmax layer, with j denoting a node type in NL, logits_ij is the output value of the i-th node for the j-th node type (j ranging over the number of node types in NL, i.e. the number of classes) and Er_ij is the output value of Er_i for the j-th node type:

p_ij = exp(logits_ij) / Σ_k exp(logits_ik)  (5)
loss = −Σ_i Σ_j Er_ij · log(p_ij)  (6)

When the loss no longer decreases within E_s rounds during training, or the number of training rounds reaches the upper limit E_m, the model terminates; otherwise the next batch of data is read and step 3.2.2 is repeated.
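A sketch of the pre-training classifier of steps 3.2.2–3.2.4 in PyTorch, simplified — as a stated assumption — to predict a node's type from its parent's type index; the leaf-weighted combination of child vectors in equations (1)–(2) is omitted here:

```python
import torch
import torch.nn as nn

NL = ["MethodDeclaration", "Modifier", "Parameter", "VoidType", "BlockStmt"]  # truncated node-type list

class NodeTypePretrainer(nn.Module):
    def __init__(self, n_types, feature_dim=100, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_types, feature_dim)   # distributed node-type vectors vec(x)
        self.hidden = nn.Linear(feature_dim, hidden_dim)  # W_h, b_h -- eq. (3)
        self.out = nn.Linear(hidden_dim, n_types)         # W_s, b_s -- eq. (4)

    def forward(self, parent_idx):
        e1 = self.embed(parent_idx)            # E_1: embedded input
        hidden = torch.tanh(self.hidden(e1))   # eq. (3)
        return self.out(hidden)                # logits, eq. (4)

model = NodeTypePretrainer(n_types=len(NL))
loss_fn = nn.CrossEntropyLoss()                # softmax + cross entropy, eqs. (5)-(6)
logits = model(torch.tensor([0]))              # parent type: MethodDeclaration
loss = loss_fn(logits, torch.tensor([1]))      # true node type: Modifier
loss.backward()
```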
Step 3.3: Concatenating the word-embedding tables and vocabularies
The embedding tables and vocabularies obtained in steps 3.1 and 3.2 are concatenated to obtain the final embedding table E and vocabulary V.
Step 4: Preprocessing
The data set of step 2.1 is processed further. Each method is traversed, and the method context T_i of the i-th method is truncated to its first M_l words. For the method body, the subtree preorder-traversal list and the level-order-traversal subtree list are traversed respectively. First, each subtree in the preorder-traversal list is traversed in level order to obtain Np_j and Cp_j; then each subtree in the level-order-traversal list is traversed in level order to obtain Nl_j and Cl_j; together these form the quadruple m_j. Thus for the i-th method the method body is represented as M_i = {m_j | 1 ≤ j ≤ t_m}, with t_m the number of subtrees of the i-th method's abstract syntax tree. Finally the i-th method is represented as the triple d_i = (T_i, M_i, A_i) and added to the method data set D: D ← D ∪ {d_i}.
Step 5: Method-name prediction model construction
The model is based on the Encoder-Decoder architecture. A batch of data is first read from the data set, and the following operations are performed for each row of the batch:
step 5.1: building an embedded matrix
For the input data set D, every word in the method-context set T_i, the method-body information set M_i and the method-name word set A_i is represented as a vector: for each word, its position index is obtained from the dictionary V, the word vector e_i is fetched from the word-embedding table E by that index, and e_i replaces the original word, giving the final vector representations of T_i, M_i and A_i. In addition, because the word-embedding matrix has a fixed size, if there are fewer feature vectors than matrix rows, all remaining rows are filled with 0; if there are too many, the feature vectors exceeding the number of rows are discarded.
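The fixed-size lookup of step 5.1, with zero padding and truncation, might look like the following numpy sketch (all names are assumptions):

```python
import numpy as np

def build_embedding_matrix(words, vocab_index, embed_table, max_rows=50, dim=100):
    """Map words to vectors; zero-pad up to max_rows, truncate anything beyond."""
    matrix = np.zeros((max_rows, dim), dtype=np.float32)
    for row, word in enumerate(words[:max_rows]):   # extra words are discarded
        idx = vocab_index.get(word)
        if idx is not None:
            matrix[row] = embed_table[idx]          # e_i replaces the word
    return matrix
```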
Step 5.2: Constructing the abstract-syntax-tree embedding model (TBCNN)
Step 5.2.1: Define the convolution layer. A feature detector of fixed depth and size n is used, covering the word vectors x_1, x_2, …, x_n of n abstract-syntax-tree nodes; the local weight matrix W_conv is a k × n two-dimensional matrix, where k is the word-embedding dimension. To handle varying numbers of child nodes, three weight matrices W_conv^t, W_conv^l and W_conv^r are used, and W_conv,i is a linear combination of the three. With the activation function tanh and bias term b_conv, the final computation is as in equation (7):

y = tanh(Σ_{i=1..n} W_conv,i · x_i + b_conv)  (7)

Step 5.2.2: Define the attention layer. The subtree representation vector c_i is generated by weighted fusion of the convolution-layer outputs, computed as in equations (8)–(10), where W_a, v_u and b_a are randomly initialized parameters, tanh is the activation function, and α_i is the weighting factor:

c_i = Σ_j α_j · y_j  (8)
α_j = exp(e_j) / Σ_k exp(e_k)  (9)
e_i = a(y, v_i) = v_u · tanh(W_a · y + b_a)  (10)

Step 5.2.3: Define a hidden layer and a softmax layer, with weight W_h and bias b_h and the activation function tanh, computed as in equation (11); the softmax layer has the same form as equation (5). The result is the feature vector s_h representing one subtree:

s_h = tanh(c_i · W_h + b_h)  (11)
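A compact PyTorch sketch of steps 5.2.1–5.2.3 — the three-matrix "continuous binary tree" convolution of equation (7), the attention pooling of equations (8)–(10), and the hidden layer of equation (11). How windows are slid over the tree and how the position coefficients η are computed are simplifying assumptions here:

```python
import torch
import torch.nn as nn

class TBCNNSubtreeEncoder(nn.Module):
    def __init__(self, k=100):                    # k: word-embedding dimension
        super().__init__()
        self.w_t = nn.Linear(k, k, bias=False)    # "top" weight matrix
        self.w_l = nn.Linear(k, k, bias=False)    # "left" weight matrix
        self.w_r = nn.Linear(k, k, bias=False)    # "right" weight matrix
        self.b_conv = nn.Parameter(torch.zeros(k))
        self.att = nn.Linear(k, k)                # W_a, b_a -- eq. (10)
        self.v_u = nn.Parameter(torch.randn(k))   # v_u      -- eq. (10)
        self.out = nn.Linear(k, k)                # W_h, b_h -- eq. (11)

    def conv(self, window, eta):
        # window: (n, k) node vectors; eta: (n, 3) position coefficients.
        mixed = (eta[:, 0:1] * self.w_t(window)
                 + eta[:, 1:2] * self.w_l(window)
                 + eta[:, 2:3] * self.w_r(window))          # W_conv,i · x_i
        return torch.tanh(mixed.sum(dim=0) + self.b_conv)   # eq. (7)

    def attention_pool(self, y):
        # y: (m, k) convolved window vectors -> subtree vector c_i.
        e = torch.tanh(self.att(y)) @ self.v_u    # eq. (10)
        alpha = torch.softmax(e, dim=0)           # eq. (9)
        return (alpha.unsqueeze(1) * y).sum(dim=0)  # eq. (8)

    def forward(self, windows, etas):
        y = torch.stack([self.conv(w, e) for w, e in zip(windows, etas)])
        return torch.tanh(self.out(self.attention_pool(y)))  # s_h, eq. (11)

enc = TBCNNSubtreeEncoder()
windows = [torch.randn(3, 100), torch.randn(2, 100)]  # stand-in node windows
etas = [torch.rand(3, 3), torch.rand(2, 3)]           # stand-in position coefficients
s_h = enc(windows, etas)                              # one subtree vector
```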
Step 5.3: constructing bidirectional LSTM layers
Step 5.3.1: The subtree preorder-traversal list vector vec(Mp_i) and the subtree level-order-traversal list vector vec(Ml_i) of the i-th method, obtained from TBCNN, are treated as concatenations of subtree vectors, in which sp_tm and sl_tm denote the last subtree row vectors; the context vector vec(T_i) is treated as the concatenation of the word vectors, with vec(t_tn) the last feature-word vector. This is expressed in equations (12)–(14):

vec(Mp_i) = [sp_1; sp_2; …; sp_tm]  (12)
vec(Ml_i) = [sl_1; sl_2; …; sl_tm]  (13)
vec(T_i) = [vec(t_1); vec(t_2); …; vec(t_tn)]  (14)
Step 5.3.2: The row vectors of vec(Mp_i), vec(Ml_i) and vec(T_i) are fed into the recurrent unit in sequence, yielding for each row vector x_t the corresponding output vector h_t. The computation can be abstracted as equations (15)–(20), where c_t denotes the state of the recurrent unit at the current time step, c_{t-1} the state at the previous step, and σ the Sigmoid activation function; W_f, W_i, W_o, W_c, b_f, b_i, b_o and b_c are randomly initialized parameters.
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)  (15)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)  (16)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)  (17)
q_t = tanh(W_c · [h_{t-1}, x_t] + b_c)  (18)
c_t = f_t * c_{t-1} + i_t * q_t  (19)
h_t = o_t * tanh(c_t)  (20)
Step 5.3.3: The row vectors of vec(Mp_i), vec(Ml_i) and vec(T_i) are fed into the recurrent unit in reverse row order, yielding for each reversed row vector x_t the corresponding output vector h_t′; h_t′ is computed exactly as in step 5.3.2.
Step 5.3.4: The output vectors h_t and h_t′ of steps 5.3.2 and 5.3.3 are concatenated at each time step into the vector H_t, and the H_t produced at all time steps are fused by weighting into a representation vector. This yields the preorder-traversal method-body vector Fp_i, the level-order-traversal method-body vector Fl_i, and the method-context vector Fc_i.
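Steps 5.3.2–5.3.4 amount to a bidirectional LSTM over each vector sequence followed by weighted fusion of the per-step outputs H_t. A sketch using PyTorch's built-in nn.LSTM in place of the explicit gate equations (15)–(20):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, dim=100, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)    # scoring for the weighted fusion of H_t

    def forward(self, seq):                    # seq: (batch, steps, dim)
        H, _ = self.lstm(seq)                  # H_t = [h_t ; h_t'] at every step
        alpha = torch.softmax(self.att(H), dim=1)
        return (alpha * H).sum(dim=1)          # weighted fusion -> representation vector

# Stand-in inputs of shape (batch, steps, dim); in the method these come from
# TBCNN (subtree vectors) and the embedding matrix (context word vectors).
pre_seq, lev_seq, ctx_seq = (torch.randn(1, 50, 100) for _ in range(3))
enc = BiLSTMEncoder()
Fp_i, Fl_i, Fc_i = enc(pre_seq), enc(lev_seq), enc(ctx_seq)
```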
Step 5.4: Constructing the attention layer
The vectors Fp_i, Fl_i and Fc_i obtained in step 5.3.4 are fused by weighting into the final method representation vector F_i, following equations (8)–(10).
Step 5.5: Constructing the recurrent layer
Step 5.5.1: The matrix F_i formed in step 5.4 is treated as a series of row vectors x_i, as in equation (21), where x_t denotes the last row vector:

F_i = [x_1; x_2; …; x_t]  (21)

Step 5.5.2: Each row vector x_i of F_i is fed into the recurrent unit in sequence, yielding the corresponding output vector h_i, computed as in equations (15)–(20).
Step 5.6: Iteration termination condition
The model's training effect is evaluated with a cross-entropy loss function, where name_i is the true method name and p is the prediction vector, following equations (5)–(6). When the loss no longer decreases within E_s rounds during training, or the number of training rounds reaches the upper limit E_m, the model terminates and step 5.7 is executed; otherwise the next batch of data is read and processing resumes from step 5.1.
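The termination rule of step 5.6 (and of step 3.2.4) is ordinary early stopping; a sketch of the bookkeeping, with E_m and E_s from step 1 and a placeholder training function:

```python
MAX_EPOCHS, PATIENCE = 100, 5          # E_m and E_s from step 1

def train_one_epoch():
    # Placeholder: one pass over all batches, returning the epoch loss.
    return 1.0

best_loss, stale_rounds = float("inf"), 0
for epoch in range(MAX_EPOCHS):
    loss = train_one_epoch()
    if loss < best_loss:
        best_loss, stale_rounds = loss, 0
    else:
        stale_rounds += 1
    if stale_rounds >= PATIENCE:       # no improvement within E_s rounds
        break
```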
Step 5.6: model evaluation
And (3) enabling q to represent a model number, evaluating the model q on a verification set, acquiring the accuracy, recall rate and F1 value of the model on the prediction of the method name, and writing the accuracy, recall rate and F1 value into a log file.
Step 5.8: Method-name smell detection
The fully trained model is used to detect method-name smells in a given code file or folder: if the original method name is semantically similar to the predicted method name, the current method name is judged to have no code smell; otherwise the predicted method name is recommended as the method name of the original method.
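The decision in step 5.8 turns on a semantic-similarity test between the original and predicted names. The patent does not fix a particular measure, so the following sketch uses cosine similarity of averaged word vectors with an assumed threshold:

```python
import numpy as np

def name_vector(name_words, embed, dim=100):
    """Average the word vectors of a split, lower-cased method name."""
    vecs = [embed[w] for w in name_words if w in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def has_naming_smell(original, predicted, embed, threshold=0.7):  # threshold is an assumption
    a, b = name_vector(original, embed), name_vector(predicted, embed)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    similarity = float(a @ b / denom) if denom else 0.0
    return similarity < threshold      # dissimilar -> report smell, recommend predicted name

embed = {"check": np.random.rand(100), "conflict": np.random.rand(100)}  # stand-in vectors
print(has_naming_smell(["check"], ["conflict"], embed))
```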
Compared with the prior art, the invention has the following characteristics:
(1) Based on deep learning, the method automatically detects the semantic information of names and recommends a suitable method-name sequence, whereas traditional methods can only judge from the syntactic structure or recognize naming smells semi-automatically.
(2) The invention comprehensively considers the information a method name depends on — the method body and the method context — and applies a different processing mode to each kind of dependency information: the method context is handled as a sequence, while the method body is handled as an abstract syntax tree, for which the invention newly proposes a processing mode combining splitting with preorder- and level-order-traversed subtrees. Existing deep-learning method-name prediction approaches consider a single kind of information, and even when many factors are considered, they do not apply different processing modes to the different characteristics of the information.
(3) The invention provides a deep-learning-based method-name smell detection model whose input is the different representations of a method and whose output is the predicted method-name sequence. The model comprises abstract-syntax-tree embedding (TBCNN), bidirectional LSTM, an attention mechanism and the like, and comprehensively accounts for the characteristics of the input information.
Drawings
FIG. 1 is the overall flow diagram of the method of the invention;
FIG. 2 illustrates the splitting and traversal of an abstract syntax tree in the invention;
FIG. 3 shows the deep-learning-based method-name smell detection model of the invention.
Detailed Description
The invention provides a deep-learning-based method-naming smell detection method that infers a method name from the input method body and method-context information, thereby detecting whether the original method name carries a code smell.
The method is described in detail below with a concrete implementation. For convenience of illustration, a data sample is simulated (because a method body can occupy considerable space, no method-body information is given in the table), as shown in Table 1:
step 1: initialization
Parameters and data structures are initialized: e.g. the maximum number of subtrees in an abstract syntax tree M_s = 50, the maximum number of nodes in a subtree M_n = 70, the maximum number of child nodes of a node in a subtree M_c = 10, the batch size B_s = 128, the feature-vector dimension N_f = 100, the early-stopping patience E_s = 5, and so on; data structures such as the method data set D and the method pre-training set P are also initialized.
Step 2: information extraction
Step 2.1: data set information extraction
The data set information extraction is mainly divided into two parts:
a) For the method-context information and method-name information, special characters are removed, all characters are converted to lower case, and so on;
b) For the method-body information, the method body is represented as an abstract syntax tree, which is split at statement nodes; the subtrees are ordered by a combination of preorder traversal and level-order traversal; the abstract-syntax-tree node pre-training data set is also preprocessed.
After process a), T_i = {check classpath for conflict predicate file collection classpath a task for checking the classpath for conflictig} and the method-name set A_i are obtained; after process b), the method-body information is expressed as {"pre-tree":
"MethodDeclaration","children":[{"node":"Parameter","children":[{"node":"ClassOrInterfaceType","children":[]}]},{"node":"Parameter","children":[{"node":"ClassOrInterfaceType","children":[{"node":"ClassOrInterfaceType","children":[{"node":"ClassOrInterfaceType","children":[]}]}]}]},{"node":"PrimitiveType","children":[{"node":"boolean","children":[]}]}],……}.
step 2.2: abstract syntax tree node pre-training data set preprocessing
The abstract-syntax-tree node-type pre-training set P is constructed: each method in the data set is traversed and its abstract syntax tree parsed. Each node of the abstract syntax tree is visited in level order, and its name, its parent's name and the name set of its children are recorded, finally yielding the method's node set p_i. After these steps, the method in the example gives p_i = {{"node":"MethodDeclaration","parent":null,"children":"Modifier,SimpleName,Parameter,VoidType,BlockStmt"},{"node":"Modifier","parent":"MethodDeclaration","children":""},……}
Step 3: Pre-training
Step 3.1: Conventional vocabulary pre-training
The context information and method bodies of the methods are input to a word2vec model, which finally outputs a word-vector table and a vocabulary, e.g. the word-embedding table E = {(final: -6.2741876, 1.5189061, 2.5367842, -0.6484467, -1.8472842, 8.301965, ……), (class: -3.816071, -0.14647038, 1.5672442, -3.8296738, -0.08557652, -0.8994526, ……), ……} and the vocabulary V = {final, class, ……}
Step 3.2: abstract syntax tree node pretraining
The pre-training set P is input and the node types appearing in it are counted, expressed as NL = {'MethodDeclaration', 'null', 'Parameter', 'VoidType', 'BlockStmt', 'ClassOrInterfaceType', 'ExpressionStmt', 'AssignExpr', 'FieldAccessExpr', 'NameExpr', 'ThisExpr', 'MarkerAnnotationExpr', …}. The data of P are then fetched in batches, each node type and parent-node type is mapped to its index value in NL, and these are used as inputs to the pre-training model; e.g. for the node name MethodDeclaration the mapped value in NL is 0, so for the training-set record {"node":"Modifier","parent":"MethodDeclaration","children":""} the inputs are 49 (the mapped value of Modifier) and 0 (the mapped value of MethodDeclaration). 49 and 0 are then given distributed embedded representations and input to the pre-training model for training. After training completes, the embedded representation E of the node types is obtained, e.g. {(MethodDeclaration: 0.61046493, 0.94142413, 0.7241914, 0.1924851, 0.8261522, 0.47195005, 0.2165761, …), (Parameter: 0.70050824, 0.7623019, 0.44217777, 0.9415456, 0.63250136, 0.04051292, 0.50738966, 0.8111124, 0.80725145, 0.1672548, 0.70254636, …), …}, together with the vocabulary V, e.g. {MethodDeclaration, Parameter, VoidType, ……}
Step 3.3: Concatenating the word-embedding tables and vocabularies
The word-vector tables and vocabularies obtained above are concatenated.
Step 4: Preprocessing
The method-body information is processed once more, as follows: each subtree is traversed in level order, the visited nodes are stored in a node list, and the child-node list corresponding to each node is recorded under that list and stored in the child-node list.
For example, the child-node list finally output is [[[1,2,3], [4,0,0], [5,0,0], [6,0,0], [7,0,0], [8,0,0], ……]……].
Step 5: Method-name prediction model construction
Step 5.1: Constructing the embedding matrix
First the embedding matrix is constructed. For each word in the method context {check classpath for conflict predicate file collection classpath a task for checking the classpath for conflictig}, the word's position is looked up in the dictionary and the corresponding word vector is found in the word-vector table as a replacement; e.g. the position of check in the dictionary is 287, and its vector in the word-vector table is [-1.8292934, 0.4143431, 2.0705438, -0.69411916, -1.3756107, 2.1083343, -3.0396674, -2.699084, -2.0240371, 1.2281425, 2.3237095, ……]. This vector is added to the embedding sequence of the current context, and the final context sequence is expressed as {(-1.8292934, 0.4143431, 2.0705438, -0.69411916, -1.3756107, 2.1083343, -3.0396674, -2.699084, -2.0240371, 1.2281425, 2.3237095, ……), (0.75084424, -0.7415062, -2.5481532, 0.85623363, 3.7556057, 2.0467503, -0.7671489, -1.6196132, 1.853755, -2.0131462, -1.0526531), ……}.
For the method-body lists [['MethodDeclaration', 'Parameter', 'PrimitiveType', 'ClassOrInterfaceType', 'ClassOrInterfaceType', 'Boolean', 'ClassOrInterfaceType']……], the corresponding word vectors are found by the same logic, finally expressed as subtonodi_pre: [[0.61046493, 0.94142413, 0.7241914, 0.1924851, 0.8261522, 0.47195005, 0.2165761, 0.011403203, 0.6280153, 0.6820903, 0.58157563, ……], [0.70050824, 0.7623019, 0.44217777, 0.9415456, 0.63250136, 0.04051292, 0.50738966, 0.8111124, 0.80725145, 0.1672548, 0.70254636, ……], ……].
Step 5.2: constructing an abstract syntax tree embedding model
This model represents a subtree as a vector: given the preorder-traversal node list and child-node list of an abstract-syntax subtree, it outputs the subtree's vector. For example, with input [0.61046493, 0.94142413, 0.7241914, 0.1924851, 0.8261522, 0.47195005, 0.2165761, 0.011403203, 0.6280153, 0.6820903, 0.58157563, ……] and [[1,2,3], [4,0,0], [5,0,0], [6,0,0], [0,0,0], [7,0,0], [0,0,0], [8,0,0], [0,0,0]], it finally outputs a vector representing the subtree: [0.8036274, 0.07352376, 0.67833805, 0.11362302, 0.08857036, 0.60882676, 0.6897093, 0.32708895, 0.67785764, 0.043533087, 0.5546485, ……]. When there are n preorder subtrees, the model outputs n vectors, one per subtree, recorded as the subtree preorder-traversal vectors. Likewise, the embedding matrices of the level-order node lists and child-node lists are input to the model: for n level-order subtrees, n vectors representing the subtrees are output and recorded as the subtree level-order-traversal vectors.
Step 5.3: constructing bidirectional LSTM layers
The preorder-traversal vectors, the level-order-traversal vectors and the method-context vectors are processed separately but with identical logic: taking the preorder subtree matrix as an example, it is treated as a series of subtree row vectors that are fed into the recurrent layer in sequence to obtain an output vector for each row vector; the rows are then fed in reverse order to obtain a second set of output vectors; finally the outputs are fused by weighting into a representation vector.
Step 5.4: constructing layers of attention
The three vectors obtained in step 5.3 are fused by weighting to form the representation vector of the final method.
Step 5.5: Constructing the recurrent layer
The matrix formed in step 5.4 is treated as a concatenation of row vectors, which are input to the recurrent layer to generate a word-vector sequence; the corresponding words are found from the word vectors, and a predicted word sequence is finally output. For the example method, the finally generated method-name sequences may be can conflict, is conflict, whether conflict, and so on. The predicted name is compared with the real method name, their similarity is judged, and a result stating whether the final method name contains a smell is given.
In summary, the invention provides a deep-learning-based method-naming smell detection method. Building on the characteristics of deep learning, it analyzes the information a method name depends on and applies a different processing mode to each structure: sequential input for the method-context information, and an abstract syntax tree for the method-body information, which is further split and whose subtrees are traversed in preorder and level order to obtain the final preprocessed input structure. A deep-learning-based method-name prediction model is then constructed, trained and optimized to complete the prediction of method-name sequences. The preprocessing considers the information comprehensively, and the model design likewise accounts for the characteristics of each kind of input information.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention should be included within the scope of the present invention.
Claims (6)
1. A deep-learning-based method-naming smell detection method for object-oriented programs, characterized in that: a code data set is input and a smell detection result is output;
the method comprises the following steps:
(1) initializing required parameters and constructing a required related data structure;
(2) extracting information; inputting a code data set, and acquiring context information of each method, wherein the context information comprises a class name, a class annotation, a member variable name in the class and a method annotation; extracting an abstract syntax tree of each method, then splitting and traversing, and preprocessing an abstract syntax tree node pre-training data set;
(3) pre-training; respectively pre-training words in the method context and the method body and node types in the abstract syntax tree;
(4) preprocessing; the method-context information and the method-body information are preprocessed respectively;
(5) constructing a method-name prediction model; an Encoder-Decoder model is constructed, and the final detection result is output for the input data.
2. The deep-learning-based method-naming smell detection method for object-oriented programs as claimed in claim 1, characterized in that step (1) initializes the parameters required by the method and constructs the required related data structures:
(2-1) parameters required for initialization: M_l denotes the maximum method-context length; M_s denotes the maximum number of subtrees in a method's abstract syntax tree; M_n denotes the maximum number of nodes in a subtree; M_c denotes the maximum number of child nodes of a node in a subtree; B_s denotes the amount of data read at a time, i.e. the batch size; N_f denotes the dimension of the feature vectors; E_s denotes the number of rounds to continue after training performance stops improving before training is stopped;
(2-2) related data structures required for construction:
the method data set D = {d_i | 1 ≤ i ≤ n}, where n is the total number of data samples and d_i is the triple (T_i, M_i, A_i) of the i-th method; T_i denotes the preprocessed method-context word set of d_i, written T_i = {t_ij | 1 ≤ j ≤ t_n}, where t_ij is the j-th word of the i-th method's context T_i and t_n is the number of words in T_i; M_i denotes the preprocessed method-body information set of d_i, written M_i = {m_j | 1 ≤ j ≤ t_m}, where t_m is the number of subtrees of the i-th method's abstract syntax tree, at most M_s; m_j is the quadruple (Np_j, Cp_j, Nl_j, Cl_j), whose items correspond respectively to the preorder-traversal node set, the preorder-traversal child-node set, the level-order-traversal node set and the level-order-traversal child-node set of the j-th abstract-syntax subtree of the i-th method; Np_j can be written Np_j = {np_jk | 1 ≤ k ≤ n_p}, where np_jk is the name of the k-th node in Np_j and n_p, the number of nodes in Np_j, is at most M_n; Cp_j can be written Cp_j = {cp_jk | 1 ≤ k ≤ n_p}, where cp_jk is the set of child nodes of the k-th node of the j-th preorder-traversal subtree of the i-th method, written cp_jk = {c_m | 1 ≤ m ≤ n_c}, where n_c is the number of children in cp_jk, c_m is its m-th child node, and n_c is at most M_c; Nl_j and Cl_j are defined analogously to Np_j and Cp_j; A_i denotes the preprocessed method-name word set of d_i, written A_i = {a_j | 1 ≤ j ≤ t_a}, where a_j is the j-th word of the i-th method's name and t_a is the number of words in the method name;
the abstract-syntax-tree node-type pre-training set P = {p_i | 1 ≤ i ≤ n_p}, where n_p is the total number of pre-training samples and p_i is the abstract-syntax-tree node set of the i-th method, written p_i = {pn_j | 1 ≤ j ≤ n_n}, where n_n is the number of abstract-syntax-tree nodes and pn_j, the j-th node of the i-th method's abstract syntax tree, is the triple (pr_j, pp_j, PC_j): pr_j is the name of the j-th node, pp_j the name of its parent node, and PC_j the set of its child nodes.
3. The deep-learning-based method-naming smell detection method for object-oriented programs as claimed in claim 1, characterized in that step (2) extracts information: a code data set is input, and the context information of each method is obtained, comprising the class name, the class annotation, the member-variable names in the class and the method annotation; the abstract syntax tree of each method is extracted and then split and traversed; the abstract-syntax-tree node pre-training data set is also preprocessed:
(2-1) for the method-context information, punctuation marks and special characters are removed, and words with special formats (e.g., camelCase) are split; the method-body information is represented as an abstract syntax tree and split at statement nodes, and the subtrees are ordered by combining preorder traversal and level-order traversal;
(2-2) the abstract-syntax-tree node pre-training data set is preprocessed: the node-type pre-training set P is constructed by traversing each method in the data set and parsing its abstract syntax tree; each node of the abstract syntax tree is visited in level order, its name, its parent's name and the name set of its children are recorded, the node-set representation of the method is obtained, and the method's node set is added to the pre-training set.
4. The deep-learning-based method-naming smell detection method for object-oriented programs as claimed in claim 1, characterized in that step (3) pre-trains the words in the method contexts and method bodies and the node types in the abstract syntax trees:
(3-1) conventional vocabulary pre-training: the input of the conventional vocabulary pre-training data set comprises the method-context and method-body information, which is first preprocessed: punctuation marks and special characters are removed, and words with special formats (e.g., camelCase) are split; a word2vec model is constructed, the conventional vocabulary pre-training data set is input, and a word-vector table and a vocabulary are obtained after training;
(3-2) abstract-syntax-tree node pre-training: the node types NL appearing in the data set P are first counted, and the index values of each node type and its parent-node type in NL are obtained; after being input to the model they are represented as distributed embedded vectors, a hidden layer and a fully-connected layer are constructed for node-type prediction, and the training effect is evaluated with a cross-entropy loss function; the model terminates when the loss no longer decreases during training or the number of training rounds reaches the upper limit, and the pre-trained word-vector table and vocabulary are obtained;
(3-3) the word-vector tables and vocabularies obtained in the previous two steps are concatenated.
5. The deep-learning-based method-naming smell detection method for object-oriented programs as claimed in claim 1, characterized in that step (4) preprocesses the method-context information and the method-body information respectively:
the data set of step (2) is processed further; each method is traversed, and the method context T_i of the i-th method is truncated to its first M_l words; for the method body, the subtree preorder-traversal list and the level-order-traversal subtree list are traversed respectively; each subtree of the preorder-traversal list is traversed in level order to obtain Np_j and Cp_j, and each subtree of the level-order-traversal list is traversed in level order to obtain Nl_j and Cl_j, which are then represented as the quadruple m_j; thus for the i-th method the method body is represented as M_i = {m_j | 1 ≤ j ≤ t_m}, with t_m the number of subtrees of the i-th method's abstract syntax tree; finally the i-th method is represented as the triple d_i = (T_i, M_i, A_i) and added to the method data set D: D ← D ∪ {d_i}.
6. The deep-learning-based method-naming smell detection method for object-oriented programs as claimed in claim 1, characterized in that in step (5) an Encoder-Decoder model is constructed and the finally predicted method-name sequence is output for the input data:
(5-1) the embedding matrix is constructed: for each word in the method context, the method body and the method-name vector, the word's position index is obtained from the dictionary, and the word vector is then fetched from the word-embedding table by that index; because the word-embedding matrix has a fixed size, if there are fewer feature vectors than matrix rows the remaining rows are filled with 0, and if there are too many, the feature vectors exceeding the number of rows are discarded;
(5-2) the abstract-syntax-tree embedding model TBCNN is constructed, defining a convolution layer, an attention layer, a hidden layer and a softmax layer to represent each abstract-syntax subtree as a vector;
(5-3) a bidirectional LSTM is constructed: the subtree preorder-traversal list vector vec(Mp_i) and the subtree level-order-traversal list vector vec(Ml_i) of the i-th method obtained from TBCNN are treated as concatenations of subtree vectors, and the context vector vec(T_i) as the concatenation of the word vectors; the row vectors of vec(Mp_i), vec(Ml_i) and vec(T_i) are fed into the recurrent unit in forward and reverse order respectively, the forward and reverse outputs are concatenated, and three fused vectors are obtained: the preorder-traversal method-body vector Fp_i, the level-order-traversal method-body vector Fl_i and the method-context vector Fc_i;
(5-4) an attention layer is constructed, and the vectors Fp_i, Fl_i and Fc_i are fused by weighting into the final method representation vector F_i;
(5-5) a recurrent layer is constructed: the vector F_i is treated as a series of row vectors x_i, each of which is fed into the recurrent layer to obtain the corresponding output vector h_i;
(5-6) iteration terminates: the training effect is evaluated with a cross-entropy loss function; training terminates when the loss no longer decreases during training or the number of training rounds reaches the upper limit, otherwise the next batch of data is read and processing restarts from step (5-1);
(5-7) the model is evaluated on the validation set; its precision, recall and F1 score for method-name prediction are obtained and written to a log file;
(5-8) the fully trained model is used to detect method-name smells in a given code file or folder: if the original method name is semantically similar to the predicted name, the current method name is judged free of code smell; otherwise the predicted name is recommended as the method name for the original method.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210059016.4A | 2022-01-18 | 2022-01-18 | Method-naming smell detection for object-oriented programs based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114398076A (en) | 2022-04-26 |
Family
ID=81231963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210059016.4A Pending CN114398076A (en) | 2022-01-18 | 2022-01-18 | Object-oriented program method named odor detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114398076A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115268994A (en) * | 2022-07-26 | 2022-11-01 | Ocean University of China | Code feature extraction method based on TBCNN and multi-head self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |