CN110008344B - Method for automatically marking data structure label on code - Google Patents
Method for automatically marking data structure label on code
- Publication number
- CN110008344B CN110008344B CN201910304797.7A CN201910304797A CN110008344B CN 110008344 B CN110008344 B CN 110008344B CN 201910304797 A CN201910304797 A CN 201910304797A CN 110008344 B CN110008344 B CN 110008344B
- Authority
- CN
- China
- Prior art keywords
- node
- code
- nodes
- vector
- codes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a method for automatically tagging code with data structure labels, belonging to the field of natural language processing under artificial intelligence. The method comprises the following steps: converting the code into an abstract syntax tree using a lexical analyzer and a syntax analyzer; modeling the abstract syntax tree and encoding each node on the tree from bottom to top using an attention mechanism and residual blocks to obtain an encoding of the whole tree; and finally tagging the code with data structure labels through a classifier in the model. The method can tag code with data structure labels automatically, reducing the workload of manually labeling code.
Description
Technical Field
The invention belongs to the field of natural language processing under artificial intelligence, and particularly relates to a method for automatically tagging code with data structure labels.
Background
With the spread of the internet, a large amount of high-quality code has appeared online, but much of it carries no data structure labels, which makes it inconvenient for users to search and learn from; manually labeling massive amounts of code with data structure labels is unrealistic.
Disclosure of Invention
The invention provides a method for automatically tagging code with data structure labels. The method converts code into an abstract syntax tree using a lexical analyzer and a syntax analyzer, performs word embedding on each word, and then encodes each node on the tree from bottom to top using residual blocks and an attention mechanism. This finally yields an encoding of the root node that contains the syntactic and semantic expression of all sub-nodes as well as the semantic expression of the node itself, and the expression of the root node is used for classification.
The method of the invention for automatically tagging code with data structure labels comprises the following steps:
Step 1: collect a large number of codes annotated with data structure labels from web pages using crawler technology.
Step 2: because different languages have different grammars, different lexical analyzers are required for different languages. The lexical analyzer replaces the different types of tokens in the code with corresponding words: numbers such as 1 and 1.1 are replaced with Num, all variable names are replaced with Name, and all character strings are replaced with Str; the keywords of the language are not replaced.
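For illustration, a minimal sketch of this normalization for Python source, using the standard tokenize and keyword modules (the helper name normalize_tokens and the sample snippet are ours, not from the patent):

```python
import io
import keyword
import tokenize

def normalize_tokens(source: str) -> str:
    """Replace literals and identifiers with placeholder words while
    keeping language keywords, mirroring the Step 2 normalization."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NUMBER:
            result.append((tok.type, "Num"))       # 1, 1.1, ... -> Num
        elif tok.type == tokenize.STRING:
            result.append((tok.type, "Str"))       # all strings -> Str
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append((tok.type, "Name"))      # variable names -> Name
        else:
            result.append((tok.type, tok.string))  # keywords/operators kept
    return tokenize.untokenize(result)

print(normalize_tokens("if x > 1.1:\n    y = 'abc'\n"))
# keywords and operators survive; Name/Num/Str replace identifiers and
# literals (token spacing in the untokenized output is approximate)
```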
Step 3: a corresponding syntax analyzer is used for each language; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree.
Step 4: perform word embedding on the words generated by lexical and syntactic analysis, such as Num, Name, the root node Module, and the assignment operation Assign.
Step 5: apply the same residual block Reb to the embedded encoding of each node as a nonlinear transformation, obtaining a new semantic encoding.
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
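A minimal PyTorch sketch of this residual block, together with the embedding lookup of step 4 (the framework choice, the class name ResidualBlock, and all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Reb: e' = LN(W_2 · ReLU(W_1 · e) + e), as in the formula above."""
    def __init__(self, embedding_size: int, d_i: int):
        super().__init__()
        self.w1 = nn.Linear(embedding_size, d_i, bias=False)  # W_1
        self.w2 = nn.Linear(d_i, embedding_size, bias=False)  # W_2
        self.ln = nn.LayerNorm(embedding_size)                # LN

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.ln(self.w2(torch.relu(self.w1(e))) + e)

# word embedding (step 4) followed by the residual transform (step 5);
# vocabulary size and dimensions are invented hyper-parameters
embed = nn.Embedding(num_embeddings=500, embedding_dim=128)
reb_q = ResidualBlock(embedding_size=128, d_i=256)
node_id = torch.tensor([7])          # index of, say, the BinOp node
e_prime = reb_q(embed(node_id))      # new semantic encoding e'
```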
Step 6: encode the non-leaf nodes on the tree from bottom to top; an attention mechanism is used to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node.
V_c = A·H^T
A=softmax(score(Q,H))
Wherein Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e' is the vector encoding of the current node, V_c is the attention vector, Reb is a residual block, b is a bias value, ReLU is the ReLU activation function, and e'' is the vector encoding obtained by fusing the residual-block encoding of the current node vector e' with the residual-block encoding of the attention vector V_c.
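A sketch of the attention-and-fusion computation under a dot-product choice of score (the patent allows three score variants without naming them; dot-product, bilinear, and concatenation forms are plausible candidates, and the helper below, attend_and_fuse, uses the dot-product form as an assumption). ResidualBlock is reused from the sketch above:

```python
import torch

def attend_and_fuse(e_prime, children, reb_q, reb_c, b):
    """e_prime: (d,) current-node vector e' after the residual block;
    children: (n, d) sub-node vectors after the residual block (H).
    Returns e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)."""
    q = e_prime.expand(children.size(0), -1)        # Q: n stacked copies of e'
    scores = (q * children).sum(dim=-1)             # dot-product score(Q, H)
    a = torch.softmax(scores, dim=0)                # A = softmax(score(Q, H))
    v_c = a @ children                              # V_c = A · H^T
    return torch.relu(reb_q(e_prime) + reb_c(v_c) + b)

# shapes are illustrative; reb_q and e_prime come from the previous sketch
reb_c = ResidualBlock(embedding_size=128, d_i=256)
bias = torch.zeros(128)
children = torch.randn(3, 128)                      # three encoded sub-nodes
e_double_prime = attend_and_fuse(e_prime.squeeze(0), children, reb_q, reb_c, bias)
```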
Step 7: following the above formulas, compute the expression of each node on the tree from bottom to top, and finally use the expression of the root node for classification. Because a piece of code may belong to several categories, several sigmoid classifiers are used to obtain multiple data structure labels.
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
where e'_r is the semantic expression of the root node, W_1 and W_2 are parameters, b is a bias value, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
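A minimal sketch of this multi-label head (the class name MultiLabelHead, the hidden size, and the three-label example are assumptions):

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """y_i = sigmoid(W_2 · ReLU(W_1 · e'_r) + b) for each label i."""
    def __init__(self, embedding_size: int, hidden: int, num_labels: int):
        super().__init__()
        self.w1 = nn.Linear(embedding_size, hidden)
        self.w2 = nn.Linear(hidden, num_labels)  # bias of w2 plays the role of b

    def forward(self, e_root: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w2(torch.relu(self.w1(e_root))))

head = MultiLabelHead(embedding_size=128, hidden=256, num_labels=3)
probabilities = head(torch.randn(128))  # one probability per data structure label
```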
Step 8: train the model. The whole model is trained using a large number of codes labeled with data structures. First, a lexical analyzer performs lexical analysis on the code, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. A syntax analyzer then converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree is embedded, i.e., mapped to its corresponding real-valued vector, and a residual block applies a nonlinear transformation to each node's embedded encoding to obtain a new semantic encoding, as in the following formula:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node.
V_c = A·H^T
A=softmax(score(Q,H))
Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e'' is the expression obtained by fusing the vector e' of the current node with the attention vector V_c.
Finally, the encoding of the root node is used for classification; because the code may belong to several categories, several sigmoid classifiers are used to obtain multiple data structure labels.
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
where e'_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Passing the root-node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value. Each parameter is then updated through back-propagation of the gradient, achieving the training effect.
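A hedged sketch of one training step, reusing MultiLabelHead from above; binary cross-entropy matches the predicted-versus-true probability difference described here, while the optimizer choice and learning rate are our assumptions, unspecified by the patent:

```python
import torch
import torch.nn as nn

head = MultiLabelHead(embedding_size=128, hidden=256, num_labels=3)
# in a full model the optimizer would also cover the embeddings and
# residual blocks; only the head is shown here for brevity
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = nn.BCELoss()  # difference between predicted and true probabilities

e_root = torch.randn(128)               # stand-in for the root-node encoding
target = torch.tensor([1.0, 0.0, 1.0])  # e.g. labeled as tree and queue

optimizer.zero_grad()
loss = loss_fn(head(e_root), target)    # loss from the probability difference
loss.backward()                         # back-propagate the gradient
optimizer.step()                        # update each parameter
```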
Step 9: predict new code using the trained model. Given a new piece of code, a lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. A syntax analyzer converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree, such as Num and Name, is embedded, i.e., mapped to its corresponding real-valued vector; a residual block encodes the vector of each node to obtain a new encoding; each node is encoded from bottom to top in turn using the attention mechanism; and finally the encoding of the root node is used for classification. Because several sigmoid classifiers are used, several data structure labels can be predicted: if one classifier predicts that the probability of a certain label is greater than 50%, the code belongs to that category. A threshold may also be set, for example requiring a prediction probability higher than 70% before the code is considered to belong to the category.
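A small sketch of this thresholding rule (the label names and probabilities are invented for the example):

```python
def predict_labels(probabilities, label_names, threshold=0.5):
    """Keep every label whose predicted probability exceeds the threshold;
    a stricter threshold such as 0.7 can be used instead, as the text notes."""
    return [name for name, p in zip(label_names, probabilities) if p > threshold]

print(predict_labels([0.83, 0.12, 0.56], ["tree", "linked list", "queue"]))
# -> ['tree', 'queue']
```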
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a diagram of the abstract syntax tree of the a = b + c code.
Fig. 3 is a diagram of abstract syntax tree coding.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views that illustrate only the basic structure of the invention, and thus show only the parts related to the invention.
FIG. 1 shows a flow diagram of the method for automatically tagging code with data structure labels, comprising:
firstly, crawler technology is used to collect a large number of codes with data structure labels from blogs, forums, and other websites;
secondly, a lexical analyzer performs lexical analysis on the code, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; the lexical analyzer does not replace the keywords of the language;
thirdly, a syntax analyzer is used to parse the code and convert it into an abstract syntax tree;
fourthly, each node in the abstract syntax tree is encoded with a residual block, so that every node obtains a new residual-block encoding; each node is then encoded from bottom to top in sequence on the tree using an attention mechanism, fusing the information of all sub-nodes of each node with the current node, layer by layer, until the root node of the tree is encoded;
and fifthly, training the model by using a large number of label data structure codes to train the integral model, firstly, performing lexical analysis on the codes by using a lexical analyzer, replacing numbers such as 1, 1.1 and the like with Num, replacing all variable names with Name, replacing all character strings with Str, converting the codes after the lexical analysis into an abstract syntax tree by using a syntax analyzer, embedding each node in the abstract syntax tree, namely finding the corresponding real-dimensional vector of the node, and performing nonlinear transformation on the embedded codes of each node by using a residual block to obtain new semantic codes. The following formula:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node.
V_c = A·H^T
A=softmax(score(Q,H))
Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e'' is the expression obtained by fusing the vector e' of the current node with the attention vector V_c.
Finally, the encoding of the root node is used for classification; because the code may belong to several categories, several sigmoid classifiers are used to obtain multiple data structure labels.
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
where e'_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Passing the root-node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value. Each parameter is then updated through back-propagation of the gradient, achieving the training effect.
sixthly, new code is predicted using the trained model: given a new piece of code, a lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. A syntax analyzer converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree, such as the Num and Name nodes, is embedded, i.e., mapped to its corresponding real-valued vector; a residual block encodes the vector of each node to obtain a new encoding; each node is encoded from bottom to top in turn using the attention mechanism; and finally the encoding of the root node is used for classification. Because several sigmoid classifiers are used, several data structure labels can be predicted: if one classifier predicts that the probability of a certain label is greater than 50%, the code belongs to that category, or a threshold may be set, for example requiring a prediction probability higher than 70%, before the code is considered to belong to the category.
Fig. 2 shows a schematic diagram of the abstract syntax tree of the a = b + c code, which includes the node names Module, Assign, Name, Store, BinOp, Load, and Add. These are introduced in turn. Module is the root node, the start of all code. Assign is the assignment symbol, specifically the = in the code a = b + c. Name is the abstract name of a variable; which specific variable is not indicated, but from the code the Name nodes correspond to the variables a, b, and c. Store is the storage symbol: the value computed by b + c is assigned to a and stored in memory. BinOp is a binary operation, such as addition, subtraction, multiplication, or division. Load is the loading symbol, which loads the value of a variable. Add is the addition symbol, which adds the values of two variables.
In fig. 2, the abstract syntax tree of the a = b + c code is built as follows. The Module root node comes first; under it hang however many lines of code there are, and here there is only one line. Since a = b + c mainly performs an assignment operation, an Assign node sits under the Module root node. The Assign node has a left sub-tree and a right sub-tree: the left sub-tree represents the variable being assigned to, i.e., the variable on the left of the equals sign in a = b + c, and the right sub-tree represents the expression on the right of the equals sign. The left sub-node of the Assign node is a Name node, and the sub-node under this Name is a Store node, which represents assigning the value of the right side of the equation to the left side and storing it in memory.
The right sub-node of the Assign node is the BinOp binary operation symbol, which indicates that an addition, subtraction, multiplication, or division follows. Under the BinOp symbol is an Add addition symbol, indicating that the right side of the equals sign is an addition. The Name variable symbol on the left of the Add symbol is the variable b in the code a = b + c; the sub-node below it is a Load symbol, indicating that the value in the variable b is needed for the computation. The Name variable symbol on the right of the Add symbol is the variable c in the code a = b + c; below it is likewise a Load symbol, indicating that the value in the variable c is needed for the computation.
In the abstract syntax tree, Module is the root node and Assign is the assignment node. The BinOp binary operation symbol on the right is evaluated first: one Name variable symbol represents the variable b, whose value is taken out by Load, and another Name variable symbol represents the variable c, whose value is taken out by Load. The Add addition symbol then adds the value of b and the value of c. After the addition, the Assign symbol assigns the computed value to the Name variable symbol on the left, specifically the variable a in the code a = b + c; the Store symbol below it finally stores the value assigned to a in memory.
Fig. 3 shows a schematic diagram of encoding the abstract syntax tree, described from bottom to top. The lowermost leaf nodes are Name Embedding, Add Embedding, and Name Embedding, where Embedding means the node has been embedded, i.e., converted into a real-valued vector. These three nodes are then further encoded with a residual block to obtain their semantic encodings. Next, embedding is performed on the BinOp binary operation node, converting it into its corresponding real-valued vector, and the BinOp node is then encoded with the residual block, as in the following equation:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
After this encoding, the attention mechanism is used to compute the semantic expression of the sub-nodes under the current BinOp node that is most relevant to the current BinOp node, as in the following equations:
V_c = A·H^T
A=softmax(score(Q,H))
Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The semantic encoding vectors of the sub-nodes under the BinOp node obtained by the attention mechanism are then fused, via the residual blocks, with the encoding vector of the BinOp node to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e'' is the expression obtained by fusing the vector e' of the current node with the attention vector V_c.
The encoding vector of the BinOp binary operation node has now been obtained. Next the left sub-tree of the Assign node is encoded. The left sub-tree has only the Name node: embedding is performed on this Name variable node to obtain its vector, and, as above, a residual block encodes the Name embedding to obtain a new semantic expression. At this point all sub-nodes of the Assign node have encoding vectors. Embedding is then performed on the Assign node itself, converting it into a real-valued vector, and the residual block re-encodes the Assign embedding vector to obtain a new semantic expression. The attention mechanism then computes the semantic expression of the sub-nodes under the current Assign node that is most relevant to the current Assign node, and the semantic encoding vectors of the sub-nodes under the Assign node obtained by the attention mechanism are fused, via the residual blocks, with the encoding vector of the Assign node to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes.
At the final stage of encoding, the Module root node is embedded, converting it into its corresponding real-valued vector, and the residual block re-encodes the Module embedding to obtain a new semantic encoding. The attention mechanism then computes the semantic expression of the sub-nodes under the current Module node that is most relevant to the current Module node, and the semantic encoding vectors of the sub-nodes under the Module node obtained by the attention mechanism are fused, via the residual blocks, with the encoding vector of the Module node to form the new vector expression of the current node. The semantic encoding vector of the Module node, which also represents the semantic encoding of the whole code, has now been obtained; it is fed into several sigmoid functions, each of which judges whether the code implements a certain data structure.
The technical scheme of the invention is described in detail in the following with the accompanying drawings:
As shown in fig. 1, the main process of the invention is as follows:
Step 1: collect a large number of codes annotated with data structure labels from web pages using crawler technology.
Step 2: because different languages have different grammars, different lexical analyzers are required for different languages. The lexical analyzer replaces the different types of tokens in the code with corresponding words: it replaces numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str, and it does not replace the keywords of the language.
Step 3: a corresponding syntax analyzer is used for each language; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree. As shown in fig. 2, the a = b + c code is converted into an abstract syntax tree using the Python ast toolkit.
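For reference, a minimal check with the standard Python ast module, which produces exactly the node types of fig. 2 (output reformatted for readability):

```python
import ast

tree = ast.parse("a = b + c")
print(ast.dump(tree))
# Module(body=[Assign(targets=[Name(id='a', ctx=Store())],
#        value=BinOp(left=Name(id='b', ctx=Load()), op=Add(),
#                    right=Name(id='c', ctx=Load())))], type_ignores=[])
```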
Step 4: perform word embedding on the words generated by lexical and syntactic analysis, such as Num, Name, the root node Module, and the assignment operation Assign.
Step 5: apply the same residual block Reb to the embedded encoding of each node as a nonlinear transformation, obtaining a new semantic encoding.
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
Step 6: encode the non-leaf nodes on the tree from bottom to top; an attention mechanism is used to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node.
V_c = A·H^T
A=softmax(score(Q,H))
Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e'' is the expression obtained by fusing the vector e' of the current node with the attention vector V_c.
Step 7: following the above formulas, compute the expression of each node on the tree from bottom to top, and finally use the expression of the root node for classification. Because a piece of code may belong to several categories, several sigmoid classifiers are used to obtain multiple data structure labels.
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
where e'_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Step 8: train the model. The whole model is trained using a large number of codes labeled with data structures, so that the accuracy of the model's judgment on any code reaches more than 50 percent; one pass of training over all the codes with data structure labels constitutes one epoch. The flow of training the model on one piece of code is as follows: given a piece of code, a lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; a syntax analyzer converts the lexically analyzed code into an abstract syntax tree; each node in the abstract syntax tree is embedded, i.e., mapped to its corresponding real-valued vector; and a residual block applies a nonlinear transformation to each node's embedded encoding to obtain a new semantic encoding, as in the following formula:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node.
V_c = A·H^T
A=softmax(score(Q,H))
Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, and the score function computes the similarity between the expression of the current node and the expression of each sub-node: the higher the similarity, the higher the probability after softmax. The score function can compute this similarity in three ways, and V_c is the attention expression.
The attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
where e'' is the expression obtained by fusing the vector e' of the current node with the attention vector V_c.
Finally, the encoding of the root node is used for classification. Passing the root-node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value, and each parameter is updated through back-propagation of the gradient, achieving the training effect.
Step 9: predict new code using the trained model. Given a new piece of code, a lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. A syntax analyzer converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree, such as Num and Name, is embedded, i.e., mapped to its corresponding real-valued vector; a residual block encodes the vector of each node to obtain a new encoding; each node is encoded from bottom to top in turn using the attention mechanism; and finally the encoding of the root node is used for classification. Because several sigmoid classifiers are used, several data structure labels can be predicted: if one classifier predicts that the probability of a certain label is greater than 50%, the code belongs to that category. A threshold may also be set, for example requiring a prediction probability higher than 70% before the code is considered to belong to the category.
The method for automatically tagging code with data structure labels provided by the embodiments of the invention has been described in detail above. The principle and implementation of the invention have been explained herein; the description of the embodiments is only intended to assist in understanding the method of the invention and its core idea.
Claims (9)
1. A method for automatically tagging code with data structure labels, the method comprising:
collecting a plurality of codes marked with data structures;
converting the code into an abstract syntax tree by using a lexical analyzer and a syntax analyzer;
encoding the nodes on the tree using an attention mechanism and residual blocks, and labeling the code using the encodings;
training a model and predicting new codes by using the trained model, wherein the training model comprises:
training the whole model using a large number of codes labeled with data structures, wherein firstly a lexical analyzer is used to perform lexical analysis on the code;
converting the codes after lexical analysis into an abstract syntax tree by using a syntax analyzer;
embedding each node in the abstract syntax tree, namely mapping the node to its corresponding real-valued vector;
carrying out a nonlinear transformation on the embedded encoding of each node using a residual block to obtain a new semantic encoding, as in the following formula:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block;
encoding non-leaf nodes on the tree from bottom to top, and using an attention mechanism to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node;
V_c = A·H^T
A=softmax(score(Q,H))
wherein Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, the score function computes the similarity between the expression of the current node and the expression of each sub-node, the higher the similarity the higher the probability after softmax, the score function can compute this similarity in three ways, and V_c is the attention expression; the attention vector is then fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes; as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
wherein e' is the vector encoding of the current node, V_c is the attention vector, Reb is a residual block, b is a bias value, ReLU is the ReLU activation function, and e'' is the vector encoding obtained by fusing the residual-block encoding of the current node vector e' with the residual-block encoding of the attention vector V_c;
finally, using the encoding of the root node for classification, wherein, because the code may belong to several categories, several sigmoid classifiers are used to obtain multiple data structure labels;
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
wherein e'_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function;
the encoding of the root node, passed through the sigmoid function, yields a predicted probability that differs from the true probability, generating a loss value, and each parameter is updated through back-propagation of the gradient, so that the training effect is achieved.
2. The method of automatically tagging code with data structures according to claim 1, wherein said collecting a plurality of data structure tagged codes comprises:
collecting hundreds of thousands of codes from the web through crawler technology, each labeled with its corresponding data structures, the data structures comprising trees, linked lists, and queues.
3. The method for automatically tagging code with data structure labels according to claim 1, wherein said using a lexical analyzer comprises: because different languages have different grammars, using a different lexical analyzer for each language, the lexical analyzer replacing the different types of tokens in the code with corresponding words.
4. The method for automatically tagging code with data structure labels according to claim 1, wherein said using a syntax analyzer comprises: using a corresponding syntax analyzer for each language, the syntax analyzer converting the lexically analyzed code into an abstract syntax tree.
5. The method for automatically tagging a data structure to a code according to claim 1, wherein said encoding a node on a tree using an attention mechanism and a residual block comprises:
performing word embedding on words generated after lexical analysis and syntactic analysis to convert the words into real-valued vectors;
all nodes are encoded on the tree using attention mechanisms and residual blocks.
6. The method for automatically tagging code with data structure labels according to claim 1, wherein said labeling the code using the encodings comprises:
classifying using the encoding of the root node, wherein, because the code may belong to several categories, several sigmoid classifiers are used to obtain the labels of multiple data structures;
y_i = sigmoid(W_2·ReLU(W_1·e'_r) + b)
wherein e'_r is the semantic expression of the root node, W_1, W_2 and b are parameters to be learned, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
7. The method for automatically tagging data structures in code according to claim 1, wherein said predicting new code using a trained model comprises:
predicting new code using the trained model: given a new piece of code, performing lexical analysis on it with a lexical analyzer, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; converting the lexically analyzed code into an abstract syntax tree using a syntax analyzer; embedding each node in the abstract syntax tree, such as the Num and Name nodes, namely mapping the node to its corresponding real-valued vector; encoding the vector of each node with a residual block to obtain a new encoding; encoding each node from bottom to top in turn using the attention mechanism; and finally classifying using the encoding of the root node, wherein, because several sigmoid classifiers are used, several data structure labels can be predicted: if one classifier predicts that the probability of a certain label is greater than 50%, the code belongs to that category, or a threshold is set, for example a prediction probability higher than 70%, before the code is considered to belong to the category.
8. The method for automatically tagging a code with a data structure according to claim 5, wherein said using residual block coding on a tree comprises:
carrying out a nonlinear transformation on the embedded encoding of each node using the same residual block Reb to obtain the node's new semantic encoding, as in the following formula:
e' = Reb_q(e) = LN(W_2·ReLU(W_1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^{embedding_size}, embedding_size is the dimension of each node embedding, W_1 ∈ R^{d_i × embedding_size}, W_2 ∈ R^{embedding_size × d_i}, d_i is a hyper-parameter, ReLU is the ReLU activation function, LN is layer normalization, and Reb is the residual block.
9. The method for automatically tagging code with a data structure according to claim 5, wherein said encoding on a tree using an attention mechanism comprises:
encoding non-leaf nodes on the tree from bottom to top, and using an attention mechanism to compute the semantic expression of the sub-nodes under the current node that is most relevant to the current node;
V_c = A·H^T
A=softmax(score(Q,H))
wherein Q is a matrix formed by stacking n copies of the current node's vector after the residual-block transformation, H is a matrix formed by stacking the vectors of the n sub-nodes under the current node after the residual-block transformation, the score function computes the similarity between the expression of the current node and the expression of each sub-node, the higher the similarity the higher the probability after softmax, the score function can compute this similarity in three ways, and V_c is the attention expression; the attention vector is fused with the current node vector to form the new vector expression of the current node, which contains both the semantic expression of the current node itself and the semantic expression of all sub-nodes, as in the following equation:
e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)
wherein e'' is the expression obtained by fusing the vector of the current node with the attention vector V_c, ReLU is the ReLU activation function, and Reb is a residual block.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910304797.7A CN110008344B (en) | 2019-04-16 | 2019-04-16 | Method for automatically marking data structure label on code |
CN202011019000.8A CN112148879B (en) | 2019-04-16 | 2019-04-16 | Computer readable storage medium for automatically labeling code with data structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910304797.7A CN110008344B (en) | 2019-04-16 | 2019-04-16 | Method for automatically marking data structure label on code |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011019000.8A Division CN112148879B (en) | 2019-04-16 | 2019-04-16 | Computer readable storage medium for automatically labeling code with data structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110008344A CN110008344A (en) | 2019-07-12 |
CN110008344B true CN110008344B (en) | 2020-09-29 |
Family
ID=67172257
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304797.7A Active CN110008344B (en) | 2019-04-16 | 2019-04-16 | Method for automatically marking data structure label on code |
CN202011019000.8A Active CN112148879B (en) | 2019-04-16 | 2019-04-16 | Computer readable storage medium for automatically labeling code with data structure |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011019000.8A Active CN112148879B (en) | 2019-04-16 | 2019-04-16 | Computer readable storage medium for automatically labeling code with data structure |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110008344B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113139054B (en) * | 2021-04-21 | 2023-11-24 | 南通大学 | Code programming language classification method based on Transformer |
CN116661805B (en) * | 2023-07-31 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Code representation generation method and device, storage medium and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063488A (en) * | 2010-12-29 | 2011-05-18 | 南京航空航天大学 | Code searching method based on semantics |
CN102339252A (en) * | 2011-07-25 | 2012-02-01 | 大连理工大学 | Static state detecting system based on XML (Extensive Makeup Language) middle model and defect mode matching |
US10169208B1 (en) * | 2014-11-03 | 2019-01-01 | Charles W Moyes | Similarity scoring of programs |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7590644B2 (en) * | 1999-12-21 | 2009-09-15 | International Business Machine Corporation | Method and apparatus of streaming data transformation using code generator and translator |
US20040068716A1 (en) * | 2002-10-04 | 2004-04-08 | Quicksilver Technology, Inc. | Retargetable compiler for multiple and different hardware platforms |
KR101044870B1 (en) * | 2008-10-02 | 2011-06-28 | 한국전자통신연구원 | Method and Apparatus for Encoding and Decoding XML Documents Using Path Code |
CN101614787B (en) * | 2009-07-07 | 2011-05-18 | 南京航空航天大学 | Analogical electronic circuit fault diagnostic method based on M-ary-structure classifier |
CN103064969A (en) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | Method for automatically creating keyword index table |
US20160124723A1 (en) * | 2014-10-31 | 2016-05-05 | Weixi Ma | Graphically building abstract syntax trees |
US10565318B2 (en) * | 2017-04-14 | 2020-02-18 | Salesforce.Com, Inc. | Neural machine translation with latent tree attention |
CN107220180B (en) * | 2017-06-08 | 2020-08-04 | 电子科技大学 | Code classification method based on neural network language model |
CN108399158B (en) * | 2018-02-05 | 2021-05-14 | 华南理工大学 | Attribute emotion classification method based on dependency tree and attention mechanism |
CN108446540B (en) * | 2018-03-19 | 2022-02-25 | 中山大学 | Program code plagiarism type detection method and system based on source code multi-label graph neural network |
CN108829823A (en) * | 2018-06-13 | 2018-11-16 | 北京信息科技大学 | A kind of file classification method |
CN109033069B (en) * | 2018-06-16 | 2022-05-17 | 天津大学 | Microblog theme mining method based on social media user dynamic behaviors |
CN109241834A (en) * | 2018-07-27 | 2019-01-18 | 中山大学 | A kind of group behavior recognition methods of the insertion based on hidden variable |
CN110188104A (en) * | 2019-05-30 | 2019-08-30 | 中森云链(成都)科技有限责任公司 | A kind of Python program code method for fast searching towards K12 programming |
2019
- 2019-04-16: CN application CN201910304797.7A granted as CN110008344B (active)
- 2019-04-16: CN application CN202011019000.8A granted as CN112148879B (active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063488A (en) * | 2010-12-29 | 2011-05-18 | 南京航空航天大学 | Code searching method based on semantics |
CN102339252A (en) * | 2011-07-25 | 2012-02-01 | 大连理工大学 | Static state detecting system based on XML (Extensive Makeup Language) middle model and defect mode matching |
US10169208B1 (en) * | 2014-11-03 | 2019-01-01 | Charles W Moyes | Similarity scoring of programs |
Non-Patent Citations (2)
Title |
---|
An AST-based Code Plagiarism Detection Algorithm; Jingling Zhao et al.; 2015 10th International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA); 2016-03-03; pp. 178-182 *
Optimized construction of syntax trees for logical functions (优化构建逻辑函数的语法树); Zhang Mingxin (张明新); Science & Technology Vision (科技视界); 2018-06-19; pp. 78-80 *
Also Published As
Publication number | Publication date |
---|---|
CN112148879B (en) | 2023-06-23 |
CN112148879A (en) | 2020-12-29 |
CN110008344A (en) | 2019-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111160008B (en) | Entity relationship joint extraction method and system | |
CN107330032B (en) | Implicit discourse relation analysis method based on recurrent neural network | |
CN111694924A (en) | Event extraction method and system | |
CN109960728B (en) | Method and system for identifying named entities of open domain conference information | |
CN111897908A (en) | Event extraction method and system fusing dependency information and pre-training language model | |
CN113822026B (en) | Multi-label entity labeling method | |
CN113254675B (en) | Knowledge graph construction method based on self-adaptive few-sample relation extraction | |
CN116661805B (en) | Code representation generation method and device, storage medium and electronic equipment | |
CN113868432A (en) | Automatic knowledge graph construction method and system for iron and steel manufacturing enterprises | |
CN114911945A (en) | Knowledge graph-based multi-value chain data management auxiliary decision model construction method | |
CN117151222B (en) | Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium | |
CN110008344B (en) | Method for automatically marking data structure label on code | |
CN110781271A (en) | Semi-supervised network representation learning model based on hierarchical attention mechanism | |
CN109614612A (en) | A kind of Chinese text error correction method based on seq2seq+attention | |
CN114492460B (en) | Event causal relationship extraction method based on derivative prompt learning | |
CN111340006B (en) | Sign language recognition method and system | |
CN116245097A (en) | Method for training entity recognition model, entity recognition method and corresponding device | |
CN117033423A (en) | SQL generating method for injecting optimal mode item and historical interaction information | |
CN111597816A (en) | Self-attention named entity recognition method, device, equipment and storage medium | |
CN113361259B (en) | Service flow extraction method | |
CN117390189A (en) | Neutral text generation method based on pre-classifier | |
CN115408506B (en) | NL2SQL method combining semantic analysis and semantic component matching | |
CN116186241A (en) | Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium | |
CN115186670A (en) | Method and system for identifying domain named entities based on active learning | |
CN114154505A (en) | Named entity identification method for power planning review field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right |
Denomination of invention: A method of automatically labeling code with data structure. Effective date of registration: 2022-05-09. Granted publication date: 2020-09-29. Pledgee: Bank of Chengdu Co., Ltd., Science and Technology Branch. Pledgor: ZHONGSENYUNLIAN (CHENGDU) TECHNOLOGY Co., Ltd. Registration number: Y2022980005318
PE01 | Entry into force of the registration of the contract for pledge of patent right |