CN112148879A - Computer readable storage medium for automatically labeling code with data structure - Google Patents

Computer readable storage medium for automatically labeling code with data structure

Info

Publication number
CN112148879A
CN112148879A (application CN202011019000.8A)
Authority
CN
China
Prior art keywords
node
code
vector
nodes
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011019000.8A
Other languages
Chinese (zh)
Other versions
CN112148879B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongsenyunlian Chengdu Technology Co ltd
Original Assignee
Zhongsenyunlian Chengdu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongsenyunlian Chengdu Technology Co ltd filed Critical Zhongsenyunlian Chengdu Technology Co ltd
Priority to CN202011019000.8A priority Critical patent/CN112148879B/en
Publication of CN112148879A publication Critical patent/CN112148879A/en
Application granted granted Critical
Publication of CN112148879B publication Critical patent/CN112148879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a computer readable storage medium for automatically labeling code with data structure tags, belonging to the field of natural language processing under artificial intelligence. The method comprises the following steps: converting the code into an abstract syntax tree using a lexical analyzer and a syntax analyzer; modeling the abstract syntax tree, encoding each node on the tree from bottom to top using an attention mechanism and residual blocks to obtain an encoding of the whole tree; and finally labeling the code with data structure tags through classifiers in the model. The method can automatically label code with data structure tags, reducing the workload of manually labeling code.

Description

Computer readable storage medium for automatically labeling code with data structure
This application is a divisional application of application No. 201910304797.7, filed on April 16, 2019, entitled "A method for automatically labeling code with data structure tags".
Technical Field
The invention belongs to the field of natural language processing under artificial intelligence, and particularly relates to a computer readable storage medium for automatically labeling code with data structure tags.
Background
With the popularization of the internet, a large amount of high-quality code has appeared online, but much of it carries no data structure tags, which makes it inconvenient for users to query and learn from, and manually tagging massive amounts of code is unrealistic.
Disclosure of Invention
The invention provides a computer readable storage medium for automatically labeling code with data structure tags. A computer readable medium has a computer program stored thereon which, when executed by a processor, implements a method for automatically tagging code with data structures. The method uses a lexical analyzer and a syntax analyzer to convert the code into an abstract syntax tree, performs word embedding on each word, and sequentially encodes each node on the tree from bottom to top using residual blocks and an attention mechanism, finally obtaining the encoding of the root node, which contains both the syntactic and semantic expressions of all child nodes and the semantic expression of the root node itself. Classification is then performed using the expression of the root node; because a section of code may contain multiple data structures, multiple sigmoid classifiers are used to obtain multiple data structure tags.
The present invention is a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a method of automatically tagging code with a data structure, comprising the following steps:
step 1: code for a number of annotated data structures is collected from web pages using crawler technology.
Step 2: because different languages have different grammars, a different lexical analyzer is required for each language. The lexical analyzer replaces variables of different types in the code with corresponding words: numbers such as 1 and 1.1 are replaced with Num, all variable names are replaced with Name, and all character strings are replaced with Str. The lexical analyzer does not replace the keywords of the language.
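As an illustration, the following minimal Python sketch performs this normalization with the standard tokenize and keyword modules. Python serves only as an example language here, and the helper name normalize_code is illustrative; the patent requires a separate lexer per language.

    import io
    import keyword
    import tokenize

    def normalize_code(source: str) -> list:
        """Replace literals and identifiers with placeholder words,
        keeping language keywords unchanged (Python-only sketch)."""
        words = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NUMBER:      # numbers such as 1, 1.1 -> Num
                words.append("Num")
            elif tok.type == tokenize.STRING:    # string literals -> Str
                words.append("Str")
            elif tok.type == tokenize.NAME:      # keywords like `if` are kept as-is
                words.append(tok.string if keyword.iskeyword(tok.string) else "Name")
            elif tok.type == tokenize.OP:        # operators pass through
                words.append(tok.string)
        return words

    print(normalize_code("a = b + 1.1"))  # ['Name', '=', 'Name', '+', 'Num']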
Step 3: corresponding syntax analyzers are used for different languages; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree.
Step 4: word embedding is performed on the words generated by lexical and syntactic analysis, such as Num, Name, the root node Module, and the assignment operation Assign.
Step 5: the same residual block Reb is used to apply a nonlinear transformation to the embedded encoding of each node, obtaining a new semantic encoding.
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
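For concreteness, a minimal sketch of the residual block Reb defined by the formula above, assuming a PyTorch implementation (the patent specifies no framework, and the sizes here are illustrative):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """e' = Reb(e) = LN(W2 · ReLU(W1 · e) + e)."""
        def __init__(self, embedding_size: int, d_i: int):
            super().__init__()
            self.w1 = nn.Linear(embedding_size, d_i, bias=False)  # W1 ∈ R^(d_i×embedding_size)
            self.w2 = nn.Linear(d_i, embedding_size, bias=False)  # W2 ∈ R^(embedding_size×d_i)
            self.ln = nn.LayerNorm(embedding_size)                # LN, layer normalization

        def forward(self, e: torch.Tensor) -> torch.Tensor:
            return self.ln(self.w2(torch.relu(self.w1(e))) + e)

    reb = ResidualBlock(embedding_size=128, d_i=256)
    e_prime = reb(torch.randn(1, 128))  # new semantic encoding of one node, shape (1, 128)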
Step 6: non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node.
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e′ is the vector encoding of the current node, V_c is the attention vector, Reb_q and Reb_c are residual blocks, b is a bias value, ReLU is the ReLU activation function, and e″ is the vector encoding obtained by fusing the residual-block-encoded current node vector with the residual-block-encoded attention vector.
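A sketch of this attention-and-fusion step, reusing the ResidualBlock class from the sketch above and again assuming PyTorch. The dot product is used as the score function here; the patent allows three scoring modes but does not fix one, so this is only one possible choice:

    import torch
    import torch.nn as nn

    class NodeAttentionFusion(nn.Module):
        """Compute V_c = softmax(score(Q, H)) · H over the child encodings,
        then fuse: e'' = ReLU(Reb_q(e') + Reb_c(V_c) + b)."""
        def __init__(self, embedding_size: int, d_i: int):
            super().__init__()
            self.reb_q = ResidualBlock(embedding_size, d_i)
            self.reb_c = ResidualBlock(embedding_size, d_i)
            self.b = nn.Parameter(torch.zeros(embedding_size))

        def forward(self, e_parent: torch.Tensor, h_children: torch.Tensor) -> torch.Tensor:
            # e_parent: (embedding_size,) -- current node encoding e'
            # h_children: (n, embedding_size) -- residual-encoded child nodes H
            q = e_parent.unsqueeze(0)             # (1, embedding_size)
            scores = q @ h_children.T             # dot-product score(Q, H), shape (1, n)
            a = torch.softmax(scores, dim=-1)     # A = softmax(score(Q, H))
            v_c = a @ h_children                  # V_c, shape (1, embedding_size)
            e_new = torch.relu(self.reb_q(q) + self.reb_c(v_c) + self.b)
            return e_new.squeeze(0)               # e'', shape (embedding_size,)

    fuse = NodeAttentionFusion(embedding_size=128, d_i=256)
    e_new = fuse(torch.randn(128), torch.randn(3, 128))  # a parent and three children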
Step 7: the expression of each node is computed on the tree from bottom to top according to the above formulas, and finally the expression of the root node is used for classification. Because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain multiple data structure tags.
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, W1 and W2 are parameters, b is a bias value, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
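A sketch of the multi-label classification head, again assuming PyTorch; the tag set (tree, linked list, queue) follows claim 2, and the layer sizes are illustrative:

    import torch
    import torch.nn as nn

    class DataStructureClassifier(nn.Module):
        """One sigmoid output per data-structure tag over the root encoding:
        y = sigmoid(W2 · ReLU(W1 · e'_r) + b)."""
        def __init__(self, embedding_size: int, hidden: int, num_tags: int):
            super().__init__()
            self.w1 = nn.Linear(embedding_size, hidden)
            self.w2 = nn.Linear(hidden, num_tags)  # the bias b lives inside this layer

        def forward(self, e_root: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.w2(torch.relu(self.w1(e_root))))

    clf = DataStructureClassifier(embedding_size=128, hidden=64, num_tags=3)
    probs = clf(torch.randn(128))  # e.g. P(tree), P(linked list), P(queue)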
Step 8: train the model. The whole model is trained using a large amount of code labeled with data structures. First, the lexical analyzer performs lexical analysis on the code, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. The syntax analyzer then converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree is embedded, i.e., mapped to its corresponding real-valued vector, and a residual block applies a nonlinear transformation to the embedded encoding of each node to obtain a new semantic encoding, as in the following formula:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node.
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c.
Finally, the encoding of the root node is used for classification; because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain multiple data structure tags.
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Passing the root node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value. Each parameter is then updated through reverse gradient propagation, thereby training the model.
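A sketch of one training update, continuing the classes sketched above; the patent names no particular loss or optimizer, so binary cross-entropy and Adam are stand-ins:

    import torch
    import torch.nn as nn

    reb = ResidualBlock(embedding_size=128, d_i=256)
    clf = DataStructureClassifier(embedding_size=128, hidden=64, num_tags=3)
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(
        list(reb.parameters()) + list(clf.parameters()), lr=1e-3)

    e_root = reb(torch.randn(1, 128)).squeeze(0)  # stand-in for the root encoding e'_r
    probs = clf(e_root)                           # predicted tag probabilities
    target = torch.tensor([1.0, 0.0, 1.0])        # true multi-hot tags, e.g. tree + queue
    loss = criterion(probs, target)               # predicted vs. true probability -> loss
    loss.backward()                               # reverse gradient propagation
    optimizer.step()                              # update each parameter
    optimizer.zero_grad()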
Step 9: new code is predicted using the trained model. Given a section of new code, the lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. The syntax analyzer converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree, such as Num and Name, is embedded, i.e., mapped to its corresponding real-valued vector. A residual block encodes the vector of each node to obtain a new encoding, the attention mechanism encodes each node sequentially from bottom to top, and finally the encoding of the root node is used for classification. Because multiple sigmoid classifiers are used, multiple data structures can be predicted: if a classifier predicts the probability of a tag to be greater than 50%, the code belongs to that category; alternatively, a stricter threshold may be set, e.g., requiring a prediction probability above 70% before the code is considered to belong to the category.
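The thresholding rule at the end of step 9 can be sketched as follows; the tag names and probabilities are illustrative:

    TAGS = ["tree", "linked list", "queue"]  # illustrative tag set, per claim 2

    def predict_tags(probs, threshold=0.5):
        """Keep every tag whose classifier probability exceeds the threshold;
        0.5 by default, or a stricter value such as 0.7."""
        return [tag for tag, p in zip(TAGS, probs) if p > threshold]

    print(predict_tags([0.92, 0.31, 0.64]))                 # ['tree', 'queue']
    print(predict_tags([0.92, 0.31, 0.64], threshold=0.7))  # ['tree']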
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a diagram of the abstract syntax tree of the code a = b + c.
Fig. 3 is a diagram of abstract syntax tree coding.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views illustrating only the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.
FIG. 1 shows a flow diagram of a method for automatically tagging code with data structures, comprising:
firstly, a crawler technology is used for collecting a large number of codes with data structures from various blogs, forums and other networks;
secondly, performing lexical analysis on the codes by using a lexical analyzer, replacing numbers such as 1, 1.1 and the like with Num, replacing all variable names with Name, and replacing all character strings with Str, wherein the lexical analyzer does not replace keywords corresponding to the language;
thirdly, using a syntax analyzer to analyze the syntax of the code and converting the code into an abstract syntax tree;
fourthly, encoding each node in the abstract syntax tree using a residual block, so that each node obtains a new residual-block encoding; then using an attention mechanism to encode each node sequentially from bottom to top on the tree, fusing the information of all the child nodes of each node with the current node, layer by layer, until the root node of the tree is encoded;
and fifthly, training the model: the whole model is trained using a large amount of code labeled with data structures. First, the lexical analyzer performs lexical analysis on the code, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; the syntax analyzer then converts the lexically analyzed code into an abstract syntax tree; each node in the abstract syntax tree is embedded, i.e., mapped to its corresponding real-valued vector; and a residual block applies a nonlinear transformation to the embedded encoding of each node to obtain a new semantic encoding, as in the following formula:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node.
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c.
Finally, the encoding of the root node is used for classification; because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain multiple data structure tags.
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Passing the root node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value. Each parameter is then updated through reverse gradient propagation, thereby training the model.
Sixthly, new code is predicted using the trained model: given a section of new code, the lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree; each node in the abstract syntax tree, such as Num and Name, is embedded, i.e., mapped to its corresponding real-valued vector; a residual block encodes the vector of each node to obtain a new encoding; the attention mechanism encodes each node sequentially from bottom to top; and finally the encoding of the root node is used for classification. Because multiple sigmoid classifiers are used, multiple data structures can be predicted: if a classifier predicts the probability of a tag to be greater than 50%, the code belongs to that category; alternatively, a stricter threshold may be set, e.g., requiring a prediction probability above 70% before the code is considered to belong to the category.
Fig. 2 shows a schematic diagram of the abstract syntax tree of the code a = b + c, which includes the node names Module, Assign, Name, Store, BinOp, Load, and Add. These are introduced in turn: Module is the root node, the start of all code; Assign is the assignment symbol, specifically the = in the code a = b + c; Name is an abstract name for variable names (which variable is not specified, but from the code the variables are a, b, and c); Store is the storage symbol, indicating that the value computed by b + c is assigned to a and stored in memory; BinOp is a binary operation, such as addition, subtraction, multiplication, or division; Load is the loading symbol, which loads the value of a variable; and Add is the addition symbol, which adds the values of two variables.
In fig. 2, the abstract syntax tree of the code a = b + c is built as follows. A Module root node is given first; under it there are as many children as there are lines of code, and here there is only one line. Since a = b + c mainly performs an assignment operation, an Assign node is placed under the Module root node. The Assign node has a left subtree and a right subtree: the left subtree represents the variable being assigned, i.e., the variable on the left side of the equal sign in a = b + c, and the right subtree represents the expression on the right side of the equal sign. The left child of the Assign node is a Name node, and the child under that Name node is a Store node, which represents assigning the value of the right side of the equation to the left side and storing it in memory.
The right child of the Assign node is a BinOp binary operation symbol, indicating that an addition, subtraction, multiplication, or division may follow. Below the BinOp symbol is an Add addition symbol, indicating that the right side of the equal sign is an addition operation. The Name variable symbol on the left of the Add symbol is the variable b in the code a = b + c; the child node below it is a Load symbol, indicating that the value in the variable b is needed for the calculation. The Name variable symbol on the right of the Add symbol is the variable c in the code a = b + c; below it is a Load symbol, indicating that the value in the variable c is needed for the calculation.
In the abstract syntax tree, Module is the root node and Assign is the assignment node. The BinOp binary operation on the right is computed first: one Name variable symbol represents the variable b, whose value is taken out by Load, and another Name variable symbol represents the variable c, whose value is taken out by Load. The Add addition symbol then adds the value of b and the value of c. After the addition, the Assign symbol assigns the computed value to the Name variable symbol on the left, specifically the variable a in the code a = b + c; the Store symbol stores the value assigned to a, and finally that value is kept in memory.
Fig. 3 shows a schematic diagram of encoding the abstract syntax tree, described from bottom to top. The lowermost leaf nodes are Name Embedding, Add Embedding, and Name Embedding, where Embedding indicates that the node has been embedded, i.e., converted into a real-valued vector. These three nodes are then further encoded using a residual block to obtain their semantic encodings. Next, the BinOp binary operation node is embedded, converting it into its corresponding real-valued vector, and the BinOp node is then encoded using the residual block, as in the following equation:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
After encoding, an attention mechanism is used to compute, from all the child nodes under the current BinOp node, the semantic expression most relevant to the current BinOp node, as in the following equations:
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The semantic encoding vectors of all the child nodes under the BinOp node, obtained by the attention mechanism, are then fused with the residual-block encoding vector of the BinOp node to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c.
The encoding vector of the BinOp binary operation node has now been obtained. Next, the left subtree of the Assign node is encoded. The left subtree has only a Name node: the Name variable node is embedded to obtain its encoding vector and, as above, a residual block encodes the Name embedding to obtain a new semantic expression. All child nodes of the Assign node now have encoding vectors. The Assign node itself is then embedded, converting it into a real-valued vector, and the residual block re-encodes the Assign embedding vector to obtain a new semantic expression. The attention mechanism then computes, from all the child nodes under the current Assign node, the semantic expression most relevant to the current Assign node. Finally, the semantic encoding vectors of all the child nodes under the Assign node, obtained by the attention mechanism, are fused with the residual-block encoding vector of the Assign node to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes.
In the final stage of encoding, the Module root node is embedded, converting it into its corresponding real-valued vector, and a residual block re-encodes the Module embedding to obtain a new semantic encoding. The attention mechanism then computes, from all the child nodes under the current Module node, the semantic expression most relevant to the current Module node. The semantic encoding vectors of all the child nodes under the Module node, obtained by the attention mechanism, are then fused with the residual-block encoding vector of the Module node to form a new vector expression of the current node. The semantic encoding vector of the Module node, which represents the semantic encoding of the entire code, is now obtained; it is input into multiple sigmoid functions, each of which judges whether the code contains a certain data structure.
The technical scheme of the invention is described in detail in the following with the accompanying drawings:
as shown in fig. 1, the main process of the present invention is:
step 1: code for a number of annotated data structures is collected from web pages using crawler technology.
Step 2: because different languages have different grammars, a different lexical analyzer is required for each language. The lexical analyzer replaces variables of different types in the code with corresponding words: numbers such as 1 and 1.1 are replaced with Num, all variable names are replaced with Name, and all character strings are replaced with Str. The lexical analyzer does not replace the keywords of the language.
Step 3: corresponding syntax analyzers are used for different languages; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree. As shown in fig. 2, the code a = b + c is converted into an abstract syntax tree using the python ast toolkit.
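The tree in fig. 2 can be reproduced directly with the python ast toolkit mentioned above (the indent argument assumes Python 3.9 or later; output abbreviated):

    import ast

    # Parse "a = b + c" into the tree of fig. 2:
    # Module -> Assign -> (Name/Store on the left, BinOp/Add with two Name/Load on the right)
    tree = ast.parse("a = b + c")
    print(ast.dump(tree, indent=2))
    # Module(
    #   body=[
    #     Assign(
    #       targets=[Name(id='a', ctx=Store())],
    #       value=BinOp(
    #         left=Name(id='b', ctx=Load()),
    #         op=Add(),
    #         right=Name(id='c', ctx=Load())))],
    #   type_ignores=[])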
Step 4: word embedding is performed on the words generated by lexical and syntactic analysis, such as Num, Name, the root node Module, and the assignment operation Assign.
Step 5: the same residual block Reb is used to apply a nonlinear transformation to the embedded encoding of each node, obtaining a new semantic encoding.
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
Step 6: non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node.
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c.
Step 7: the expression of each node is computed on the tree from bottom to top according to the above formulas, and finally the expression of the root node is used for classification. Because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain multiple data structure tags.
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
Step 8: train the model. The whole model is trained using a large amount of code labeled with data structures, so that the accuracy of the model on any code exceeds 50%; training once on all of the labeled code constitutes one epoch. The flow for training the model on one section of code is as follows: given the code, the lexical analyzer performs lexical analysis, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; the syntax analyzer converts the lexically analyzed code into an abstract syntax tree; each node in the abstract syntax tree is embedded, i.e., mapped to its corresponding real-valued vector; and a residual block applies a nonlinear transformation to the embedded encoding of each node to obtain a new semantic encoding, as in the following formula:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
Non-leaf nodes are encoded on the tree from bottom to top, and an attention mechanism is used to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node.
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, and H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation. The score function computes the similarity between the current node expression and each child node expression; the higher the similarity, the higher the probability after softmax. The score function can compute the similarity between the current node and the child nodes in three ways. V_c is the attention expression.
The attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c.
Finally, the encoding of the root node is used for classification. Passing the root node encoding through the sigmoid function yields predicted probabilities that differ from the true probabilities, producing a loss value; each parameter is then updated through reverse gradient propagation, thereby training the model.
Step 9: new code is predicted using the trained model. Given a section of new code, the lexical analyzer performs lexical analysis on it, replacing numbers such as 1 and 1.1 with Num, all variable names with Name, and all character strings with Str. The syntax analyzer converts the lexically analyzed code into an abstract syntax tree. Each node in the abstract syntax tree, such as Num and Name, is embedded, i.e., mapped to its corresponding real-valued vector. A residual block encodes the vector of each node to obtain a new encoding, the attention mechanism encodes each node sequentially from bottom to top, and finally the encoding of the root node is used for classification. Because multiple sigmoid classifiers are used, multiple data structures can be predicted: if a classifier predicts the probability of a tag to be greater than 50%, the code belongs to that category; alternatively, a stricter threshold may be set, e.g., requiring a prediction probability above 70% before the code is considered to belong to the category.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to implement the method of automatically labeling code with a data structure of the present invention.
The method for automatically labeling code with data structures provided by the embodiments of the invention has been described in detail above. The principle and implementation of the invention are explained herein, and the description of the embodiments is only intended to help understand the method and its core idea.

Claims (9)

1. A computer-readable medium, on which a computer program is stored, which program, when executed by a processor, implements a method of automatically tagging code with a data structure, comprising:
collecting a plurality of codes marked with data structures;
converting the code into an abstract syntax tree by using a lexical analyzer and a syntax analyzer;
encoding the nodes on the tree using an attention mechanism and a residual block, and labeling the code using the encoding;
training a model and predicting new codes by using the trained model, wherein the training model comprises:
training the whole model using a large amount of code labeled with data structures, and first performing lexical analysis on the code using a lexical analyzer;
converting the codes after lexical analysis into an abstract syntax tree by using a syntax analyzer;
embedding each node in the abstract syntax tree, i.e., mapping each node to its corresponding real-valued vector;
applying a nonlinear transformation to the embedded encoding of each node using a residual block to obtain a new semantic encoding, as in the following formula:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block;
encoding non-leaf nodes on the tree from bottom to top, and using an attention mechanism to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node;
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation, the score function computes the similarity between the current node expression and each child node expression, where the higher the similarity, the higher the probability after softmax, the score function can compute the similarity between the current node and the child nodes in three ways, and V_c is the attention expression; the attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e′ is the vector encoding of the current node, V_c is the attention vector, Reb_q and Reb_c are residual blocks, b is a bias value, ReLU is the ReLU activation function, and e″ is the vector encoding obtained by fusing the residual-block-encoded current node vector with the residual-block-encoded attention vector;
finally, the encoding of the root node is used for classification, and because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain multiple data structure tags;
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, ReLU is the ReLU activation function, and sigmoid is the sigmoid function;
the coding of the root node has a prediction probability and a real probability difference through a sigmoid function, a loss value is generated, and each parameter is updated through reverse gradient propagation, so that the training effect is achieved.
2. The computer-readable medium of claim 1, wherein collecting the plurality of codes marked with data structures comprises collecting, via crawler technology, hundreds of thousands of code samples tagged with their corresponding data structures from the web, wherein the data structures comprise trees, linked lists, and queues.
3. The computer-readable medium of claim 1, wherein the lexical analysis comprises using a lexical analyzer to replace different types of variables in the code with corresponding words, wherein a different lexical analyzer is required for each language because different languages have different grammars.
4. The computer-readable medium of claim 1, wherein the syntactic analysis comprises using, for each language, a corresponding syntax analyzer to convert the lexically analyzed code into an abstract syntax tree.
5. The computer-readable medium of claim 1, wherein encoding the nodes on the tree using an attention mechanism and a residual block comprises:
performing word embedding on words generated after lexical analysis and syntactic analysis to convert the words into real-valued vectors;
all nodes are encoded on the tree using attention mechanisms and residual blocks.
6. The computer-readable medium of claim 1, wherein labeling the code using the encoding comprises:
classifying using the encoding of the root node, wherein, because the code may belong to multiple categories, multiple sigmoid classifiers are used to obtain tags of multiple data structures;
y_i = sigmoid(W2·ReLU(W1·e′_r) + b)
where e′_r is the semantic expression of the root node, W1, W2 and b are parameters to be learned, ReLU is the ReLU activation function, and sigmoid is the sigmoid function.
7. The computer-readable medium of claim 1, wherein predicting the new code using the trained model comprises:
predicting new code using the trained model: given a section of new code, performing lexical analysis on it with a lexical analyzer, replacing the numbers 1 and 1.1 with Num, all variable names with Name, and all character strings with Str; converting the lexically analyzed code into an abstract syntax tree using a syntax analyzer; embedding each node in the abstract syntax tree, such as the Num and Name nodes, i.e., mapping each node to its corresponding real-valued vector; encoding the vector of each node with a residual block to obtain a new encoding; encoding each node sequentially from bottom to top using an attention mechanism; and finally classifying using the encoding of the root node, wherein, because multiple sigmoid classifiers are used, multiple data structures can be predicted: if one classifier predicts the probability of a tag to be greater than 50%, the code belongs to that category, or a threshold is set, e.g., the prediction probability must exceed 70%, before the code is considered to belong to that category.
8. The computer-readable medium of claim 5, wherein using residual block encoding on the tree comprises:
applying a nonlinear transformation to the embedded encoding of each node using the same residual block Reb to obtain a new semantic encoding of the node, as in the following formula:
e′ = Reb_q(e) = LN(W2·ReLU(W1·e) + e)
where e is the embedded encoding of the current node, e ∈ R^(embedding_size); embedding_size is the dimension of each node embedding; W1 ∈ R^(d_i×embedding_size); W2 ∈ R^(embedding_size×d_i); d_i is a hyper-parameter; ReLU is the ReLU activation function; LN is layer normalization; and Reb is the residual block.
9. The computer-readable medium of claim 5, wherein the encoding on the tree using an attention mechanism comprises:
encoding non-leaf nodes on the tree from bottom to top, and using an attention mechanism to compute, from all the child nodes under the current node, the semantic expression most relevant to the current node;
V_c = A·H^T
A = softmax(score(Q, H))
[Equation image: definition of the score function score(Q, H)]
where Q is a matrix formed by stacking n copies of the current node's vector after residual block transformation, H is a matrix formed by stacking the vectors of the n child nodes under the current node after residual block transformation, the score function computes the similarity between the current node expression and each child node expression, where the higher the similarity, the higher the probability after softmax, the score function can compute the similarity in three ways, and V_c is the attention expression; the attention vector and the current node vector are then fused to form a new vector expression of the current node, which contains both the semantic expression of the current node and the semantic expressions of all child nodes, as in the following equation:
e″ = ReLU(Reb_q(e′) + Reb_c(V_c) + b)
where e″ is the expression obtained by fusing the current node vector with the attention vector V_c, ReLU is the ReLU activation function, and Reb is the residual block.
CN202011019000.8A 2019-04-16 2019-04-16 Computer readable storage medium for automatically labeling code with data structure Active CN112148879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019000.8A CN112148879B (en) 2019-04-16 2019-04-16 Computer readable storage medium for automatically labeling code with data structure

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910304797.7A CN110008344B (en) 2019-04-16 2019-04-16 Method for automatically marking data structure label on code
CN202011019000.8A CN112148879B (en) 2019-04-16 2019-04-16 Computer readable storage medium for automatically labeling code with data structure

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910304797.7A Division CN110008344B (en) 2019-04-16 2019-04-16 Method for automatically marking data structure label on code

Publications (2)

Publication Number Publication Date
CN112148879A (en) 2020-12-29
CN112148879B CN112148879B (en) 2023-06-23

Family

ID=67172257

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910304797.7A Active CN110008344B (en) 2019-04-16 2019-04-16 Method for automatically marking data structure label on code
CN202011019000.8A Active CN112148879B (en) 2019-04-16 2019-04-16 Computer readable storage medium for automatically labeling code with data structure

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910304797.7A Active CN110008344B (en) 2019-04-16 2019-04-16 Method for automatically marking data structure label on code

Country Status (1)

Country Link
CN (2) CN110008344B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139054B (en) * 2021-04-21 2023-11-24 南通大学 Code programming language classification method based on Transformer
CN116661805B (en) * 2023-07-31 2023-11-14 腾讯科技(深圳)有限公司 Code representation generation method and device, storage medium and electronic equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050273772A1 (en) * 1999-12-21 2005-12-08 Nicholas Matsakis Method and apparatus of streaming data transformation using code generator and translator
CN101614787A (en) * 2009-07-07 2009-12-30 南京航空航天大学 Analogical Electronics method for diagnosing faults based on M-ary textural classification device
CN102439589A (en) * 2008-10-02 2012-05-02 韩国电子通信研究院 Method and apparatus for encoding and decoding xml documents using path code
CN103064969A (en) * 2012-12-31 2013-04-24 武汉传神信息技术有限公司 Method for automatically creating keyword index table
US20160124723A1 (en) * 2014-10-31 2016-05-05 Weixi Ma Graphically building abstract syntax trees
CN108399158A (en) * 2018-02-05 2018-08-14 华南理工大学 Attribute sensibility classification method based on dependency tree and attention mechanism
CN108446540A (en) * 2018-03-19 2018-08-24 中山大学 Program code based on source code multi-tag figure neural network plagiarizes type detection method and system
WO2018191344A1 (en) * 2017-04-14 2018-10-18 Salesforce.Com, Inc. Neural machine translation with latent tree attention
CN108829823A (en) * 2018-06-13 2018-11-16 北京信息科技大学 A kind of file classification method
CN109033069A (en) * 2018-06-16 2018-12-18 天津大学 A kind of microblogging Topics Crawling method based on Social Media user's dynamic behaviour
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable
CN110188104A (en) * 2019-05-30 2019-08-30 中森云链(成都)科技有限责任公司 A kind of Python program code method for fast searching towards K12 programming

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040068716A1 (en) * 2002-10-04 2004-04-08 Quicksilver Technology, Inc. Retargetable compiler for multiple and different hardware platforms
CN102063488A (en) * 2010-12-29 2011-05-18 南京航空航天大学 Code searching method based on semantics
CN102339252B (en) * 2011-07-25 2014-04-23 大连理工大学 Static state detecting system based on XML (Extensive Makeup Language) middle model and defect mode matching
US10169208B1 (en) * 2014-11-03 2019-01-01 Charles W Moyes Similarity scoring of programs
CN107220180B (en) * 2017-06-08 2020-08-04 电子科技大学 Code classification method based on neural network language model


Also Published As

Publication number Publication date
CN110008344A (en) 2019-07-12
CN110008344B (en) 2020-09-29
CN112148879B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111160008B (en) Entity relationship joint extraction method and system
CN113642330B (en) Rail transit standard entity identification method based on catalogue theme classification
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN113177124B (en) Method and system for constructing knowledge graph in vertical field
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN111897908A (en) Event extraction method and system fusing dependency information and pre-training language model
CN111694924A (en) Event extraction method and system
CN113822026B (en) Multi-label entity labeling method
CN111274790A (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN113168499A (en) Method for searching patent document
CN113254675B (en) Knowledge graph construction method based on self-adaptive few-sample relation extraction
CN113196277A (en) System for retrieving natural language documents
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN110008344B (en) Method for automatically marking data structure label on code
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN117390189A (en) Neutral text generation method based on pre-classifier
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN115906818A (en) Grammar knowledge prediction method, grammar knowledge prediction device, electronic equipment and storage medium
CN115860002A (en) Combat task generation method and system based on event extraction
CN114528459A (en) Semantic-based webpage information extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant