CN112035099A - Vectorization representation method and device for nodes in abstract syntax tree - Google Patents

Vectorization representation method and device for nodes in abstract syntax tree

Info

Publication number
CN112035099A
Authority
CN
China
Prior art keywords
sequence
syntax tree
abstract syntax
processed
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010907349.9A
Other languages
Chinese (zh)
Other versions
CN112035099B (en)
Inventor
董叶豪
刘盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202010907349.9A
Publication of CN112035099A
Application granted
Publication of CN112035099B
Active legal status (Current)
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • G06F8/31Programming languages or programming paradigms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/427Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application provides a vectorization representation method and device for nodes in an abstract syntax tree, relating to the field of computer technology. The vectorization representation method for the nodes in the abstract syntax tree comprises the following steps: first, acquiring an abstract syntax tree to be processed; then, performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence; further, generating a coding sequence to be processed from the first sequence and the second sequence; and finally, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result for the nodes in the abstract syntax tree. In this way, all the nodes in the abstract syntax tree are fully covered, so that the nodes can be vectorized accurately.

Description

Vectorization representation method and device for nodes in abstract syntax tree
Technical Field
The present application relates to the field of computer technologies, and in particular, to a vectorization representation method and apparatus for nodes in an abstract syntax tree.
Background
An Abstract Syntax Tree (AST), or syntax tree, is a tree representation of the abstract syntactic structure of source code data written in a programming language, with each node of the tree representing a construct that appears in the source code data. Existing vectorization methods for the nodes of an abstract syntax tree usually encode only the direct child nodes of each node to obtain its vectorized representation. In practice, because such methods use only the child nodes and discard the sibling and grandchild nodes, node information is lost. Existing vectorization methods therefore cannot represent the nodes of an abstract syntax tree accurately.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for vectorizing the nodes in an abstract syntax tree, which can fully cover all the nodes in the abstract syntax tree and thus vectorize the nodes accurately.
A first aspect of an embodiment of the present application provides a vectorization representation method for nodes in an abstract syntax tree, including:
acquiring an abstract syntax tree to be processed;
performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
generating a coding sequence to be processed according to the first sequence and the second sequence;
and processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
In the implementation process, an abstract syntax tree to be processed is first obtained; then, breadth-first traversal is performed on the abstract syntax tree to obtain a first sequence, and depth-first traversal is performed on the abstract syntax tree to obtain a second sequence; further, a coding sequence to be processed is generated from the first sequence and the second sequence; finally, the coding sequence to be processed is processed through a pre-constructed vectorization processing model to obtain a vectorization representation result for the nodes in the abstract syntax tree. In this way, all the nodes in the abstract syntax tree are fully covered, so that the nodes can be vectorized accurately.
Further, the obtaining the abstract syntax tree to be processed includes:
acquiring source code data to be processed;
and analyzing the source code data to obtain the abstract syntax tree to be processed.
In the implementation process, the abstract syntax tree is obtained by parsing the source code data; when vectorizing the abstract syntax tree, the corresponding vectorized representation can be generated from nothing more than a source code file, so the method is simple and widely applicable.
Further, the generating a coding sequence to be processed according to the first sequence and the second sequence includes:
performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and coding the connecting sequence to obtain a coding sequence to be processed.
In the implementation process, by connecting the first sequence and the second sequence, the resulting connection sequence can cover both the correlations between sibling nodes and the correlations between parent and child nodes in the abstract syntax tree, capturing structural regularities among the nodes, which is favorable for accurately vectorizing the nodes in the abstract syntax tree.
Further, before the processing the to-be-processed coding sequence through the pre-constructed vectorization processing model to obtain the vectorization representation result of the node in the abstract syntax tree, the method further includes:
constructing an original processing model;
acquiring training data and preset model parameters for training the original processing model;
adjusting the original processing model through the preset model parameters to obtain an initial model;
and training the initial model through the training data to obtain a vectorization processing model.
In the implementation process, before the coding sequence to be processed is processed through the pre-constructed vectorization processing model, an original processing model needs to be constructed, and then parameter setting and training are performed on the original processing model through preset model parameters and training data, so that the vectorization processing model is obtained.
Further, the preset model parameters at least comprise a coding dimension value and a preset cost function of the coding sequence to be processed;
adjusting the original processing model through the preset model parameters to obtain an initial model, including:
setting the number of neurons of an output layer of each model unit in the original processing model as the coding dimension value to obtain an initial adjustment model;
and setting the cost function of the initial adjustment model as the preset cost function to obtain an initial model.
In the implementation process, adjusting the original processing model through the preset model parameters helps improve the accuracy of the model and, in turn, the accuracy of the vectorized representation of the abstract syntax tree.
A second aspect of the embodiments of the present application provides a device for vectorizing a node in an abstract syntax tree, where the device for vectorizing a node in an abstract syntax tree includes:
the acquisition module is used for acquiring an abstract syntax tree to be processed;
the traversal module is used for performing breadth-first traversal on the abstract syntax tree to obtain a first sequence and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
the coding module is used for generating a coding sequence to be processed according to the first sequence and the second sequence;
and the model processing module is used for processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
In the implementation process, the acquisition module acquires an abstract syntax tree to be processed; then, the traversal module performs breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performs depth-first traversal on the abstract syntax tree to obtain a second sequence; further, the coding module generates a coding sequence to be processed from the first sequence and the second sequence; finally, the model processing module processes the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result for the nodes in the abstract syntax tree. In this way, all the nodes in the abstract syntax tree are fully covered, so that the nodes can be vectorized accurately.
Further, the obtaining module comprises:
the acquisition submodule is used for acquiring source code data to be processed;
and the analysis submodule is used for analyzing the source code data to obtain the abstract syntax tree to be processed.
In the implementation process, the parsing submodule parses the source code data to obtain the abstract syntax tree; when vectorizing the abstract syntax tree, the corresponding vectorized representation can be generated from nothing more than a source code file obtained by the acquisition submodule, so the method is simple and widely applicable.
Further, the encoding module includes:
the connection submodule is used for performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and the coding submodule is used for coding the connection sequence to obtain a coding sequence to be processed.
In the implementation process, the connection submodule connects the first sequence and the second sequence, so that the resulting connection sequence can cover both the correlations between sibling nodes and the correlations between parent and child nodes in the abstract syntax tree, capturing structural regularities among the nodes, which is favorable for accurately vectorizing the nodes in the abstract syntax tree.
A third aspect of embodiments of the present application provides an electronic device, including a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to make the electronic device execute the vectorization representation method for nodes in an abstract syntax tree according to any one of the first aspect of embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, which stores computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the vectorization representation method for nodes in an abstract syntax tree according to any one of the first aspect of the embodiments of the present application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of a vectorization representation method for nodes in an abstract syntax tree according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a vectorization representation method for nodes in an abstract syntax tree according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a vectorization representation apparatus for nodes in an abstract syntax tree according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a vectorization representation apparatus for nodes in an abstract syntax tree according to a fourth embodiment of the present application;
fig. 5 is an expanded schematic view of an LSTM model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1, fig. 1 is a flowchart illustrating a vectorization representation method for nodes in an abstract syntax tree according to an embodiment of the present disclosure. The vectorization representation method of the nodes in the abstract syntax tree comprises the following steps:
and S101, acquiring an abstract syntax tree to be processed.
In this embodiment, an execution subject of the method may be an electronic device such as a computer, a server, a smart phone, a tablet computer, and the like, which is not limited in this embodiment.
In the embodiment of the present application, an Abstract Syntax Tree (AST), also called a syntax tree, is an abstract representation of the syntactic structure of source code data. The abstract syntax tree represents the syntax structure of the programming language in the form of a tree, with each node on the tree representing a construct in the source code data.
In the embodiment of the present application, the source code data to be processed may be analyzed to obtain the abstract syntax tree to be processed, or the pre-stored abstract syntax tree to be processed may be directly obtained, which is not limited in this embodiment of the present application.
S102, performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence.
In the embodiment of the present application, a breadth-first search algorithm may be used to perform the breadth-first traversal of the abstract syntax tree; specifically, the Dijkstra single-source shortest-path algorithm, the Prim minimum-spanning-tree algorithm, or the like may be used, and the embodiment of the present application is not limited in this respect.
In the embodiment of the present application, the main idea of the Breadth-First Search (BFS) algorithm, also called width-first search, is similar to a level-order traversal of a tree; traversing the abstract syntax tree with BFS captures the correlations between sibling nodes.
In the embodiment of the present application, a depth-first search algorithm may be used to perform the depth-first traversal of the abstract syntax tree, so that the correlations between parent and child nodes can be captured.
In the embodiment of the present application, the main idea of the Depth-First Search (DFS) algorithm is as follows: starting from an unvisited vertex, walk along the edges of the current vertex to an unvisited vertex; when the current vertex has no unvisited neighbors, backtrack to the previous vertex and continue probing other vertices until all vertices have been visited.
In the embodiment of the application, the first sequence and the second sequence are token sequences; a token is a character string, and a token sequence is the sequence of character strings obtained by traversing the abstract syntax tree.
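By way of illustration only, the following Python sketch shows the two traversals over a minimal, hypothetical Node type (a token string plus a list of child nodes); the actual node type depends on the parser that produced the tree:

    from collections import deque

    class Node:
        """Hypothetical AST node: a token string plus a list of child nodes."""
        def __init__(self, token, children=None):
            self.token = token
            self.children = children or []

    def bfs_tokens(root):
        """Breadth-first (level-order) traversal yielding the first sequence."""
        tokens, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            tokens.append(node.token)
            queue.extend(node.children)
        return tokens

    def dfs_tokens(root):
        """Depth-first (pre-order) traversal yielding the second sequence."""
        tokens, stack = [], [root]
        while stack:
            node = stack.pop()
            tokens.append(node.token)
            stack.extend(reversed(node.children))
        return tokens

The breadth-first pass walks level by level (capturing sibling order), while the depth-first pass follows each branch to its leaves (capturing parent-child order).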
After step S102, the method further includes the following steps:
and S103, generating a coding sequence to be processed according to the first sequence and the second sequence.
In the embodiment of the application, the first sequence and the second sequence are connected to obtain a connection sequence, and then the connection sequence is encoded to obtain a coding sequence to be processed.
In this embodiment of the present application, a one-hot encoding algorithm and the like may be used when encoding the connection sequence, which is not limited in this embodiment of the present application.
In the embodiment of the application, the one-hot encoding algorithm, also called one-bit-effective encoding, uses an N-bit state register to encode N states; each state has its own register bit, and only one bit is valid at any time. One-hot encoding thus represents categorical variables as binary vectors.
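The connection and encoding steps can be sketched in Python as follows; building the token vocabulary from the single connection sequence is a simplification assumed here for brevity (in practice the vocabulary would cover the whole training corpus):

    import numpy as np

    def to_one_hot(first_seq, second_seq):
        """Concatenate the two token sequences and one-hot encode the result.

        Returns an array of shape (sequence length, M), where M is the number
        of distinct token types; each row has exactly one bit set.
        """
        connection = list(first_seq) + list(second_seq)  # connection sequence
        vocab = sorted(set(connection))                  # M token types (simplified)
        index = {tok: i for i, tok in enumerate(vocab)}
        codes = np.zeros((len(connection), len(vocab)), dtype=np.float32)
        for row, tok in enumerate(connection):
            codes[row, index[tok]] = 1.0                 # one valid bit per row
        return codes, vocab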
And S104, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
In the embodiment of the present application, the pre-constructed vectorization processing model is a neural network model, and may specifically be a Long Short-Term Memory network (LSTM) model, and the like, which is not limited in this embodiment of the present application.
In the embodiment of the present application, when the vectorization processing model is an LSTM model, the LSTM model is a recurrent neural network comprising an LSTM unit, and the expected output at each time step of the LSTM unit is the token of the next time step. Referring to fig. 5, fig. 5 is an expanded schematic view of an LSTM model according to an embodiment of the present application. As shown in fig. 5, after the LSTM model is expanded, the time steps of the LSTM unit include time step T1, time step T2, and time step T3. When training the LSTM, if a piece of training data is [token1, token2, token3, token4], the expected output at time step T1 of the LSTM unit is the token of the next time step (time step T2), i.e., token2; similarly, the expected output at time step T2 is token3, and the expected output at time step T3 is token4. The LSTM model is trained with the training data to obtain the trained vectorization processing model.
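The shifted next-token training target described above can be illustrated with a small helper (the function name is illustrative only, not part of the described embodiment):

    def next_token_pairs(token_seq):
        """Pair each time step's input with the next step's token as its target."""
        return list(zip(token_seq[:-1], token_seq[1:]))

    # next_token_pairs(["token1", "token2", "token3", "token4"]) yields
    # [("token1", "token2"), ("token2", "token3"), ("token3", "token4")],
    # i.e. the expected output at T1 is token2, at T2 token3, and at T3 token4.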
It can be seen that, by implementing the vectorization representation method for the nodes in the abstract syntax tree described in this embodiment, all the nodes in the abstract syntax tree can be completely covered, and thus the nodes in the abstract syntax tree can be accurately vectorized and represented.
Example 2
Referring to fig. 2, fig. 2 is a flowchart illustrating a vectorization representation method for nodes in an abstract syntax tree according to an embodiment of the present application. As shown in fig. 2, the vectorized representation method of the node in the abstract syntax tree includes:
s201, constructing an original processing model.
In this embodiment of the present application, the original processing model may specifically be a Long Short-Term Memory network (LSTM) model, and the like, which is not limited in this embodiment of the present application.
S202, obtaining training data used for training an original processing model and preset model parameters, wherein the preset model parameters at least comprise a coding dimension value and a preset cost function of a coding sequence to be processed.
S203, setting the number of neurons of the output layer of each model unit in the original processing model as a coding dimension value to obtain an initial adjustment model.
In the embodiment of the present application, the original processing model includes an original LSTM unit, and the expected output at each time step of the original LSTM unit is the token of the next time step; the number of neurons in the output layer at each time step is therefore set equal to the coding dimension value.
In the embodiment of the present application, if the coding dimension value of the coding sequence to be processed is M, the number of neurons in the output layer at each time step is correspondingly set to M.
After step S203, the following steps are also included:
and S204, setting the cost function of the initial adjustment model as a preset cost function to obtain an initial model.
In the embodiment of the present application, the preset cost function is a preset loss function, which may specifically be a cross-entropy function or the like; the embodiment of the present application is not limited in this respect.
In the embodiment of the application, a softmax layer is added after the output layer of the initial adjustment model, and a cross-entropy function is used as the cost function of the initial adjustment model.
In the embodiment of the application, the cross-entropy function, i.e., the cross-entropy loss function, has also been applied to disambiguation in computational linguistics: the true semantics of a sentence are taken as the prior information from the training set, the machine-translated semantics as the posterior information from the test set, and the cross entropy between the two is computed to guide the identification and elimination of ambiguity.
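For illustration, a PyTorch sketch of a model adjusted in this way is given below; the choice of PyTorch and the concrete values of M and N are assumptions for illustration, not part of the described embodiment (note that nn.CrossEntropyLoss already combines the softmax layer with the cross-entropy cost on the logits):

    import torch
    import torch.nn as nn

    class TokenLSTM(nn.Module):
        """LSTM adjusted as in steps S203-S204: the output layer at each time
        step has M neurons (the coding dimension value), trained with a
        softmax/cross-entropy cost."""
        def __init__(self, M, N):
            super().__init__()
            self.lstm = nn.LSTM(input_size=M, hidden_size=N, batch_first=True)
            self.out = nn.Linear(N, M)   # M output neurons per time step

        def forward(self, x):            # x: (batch, time, M) one-hot codes
            hidden, _ = self.lstm(x)     # hidden: (batch, time, N)
            return self.out(hidden)      # logits over the M token types

    model = TokenLSTM(M=128, N=64)       # illustrative sizes only
    criterion = nn.CrossEntropyLoss()    # softmax + cross entropy combined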
In the embodiment of the present application, by implementing the steps S203 to S204, the original processing model can be adjusted by presetting the model parameters, so as to obtain the initial model.
S205, training the initial model through the training data to obtain a vectorization processing model.
In the embodiment of the present application, each column vector of the connection weight matrix between the input layer of the trained vectorization processing model and the hidden layer of the LSTM unit corresponds to the vectorized representation of one token.
In this embodiment of the present application, when a one-hot encoding algorithm is used to encode the connection sequence, if there are M types of character strings (i.e., tokens) in the abstract syntax tree, the one-hot code corresponding to each token at the input layer of the vectorization processing model is an M-dimensional vector. Further, assuming the hidden layer of the LSTM unit has N neurons, the connection weight matrix between the input layer of the vectorization processing model and the hidden layer of the LSTM unit has dimensions (N, M). Because each token at the input layer is a one-hot vector, the trained vectorization processing model compresses each M-dimensional one-hot vector into an N-dimensional column vector, namely a column of the connection weight matrix.
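For illustration, reading a token's vector off the weight matrix can be sketched as follows; the helper is hypothetical, and note that torch.nn.LSTM stacks the four gate blocks, so its weight_ih_l0 has shape (4N, M) rather than (N, M):

    def token_vector(weight_ih, vocab, token):
        """Read a token's vectorized representation off the trained
        input-to-hidden connection weight matrix: the one-hot input selects
        exactly one column, so that column is the token's vector."""
        return weight_ih[:, vocab.index(token)]

    # With the TokenLSTM sketch above, weight_ih = model.lstm.weight_ih_l0
    # has shape (4N, M); slicing its first N rows gives the N-dimensional
    # column vector contributed by the input-gate block.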
After step S205, the following steps are also included:
and S206, acquiring source code data to be processed.
In the embodiments of the present application, source code data (also referred to as a source program) refers to a series of human-readable computer language instructions.
In the embodiment of the present application, the source code data may be C language source code data, and the like, which is not limited to this embodiment of the present application.
And S207, analyzing the source code data to obtain the abstract syntax tree to be processed.
In the embodiment of the application, the source code data can be parsed by a syntax parser to generate the corresponding abstract syntax tree. A parser usually appears as a component of a compiler or interpreter; it checks the syntax of the input and builds a data structure, namely the abstract syntax tree, from the input tokens.
In this embodiment of the application, when the source code data is C-language source code data, a C-language parser such as the C-language source code parser pycparser may be used to parse the source code data; the embodiment of the application is not limited in this respect.
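A minimal pycparser example is sketched below; the sample source string is illustrative, and pycparser expects already-preprocessed C code (no #include directives):

    from pycparser import c_parser

    source = "int add(int a, int b) { return a + b; }"  # illustrative sample

    parser = c_parser.CParser()
    ast = parser.parse(source)   # builds the abstract syntax tree to be processed
    ast.show()                   # prints the tree, one construct per node

    # Each node exposes its children as (name, node) pairs, which is enough
    # to drive the breadth-first and depth-first traversals described above:
    for name, child in ast.children():
        print(name, type(child).__name__)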
In the embodiment of the present application, by implementing the steps S206 to S207, the abstract syntax tree to be processed can be obtained.
After step S207, the following steps are also included:
and S208, performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence.
S209, performing connection processing on the first sequence and the second sequence to obtain a connection sequence.
S210, coding the connection sequence to obtain a coding sequence to be processed.
In the embodiment of the present application, a one-hot encoding algorithm may be adopted to encode the connection sequence.
In the embodiment of the present application, the code sequence to be processed can be generated according to the first sequence and the second sequence by implementing the steps S209 to S210.
S211, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
It can be seen that, by implementing the vectorization representation method for the nodes in the abstract syntax tree described in this embodiment, all the nodes in the abstract syntax tree can be completely covered, and thus the nodes in the abstract syntax tree can be accurately vectorized and represented.
Example 3
Referring to fig. 3, fig. 3 is a schematic structural diagram of a vectorization representation apparatus for a node in an abstract syntax tree according to an embodiment of the present application. As shown in fig. 3, the vectorization representation apparatus of the node in the abstract syntax tree includes:
an obtaining module 310, configured to obtain an abstract syntax tree to be processed.
And the traversal module 320 is configured to perform breadth-first traversal on the abstract syntax tree to obtain a first sequence, and perform depth-first traversal on the abstract syntax tree to obtain a second sequence.
And the encoding module 330 is configured to generate a to-be-processed encoding sequence according to the first sequence and the second sequence.
The model processing module 340 is configured to process the coding sequence to be processed through a pre-constructed vectorization processing model, so as to obtain a vectorization representation result of a node in the abstract syntax tree.
In the embodiment of the present application, for the explanation of the vectorization representation apparatus for nodes in the abstract syntax tree, reference may be made to the description in embodiment 1 or embodiment 2, and details are not repeated in this embodiment.
It can be seen that, by implementing the vectorization representation apparatus for nodes in the abstract syntax tree described in this embodiment, all the nodes in the abstract syntax tree can be fully covered, and thus, the vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
Example 4
Referring to fig. 4, fig. 4 is a schematic structural diagram of a vectorization representation apparatus for a node in an abstract syntax tree according to an embodiment of the present disclosure. The vectorization representation device of the node in the abstract syntax tree shown in fig. 4 is optimized by the vectorization representation device of the node in the abstract syntax tree shown in fig. 3. As shown in fig. 4, the obtaining module 310 includes:
the obtaining submodule 311 is configured to obtain source code data to be processed.
And the parsing submodule 312 is configured to parse the source code data to obtain an abstract syntax tree to be processed.
As an alternative embodiment, the encoding module 330 includes:
the connection submodule 331 is configured to perform connection processing on the first sequence and the second sequence to obtain a connection sequence.
And the coding submodule 332 is used for coding the connection sequence to obtain a coding sequence to be processed.
As an optional implementation, the building module 350 is configured to build an original processing model before the to-be-processed coding sequence is processed through a pre-built vectorization processing model to obtain a vectorization representation result of a node in the abstract syntax tree.
The parameter obtaining module 360 is configured to obtain training data for training the raw processing model and preset model parameters.
And an adjusting module 370, configured to adjust the original processing model by presetting model parameters, so as to obtain an initial model.
And the training module 380 is configured to train the initial model through the training data to obtain a vectorization processing model.
In the embodiment of the application, the preset model parameters at least include a coding dimension value and a preset cost function of the coding sequence to be processed.
As an alternative embodiment, the adjusting module 370 includes:
the first setting submodule 371 is configured to set the number of neurons in the output layer of each model unit in the original processing model as a coding dimension value, so as to obtain an initial adjustment model.
And the second setting submodule 372 is configured to set the cost function of the initial adjustment model as the preset cost function to obtain the initial model.
In the embodiment of the present application, for the explanation of the vectorization representation apparatus for nodes in the abstract syntax tree, reference may be made to the description in embodiment 1 or embodiment 2, and details are not repeated in this embodiment.
It can be seen that, by implementing the vectorization representation apparatus for nodes in the abstract syntax tree described in this embodiment, all the nodes in the abstract syntax tree can be fully covered, and thus, the vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to make the electronic device execute a vectorization representation method for a node in an abstract syntax tree in any one of embodiment 1 or embodiment 2 of the present application.
An embodiment of the present application provides a computer-readable storage medium, which stores computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform a vectorization representation method for a node in an abstract syntax tree according to any one of embodiment 1 or embodiment 2 of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for vectorized representation of nodes in an abstract syntax tree, comprising:
acquiring an abstract syntax tree to be processed;
performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
generating a coding sequence to be processed according to the first sequence and the second sequence;
and processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
2. The method according to claim 1, wherein the obtaining the abstract syntax tree to be processed comprises:
acquiring source code data to be processed;
and analyzing the source code data to obtain the abstract syntax tree to be processed.
3. The method according to claim 1, wherein the generating a sequence of coding to be processed according to the first sequence and the second sequence comprises:
performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and coding the connecting sequence to obtain a coding sequence to be processed.
4. The method as claimed in claim 1, wherein before the processing the coding sequence to be processed through the pre-constructed vectorization processing model to obtain the vectorization representation result of the node in the abstract syntax tree, the method further comprises:
constructing an original processing model;
acquiring training data and preset model parameters for training the original processing model;
adjusting the original processing model through the preset model parameters to obtain an initial model;
and training the initial model through the training data to obtain a vectorization processing model.
5. The method according to claim 4, wherein the predetermined model parameters at least include a coding dimension value and a predetermined cost function of the coding sequence to be processed;
adjusting the original processing model through the preset model parameters to obtain an initial model, including:
setting the number of neurons of an output layer of each model unit in the original processing model as the coding dimension value to obtain an initial adjustment model;
and setting the cost function of the initial adjustment model as the preset cost function to obtain an initial model.
6. An apparatus for vectorizing representation of a node in an abstract syntax tree, the apparatus comprising:
the acquisition module is used for acquiring an abstract syntax tree to be processed;
the traversal module is used for performing breadth-first traversal on the abstract syntax tree to obtain a first sequence and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
the coding module is used for generating a coding sequence to be processed according to the first sequence and the second sequence;
and the model processing module is used for processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
7. The apparatus for vectorized representation of nodes in an abstract syntax tree as claimed in claim 6, wherein said retrieving module comprises:
the acquisition submodule is used for acquiring source code data to be processed;
and the analysis submodule is used for analyzing the source code data to obtain the abstract syntax tree to be processed.
8. The apparatus for vectorized representation of nodes in an abstract syntax tree as claimed in claim 6, wherein said encoding module comprises:
the connection submodule is used for performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and the coding submodule is used for coding the connection sequence to obtain a coding sequence to be processed.
9. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the vectorized representation method of nodes in an abstract syntax tree according to any one of claims 1 to 5.
10. A readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the method of vectorized representation of nodes in an abstract syntax tree according to any one of claims 1 to 5.
CN202010907349.9A 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree Active CN112035099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010907349.9A CN112035099B (en) 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010907349.9A CN112035099B (en) 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree

Publications (2)

Publication Number Publication Date
CN112035099A true CN112035099A (en) 2020-12-04
CN112035099B CN112035099B (en) 2024-03-15

Family

ID=73591046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010907349.9A Active CN112035099B (en) 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree

Country Status (1)

Country Link
CN (1) CN112035099B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113797545A (en) * 2021-08-25 2021-12-17 广州三七网络科技有限公司 Game script processing method and device, computer equipment and storage medium
CN114347039A (en) * 2022-02-14 2022-04-15 北京航空航天大学杭州创新研究院 Robot control method and related device
CN117171053A (en) * 2023-11-01 2023-12-05 睿思芯科(深圳)技术有限公司 Test method, system and related equipment for vectorized programming

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516041A (en) * 2017-08-17 2017-12-26 北京安普诺信息技术有限公司 WebShell detection methods and its system based on deep neural network
CN108369500A (en) * 2015-12-14 2018-08-03 数据仓库投资有限公司 Extended field is specialized
CN109582352A (en) * 2018-10-19 2019-04-05 北京硅心科技有限公司 A kind of code completion method and system based on double AST sequences
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN111090461A (en) * 2019-11-18 2020-05-01 中山大学 Code annotation generation method based on machine translation model
CN111562920A (en) * 2020-06-08 2020-08-21 腾讯科技(深圳)有限公司 Method and device for determining similarity of small program codes, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729326B (en) * 2017-09-25 2020-12-25 沈阳航空航天大学 Multi-BiRNN coding-based neural machine translation method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108369500A (en) * 2015-12-14 2018-08-03 数据仓库投资有限公司 Extended field is specialized
CN107516041A (en) * 2017-08-17 2017-12-26 北京安普诺信息技术有限公司 WebShell detection methods and its system based on deep neural network
CN109582352A (en) * 2018-10-19 2019-04-05 北京硅心科技有限公司 A kind of code completion method and system based on double AST sequences
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN111090461A (en) * 2019-11-18 2020-05-01 中山大学 Code annotation generation method based on machine translation model
CN111562920A (en) * 2020-06-08 2020-08-21 腾讯科技(深圳)有限公司 Method and device for determining similarity of small program codes, server and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113797545A (en) * 2021-08-25 2021-12-17 广州三七网络科技有限公司 Game script processing method and device, computer equipment and storage medium
CN114347039A (en) * 2022-02-14 2022-04-15 北京航空航天大学杭州创新研究院 Robot control method and related device
CN114347039B (en) * 2022-02-14 2023-09-22 北京航空航天大学杭州创新研究院 Robot look-ahead control method and related device
CN117171053A (en) * 2023-11-01 2023-12-05 睿思芯科(深圳)技术有限公司 Test method, system and related equipment for vectorized programming
CN117171053B (en) * 2023-11-01 2024-02-20 睿思芯科(深圳)技术有限公司 Test method, system and related equipment for vectorized programming

Also Published As

Publication number Publication date
CN112035099B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
Shido et al. Automatic source code summarization with extended tree-lstm
CN107516041B (en) WebShell detection method and system based on deep neural network
Gaddy et al. What's going on in neural constituency parsers? an analysis
Locascio et al. Neural generation of regular expressions from natural language with minimal domain knowledge
CN112035099B (en) Vectorization representation method and device for nodes in abstract syntax tree
CN111680494B (en) Similar text generation method and device
CN113901799B (en) Model training method, text prediction method, model training device, text prediction device, electronic equipment and medium
Lin et al. Critical behavior from deep dynamics: a hidden dimension in natural language
CN112035165B (en) Code clone detection method and system based on isomorphic network
CN107451106A (en) Text method and device for correcting, electronic equipment
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN114489669A (en) Python language code fragment generation method based on graph learning
CN114201406B (en) Code detection method, system, equipment and storage medium based on open source component
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN112579469A (en) Source code defect detection method and device
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN114064117A (en) Code clone detection method and system based on byte code and neural network
CN116629211B (en) Writing method and system based on artificial intelligence
Zhang et al. Disentangled representation for long-tail senses of word sense disambiguation
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
WO2019163752A1 (en) Morpheme analysis learning device, morpheme analysis device, method, and program
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium
Črepinšek et al. Inferring context-free grammars for domain-specific languages
CN115114627B (en) Malicious software detection method and device
Fang et al. Adaptive Code Completion with Meta-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant