CN112035099B - Vectorization representation method and device for nodes in abstract syntax tree - Google Patents

Vectorization representation method and device for nodes in abstract syntax tree

Info

Publication number
CN112035099B
Authority
CN
China
Prior art keywords
syntax tree
sequence
abstract syntax
nodes
processed
Prior art date
Legal status
Active
Application number
CN202010907349.9A
Other languages
Chinese (zh)
Other versions
CN112035099A (en)
Inventor
董叶豪
刘盈
Current Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN202010907349.9A
Publication of CN112035099A
Application granted
Publication of CN112035099B
Status: Active
Anticipated expiration

Classifications

    • G06F8/31 Programming languages or programming paradigms (creation or generation of source code)
    • G06F8/427 Parsing (compilation; syntactic analysis)
    • G06F8/44 Encoding (compilation; transformation of program code)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application provides a vectorization representation method and device for nodes in an abstract syntax tree, relating to the technical field of computers and comprising the following steps: firstly, obtaining an abstract syntax tree to be processed; then performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence; further, generating a coding sequence to be processed according to the first sequence and the second sequence; and finally, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree. Therefore, the method can comprehensively cover all nodes in the abstract syntax tree, and further accurately vectorize the nodes in the abstract syntax tree.

Description

Vectorization representation method and device for nodes in abstract syntax tree
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a vectorization representation method and apparatus for nodes in an abstract syntax tree.
Background
An Abstract Syntax Tree (AST), or syntax tree, is a tree representation of the abstract syntactic structure of source code data written in a programming language, each node of the tree representing a construct that appears in the source code data. Existing vectorization representation methods for the nodes in an abstract syntax tree generally encode only the child nodes of a node directly to obtain the vectorized representation of that node. In practice, it has been found that because these methods use only a node's child nodes and discard its sibling and grandchild nodes, node information is lost. Therefore, existing vectorization representation methods cannot accurately vectorize the nodes in an abstract syntax tree.
Disclosure of Invention
The embodiment of the application aims to provide a vectorization representation method and device for nodes in an abstract syntax tree, which can fully cover all the nodes in the abstract syntax tree and further accurately vectorize the nodes in the abstract syntax tree.
An embodiment of the present application provides a vectorized representation method of nodes in an abstract syntax tree, including:
acquiring an abstract syntax tree to be processed;
performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
generating a coding sequence to be processed according to the first sequence and the second sequence;
and processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
In the implementation process, firstly, an abstract syntax tree to be processed is obtained; then breadth-first traversal is performed on the abstract syntax tree to obtain a first sequence, and depth-first traversal is performed on the abstract syntax tree to obtain a second sequence; further, a coding sequence to be processed is generated according to the first sequence and the second sequence; and finally, the coding sequence to be processed is processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree. Therefore, the method can comprehensively cover all nodes in the abstract syntax tree, and further accurately vectorize the nodes in the abstract syntax tree.
Further, the obtaining the abstract syntax tree to be processed includes:
acquiring source code data to be processed;
and analyzing the source code data to obtain an abstract syntax tree to be processed.
In the implementation process, the abstract syntax tree can be obtained by parsing the source code data, so that when vectorization representation of the abstract syntax tree is carried out, the corresponding vectorization representation can be generated from the source code data file alone; the method is therefore simple and has a wide application range.
Further, the generating a coding sequence to be processed according to the first sequence and the second sequence includes:
performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and encoding the connection sequence to obtain a coding sequence to be processed.
In the implementation process, the first sequence and the second sequence are connected, and the obtained connection sequence can cover the correlation between sibling nodes and between parent and child nodes in the abstract syntax tree, capture some structural rules existing between nodes in the abstract syntax tree, and facilitate accurate vectorization representation of the nodes in the abstract syntax tree.
Further, before the coding sequence to be processed is processed through the pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree, the method further includes:
constructing an original processing model;
acquiring training data and preset model parameters for training the original processing model;
adjusting the original processing model through the preset model parameters to obtain an initial model;
and training the initial model through the training data to obtain a vectorization processing model.
In the implementation process, before the coding sequence to be processed is processed through the vectorization processing model which is built in advance, an original processing model is also required to be built, and then parameter setting and training are carried out on the original processing model through preset model parameters and training data, so that the vectorization processing model is obtained.
Further, the preset model parameters at least comprise a coding dimension value and a preset cost function of the coding sequence to be processed;
adjusting the original processing model through the preset model parameters to obtain an initial model, wherein the method comprises the following steps:
setting the number of neurons of an output layer of each model unit in the original processing model as the coding dimension value to obtain an initial adjustment model;
setting the cost function of the initial adjustment model as the preset cost function to obtain an initial model.
In the implementation process, the original processing model is adjusted through the preset model parameters, so that the accuracy of the model is improved, and the accuracy of vectorization representation of the abstract syntax tree is improved.
A second aspect of the present embodiment provides a vectorized representation apparatus of a node in an abstract syntax tree, where the vectorized representation apparatus of a node in an abstract syntax tree includes:
the acquisition module is used for acquiring an abstract syntax tree to be processed;
the traversing module is used for performing breadth-first traversing on the abstract syntax tree to obtain a first sequence, and performing depth-first traversing on the abstract syntax tree to obtain a second sequence;
the coding module is used for generating a coding sequence to be processed according to the first sequence and the second sequence;
and the model processing module is used for processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree.
In the implementation process, an acquisition module acquires an abstract syntax tree to be processed; then the traversing module performs breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performs depth-first traversal on the abstract syntax tree to obtain a second sequence; further, the coding module generates a coding sequence to be processed according to the first sequence and the second sequence; and finally, the model processing module processes the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree. Therefore, the device can comprehensively cover all nodes in the abstract syntax tree, and further accurately vectorize the nodes in the abstract syntax tree.
Further, the acquisition module includes:
the acquisition sub-module is used for acquiring source code data to be processed;
and the analysis sub-module is used for analyzing the source code data to obtain an abstract syntax tree to be processed.
In the implementation process, the parsing submodule can obtain the abstract syntax tree by parsing the source code data, so that when vectorization representation of the abstract syntax tree is carried out, the corresponding vectorization representation can be generated from the source code data file acquired by the acquisition submodule alone; the method is therefore simple and has a wide application range.
Further, the encoding module includes:
the connection submodule is used for carrying out connection processing on the first sequence and the second sequence to obtain a connection sequence;
and the coding submodule is used for encoding the connection sequence to obtain a coding sequence to be processed.
In the implementation process, by connecting the first sequence and the second sequence, the connection submodule can cover the correlation between sibling nodes and between parent and child nodes in the abstract syntax tree, capture some structural rules existing between nodes in the abstract syntax tree, and facilitate accurate vectorization representation of the nodes in the abstract syntax tree.
A third aspect of the embodiments of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform the vectorized representation method of the nodes in the abstract syntax tree according to any one of the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing computer program instructions that, when read and executed by a processor, perform the method for vectorizing nodes in an abstract syntax tree according to any one of the first aspects of the embodiments of the present application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a vectorization representation method of nodes in an abstract syntax tree according to an embodiment of the present application;
fig. 2 is a flowchart of a vectorization representation method of nodes in an abstract syntax tree according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a vectorization representation apparatus for nodes in an abstract syntax tree according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a vectorization representation apparatus for nodes in an abstract syntax tree according to a fourth embodiment of the present application;
fig. 5 is an expanded schematic diagram of an LSTM model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a method for vectorizing a node in an abstract syntax tree according to an embodiment of the present application. The vectorization representation method of the nodes in the abstract syntax tree comprises the following steps:
s101, acquiring an abstract syntax tree to be processed.
In this embodiment of the present application, the execution body of the method may be an electronic device such as a computer, a server, a smart phone, a tablet computer, and the like, which is not limited in this embodiment.
In the embodiment of the application, an abstract syntax tree (Abstract Syntax Tree, AST), also called a syntax tree (Syntax Tree), is an abstract representation of the syntactic structure of source code data. The abstract syntax tree represents the syntax structure of the programming language in the form of a tree, each node on the tree representing a structure in the source code data.
In the embodiment of the present application, the source code data to be processed may be parsed to obtain the abstract syntax tree to be processed, or the pre-stored abstract syntax tree to be processed may be directly obtained, which is not limited to the embodiment of the present application.
S102, performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence.
In this embodiment of the present application, breadth-first traversal is performed on the abstract syntax tree using a breadth-first search algorithm; specifically, the Dijkstra single-source shortest path algorithm, the Prim minimum spanning tree algorithm, or the like may be adopted, and the embodiment of the present application is not limited to these.
In the embodiment of the application, the main idea of the breadth-first search algorithm (Breadth First Search, BFS) is similar to a level-order traversal of a tree; traversing the abstract syntax tree through the BFS algorithm can capture the correlation between sibling nodes.
In the embodiment of the application, depth-first traversal is performed on the abstract syntax tree, and a depth-first search algorithm can be adopted, so that correlation between parent-child nodes can be captured.
In the embodiment of the application, the main idea of the depth-first search algorithm (Depth First Search, DFS) is: first, take an unvisited vertex as the starting vertex and walk along the edges of the current vertex to unvisited vertices; when no unvisited vertex remains, backtrack to the previous vertex and continue from its other unvisited vertices, until all the vertices have been visited.
In the embodiment of the application, the first sequence and the second sequence are token sequences; a token here is a character string, and a token sequence is the sequence of character strings obtained by traversing the abstract syntax tree.
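For illustration, the two traversals and the token sequences they produce can be sketched in Python as follows (a minimal sketch; the Node class and the token names are illustrative stand-ins, not the output of any particular parser):

    from collections import deque

    class Node:  # illustrative AST node
        def __init__(self, token, children=None):
            self.token = token             # node type string, e.g. "FuncDef"
            self.children = children or []

    def bfs_tokens(root):
        # breadth-first traversal: visits siblings together (first sequence)
        order, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            order.append(node.token)
            queue.extend(node.children)
        return order

    def dfs_tokens(root):
        # depth-first (pre-order) traversal: follows parent-child edges (second sequence)
        order, stack = [], [root]
        while stack:
            node = stack.pop()
            order.append(node.token)
            stack.extend(reversed(node.children))
        return order

    tree = Node("FuncDef", [Node("Decl", [Node("TypeDecl")]),
                            Node("Compound", [Node("Return")])])
    first = bfs_tokens(tree)    # ['FuncDef', 'Decl', 'Compound', 'TypeDecl', 'Return']
    second = dfs_tokens(tree)   # ['FuncDef', 'Decl', 'TypeDecl', 'Compound', 'Return']
    connection = first + second # the two sequences concatenated before encoding (step S103)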
After step S102, the method further includes the steps of:
s103, generating a coding sequence to be processed according to the first sequence and the second sequence.
In the embodiment of the application, the first sequence and the second sequence are connected to obtain a connection sequence, and then the connection sequence is encoded to obtain a coding sequence to be processed.
In the embodiment of the present application, when the connection sequence is encoded, a one-hot encoding algorithm may be used, which is not limited to this embodiment of the present application.
In the embodiment of the application, the one-hot encoding algorithm, also called one-bit-effective encoding, mainly uses an N-bit state register to encode N states, where each state is represented by its own independent register bit and only one bit is valid at any time. One-hot encoding is thus a representation of categorical variables as binary vectors.
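As a sketch, one-hot encoding a token sequence could look as follows (assuming the register width N is the number of distinct tokens; the helper names are illustrative):

    def one_hot_encode(tokens):
        vocab = sorted(set(tokens))                      # N distinct states
        index = {tok: i for i, tok in enumerate(vocab)}
        vectors = []
        for tok in tokens:
            vec = [0] * len(vocab)
            vec[index[tok]] = 1                          # only one bit is valid at any time
            vectors.append(vec)
        return vectors, index

    vectors, index = one_hot_encode(["FuncDef", "Decl", "Compound", "Decl"])
    # three distinct tokens, so each token becomes a 3-dimensional binary vector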
S104, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree.
In this embodiment of the present application, the pre-constructed vectorization processing model is a neural network model, and may specifically be a Long Short-Term Memory (LSTM) model, etc., which is not limited to this embodiment of the present application.
In this embodiment, when the vectorization processing model is an LSTM model, the LSTM model is a recurrent neural network that processes the sequence over time and includes LSTM units, where the output of each time step of the LSTM unit is expected to be the token of the next time step. Referring to fig. 5, fig. 5 is an expanded schematic diagram of an LSTM model according to an embodiment of the present application. As shown in fig. 5, after the LSTM model is expanded, the output time steps of the LSTM unit include a T1 time step, a T2 time step and a T3 time step. When the LSTM is trained, if a training sample is [token1, token2, token3, token4], the output of the T1 time step of the LSTM unit is expected to be the token of the next time step (the T2 time step), namely token2; similarly, the output of the T2 time step is expected to be token3, and the output of the T3 time step is expected to be token4. The LSTM model is trained through training data to obtain a trained vectorization processing model.
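The expected pairing of inputs and outputs in this example can be written out as a short sketch (token names as in fig. 5):

    sequence = ["token1", "token2", "token3", "token4"]  # one training sample
    inputs = sequence[:-1]    # fed to time steps T1, T2, T3
    targets = sequence[1:]    # expected outputs: token2, token3, token4
    pairs = list(zip(inputs, targets))
    # [('token1', 'token2'), ('token2', 'token3'), ('token3', 'token4')]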
Therefore, by implementing the vectorization representation method of the nodes in the abstract syntax tree described in the embodiment, all the nodes in the abstract syntax tree can be covered in a full-scale manner, and further vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
Example 2
Referring to fig. 2, fig. 2 is a flowchart of a vectorization representation method of nodes in an abstract syntax tree according to an embodiment of the present application. As shown in fig. 2, the vectorization representation method of the nodes in the abstract syntax tree includes:
s201, constructing an original processing model.
In this embodiment of the present application, the original processing model may specifically be a Long Short-Term Memory (LSTM) model, etc., which is not limited to this embodiment of the present application.
S202, training data for training an original processing model and preset model parameters are obtained, wherein the preset model parameters at least comprise a coding dimension value and a preset cost function of a coding sequence to be processed.
S203, setting the number of neurons of an output layer of each model unit in the original processing model as a coding dimension value to obtain an initial adjustment model.
In this embodiment of the present application, the original processing model includes an original LSTM unit, the output of each time step of the original LSTM unit is expected to be the token of the next time step, and the number of neurons of the output layer at each time step is set to be the same as the coding dimension value.
In this embodiment of the present application, the coding dimension value for performing coding processing on the coding sequence to be processed is set to be M, and correspondingly, the number of neurons in the output layer in each time step is set to be M.
After step S203, the method further includes the steps of:
s204, setting the cost function of the initial adjustment model as a preset cost function to obtain an initial model.
In this embodiment of the present application, the preset cost function, that is, the preset loss function, may specifically be a cross entropy function, etc., which is not limited to this embodiment of the present application.
In the embodiment of the application, a softmax layer is added after the output layer of the initial adjustment model, and a cross entropy function is used as a cost function of the initial adjustment model.
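For illustration, the adjusted model could be sketched with Keras as follows (the embodiment does not name a framework, so the library, the optimizer and the illustrative values of M and N are assumptions):

    import tensorflow as tf

    M, N = 50, 32  # illustrative: coding dimension value M, hidden-layer neurons N

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, M)),                 # one-hot tokens of dimension M
        tf.keras.layers.LSTM(N, return_sequences=True),  # hidden layer of the LSTM unit
        tf.keras.layers.Dense(M, activation="softmax"),  # output layer with M neurons, softmax added
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")  # preset cost function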
In the embodiment of the application, the cross entropy function, namely the cross entropy loss function, measures the difference between two probability distributions. Cross entropy has been introduced into the field of computational-linguistic disambiguation: the true semantics of sentences are taken as the prior information of a training set, the machine-translated semantics are taken as the posterior information of a test set, the cross entropy of the two is calculated, and the cross entropy is used to guide the recognition and elimination of ambiguity.
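For reference, for a true distribution p and a predicted distribution q over tokens, the cross entropy is

    H(p, q) = -Σ_x p(x) · log q(x)

In the training described here, p is the one-hot vector of the expected next token and q is the softmax output of the corresponding time step.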
In the embodiment of the present application, the steps S203 to S204 are implemented, and the original processing model can be adjusted by the preset model parameters to obtain the initial model.
S205, training the initial model through training data to obtain a vectorization processing model.
In this embodiment of the present application, each column vector of the connection weight matrix between the input layer of the trained vectorization processing model and the hidden layer of the LSTM unit corresponds to a vectorization representation of a token.
In this embodiment of the present application, when a one-hot encoding algorithm is used to encode the connection sequence, if there are M distinct types of character strings (i.e., tokens) in the abstract syntax tree, the one-hot code corresponding to each token at the input layer of the vectorization processing model is an M-dimensional column vector. Further, if the hidden layer of the LSTM unit has N neurons, the dimension of the connection weight matrix between the input layer of the vectorization processing model and the hidden layer of the LSTM unit is (N, M). Because each token at the input layer is a one-hot vector, the trained vectorization processing model compresses each M-dimensional one-hot vector into an N-dimensional column vector, which is a column vector in the connection weight matrix.
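Reading a token's vectorized representation out of such a matrix reduces to taking the column at the token's one-hot index, as in this sketch (W is a random stand-in for the trained connection weight matrix, and the token-to-index mapping is illustrative):

    import numpy as np

    N, M = 32, 50
    W = np.random.rand(N, M)   # stand-in for the trained (N, M) connection weight matrix
    index = {"FuncDef": 7}     # illustrative token-to-column mapping from one-hot encoding

    token_vector = W[:, index["FuncDef"]]  # N-dimensional vectorized representation
    assert token_vector.shape == (N,)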
After step S205, the method further includes the steps of:
s206, acquiring the source code data to be processed.
In the present embodiment, source code data (also referred to as a source program) refers to a series of human-readable computer language instructions.
In this embodiment of the present application, the source code data may be C language source code data, etc., which is not limited to this embodiment of the present application.
S207, analyzing the source code data to obtain an abstract syntax tree to be processed.
In the embodiment of the application, the source code data can be parsed by a parser to generate the corresponding abstract syntax tree. A parser usually appears as a component of a compiler or interpreter; it checks the syntax of its input and constructs a data structure (i.e., an abstract syntax tree) from the input tokens.
In this embodiment of the present application, when the source code data is C language source code data, a C language parser may be used to parse the source code data, where the C language parser may be a C language source code analysis library Pycparser, etc., and the embodiment of the present application is not limited to this.
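As an illustration, parsing C language source code data with pycparser could look as follows (a minimal sketch; the source string is illustrative, and real-world input would first need preprocessing):

    from pycparser import c_parser

    source = "int add(int a, int b) { return a + b; }"
    parser = c_parser.CParser()
    ast = parser.parse(source)  # abstract syntax tree to be processed
    ast.show()                  # prints the tree of nodes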
In the embodiment of the present application, the steps S206 to S207 are implemented, so that the abstract syntax tree to be processed can be obtained.
After step S207, the following steps are further included:
s208, performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence.
S209, connecting the first sequence and the second sequence to obtain a connecting sequence.
S210, coding the connection sequence to obtain a coding sequence to be processed.
In the embodiment of the application, the connection sequence can be encoded by adopting a one-hot encoding algorithm.
In the embodiment of the present application, the steps S209 to S210 are performed, and the code sequence to be processed can be generated according to the first sequence and the second sequence.
S211, processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree.
Therefore, by implementing the vectorization representation method of the nodes in the abstract syntax tree described in the embodiment, all the nodes in the abstract syntax tree can be covered in a full-scale manner, and further vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
Example 3
Referring to fig. 3, fig. 3 is a schematic structural diagram of a vectorized representation device of nodes in an abstract syntax tree according to an embodiment of the present application. As shown in fig. 3, the vectorized representation device of the nodes in the abstract syntax tree includes:
an obtaining module 310, configured to obtain an abstract syntax tree to be processed.
The traversing module 320 is configured to perform breadth-first traversing on the abstract syntax tree to obtain a first sequence, and perform depth-first traversing on the abstract syntax tree to obtain a second sequence.
The encoding module 330 is configured to generate a to-be-processed encoding sequence according to the first sequence and the second sequence.
The model processing module 340 is configured to process the coding sequence to be processed through a pre-constructed vectorization processing model, so as to obtain a vectorization representation result of the nodes in the abstract syntax tree.
In this embodiment of the present application, the explanation of the vectorized representation device of the node in the abstract syntax tree may refer to the description in embodiment 1 or embodiment 2, and the description is not repeated in this embodiment.
Therefore, the vectorization representation device for the nodes in the abstract syntax tree described in the embodiment can be implemented to cover all the nodes in the abstract syntax tree in a full-scale manner, so that vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
Example 4
Referring to fig. 4 together, fig. 4 is a schematic structural diagram of a vectorization representation apparatus for nodes in an abstract syntax tree according to an embodiment of the present application. The vectorization representation device of the nodes in the abstract syntax tree shown in fig. 4 is optimized by the vectorization representation device of the nodes in the abstract syntax tree shown in fig. 3. As shown in fig. 4, the acquisition module 310 includes:
the obtaining sub-module 311 is configured to obtain source code data to be processed.
The parsing sub-module 312 is configured to parse the source code data to obtain an abstract syntax tree to be processed.
As an alternative embodiment, the encoding module 330 includes:
and the connection submodule 331 is used for performing connection processing on the first sequence and the second sequence to obtain a connection sequence.
The coding submodule 332 is configured to perform coding processing on the connection sequence to obtain a coding sequence to be processed.
As an optional implementation manner, the construction module 350 is configured to construct an original processing model before processing the coding sequence to be processed by using a pre-constructed vectorization processing model to obtain a vectorization representation result of the nodes in the abstract syntax tree.
The parameter acquisition module 360 is configured to acquire training data for training the raw processing model and preset model parameters.
The adjusting module 370 is configured to adjust the original processing model through preset model parameters, so as to obtain an initial model.
The training module 380 is configured to train the initial model through training data to obtain a vectorized processing model.
In this embodiment of the present application, the preset model parameters at least include a coding dimension value of the coding sequence to be processed and a preset cost function.
As an alternative embodiment, the adjustment module 370 includes:
the first setting submodule 371 is configured to set the number of neurons of an output layer of each model unit in the original processing model as a coding dimension value, so as to obtain an initial adjustment model.
The second setting submodule 372 is configured to set the cost function of the initial adjustment model as a preset cost function to obtain an initial model.
In this embodiment of the present application, the explanation of the vectorized representation device of the node in the abstract syntax tree may refer to the description in embodiment 1 or embodiment 2, and the description is not repeated in this embodiment.
Therefore, the vectorization representation device for the nodes in the abstract syntax tree described in the embodiment can be implemented to cover all the nodes in the abstract syntax tree in a full-scale manner, so that vectorization representation can be accurately performed on the nodes in the abstract syntax tree.
An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform a vectorized representation method of a node in an abstract syntax tree according to any one of embodiment 1 or embodiment 2 of the present application.
Embodiments of the present application provide a computer readable storage medium storing computer program instructions that, when read and executed by a processor, perform the method of vectorizing the nodes in the abstract syntax tree of any one of embodiments 1 or 2 of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (8)

1. A method for vectorizing representation of nodes in an abstract syntax tree, comprising:
acquiring an abstract syntax tree to be processed;
performing breadth-first traversal on the abstract syntax tree to obtain a first sequence, and performing depth-first traversal on the abstract syntax tree to obtain a second sequence;
generating a coding sequence to be processed according to a one-hot coding algorithm, the first sequence and the second sequence;
processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree;
wherein the generating a coding sequence to be processed according to the one-hot coding algorithm, the first sequence and the second sequence includes:
performing connection processing on the first sequence and the second sequence to obtain a connection sequence;
and encoding the connection sequence according to a one-hot coding algorithm to obtain a coding sequence to be processed.
2. The method for vectorizing representation of nodes in an abstract syntax tree according to claim 1, wherein said obtaining an abstract syntax tree to be processed comprises:
acquiring source code data to be processed;
and analyzing the source code data to obtain an abstract syntax tree to be processed.
3. The method for vectorizing the nodes in the abstract syntax tree according to claim 1, wherein before the coding sequence to be processed is processed by the vectorizing processing model constructed in advance, the method further comprises:
constructing an original processing model;
acquiring training data and preset model parameters for training the original processing model;
adjusting the original processing model through the preset model parameters to obtain an initial model;
and training the initial model through the training data to obtain a vectorization processing model.
4. A method of vectorizing representation of nodes in an abstract syntax tree according to claim 3, wherein the pre-set model parameters comprise at least a coding dimension value and a pre-set cost function of the coding sequence to be processed;
adjusting the original processing model through the preset model parameters to obtain an initial model, wherein the method comprises the following steps:
setting the number of neurons of an output layer of each model unit in the original processing model as the coding dimension value to obtain an initial adjustment model;
setting the cost function of the initial adjustment model as the preset cost function to obtain an initial model.
5. A vectorized representation apparatus of nodes in an abstract syntax tree, the vectorized representation apparatus of nodes in the abstract syntax tree comprising:
the acquisition module is used for acquiring an abstract syntax tree to be processed;
the traversing module is used for performing breadth-first traversing on the abstract syntax tree to obtain a first sequence, and performing depth-first traversing on the abstract syntax tree to obtain a second sequence;
the coding module is used for generating a coding sequence to be processed according to a one-hot coding algorithm, the first sequence and the second sequence;
the model processing module is used for processing the coding sequence to be processed through a pre-constructed vectorization processing model to obtain vectorization representation results of nodes in the abstract syntax tree;
wherein the encoding module comprises:
the connection submodule is used for carrying out connection processing on the first sequence and the second sequence to obtain a connection sequence;
and the coding submodule is used for encoding the connection sequence to obtain a coding sequence to be processed.
6. The apparatus for vectorized representation of nodes in an abstract syntax tree according to claim 5, wherein the acquisition module comprises:
the acquisition sub-module is used for acquiring source code data to be processed;
and the analysis sub-module is used for analyzing the source code data to obtain an abstract syntax tree to be processed.
7. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the vectorized representation method of nodes in an abstract syntax tree according to any one of claims 1 to 4.
8. A readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the vectorized representation method of nodes in an abstract syntax tree according to any one of claims 1 to 4.
CN202010907349.9A 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree Active CN112035099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010907349.9A CN112035099B (en) 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree

Publications (2)

Publication Number Publication Date
CN112035099A (en) 2020-12-04
CN112035099B (en) 2024-03-15

Family

ID=73591046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010907349.9A Active CN112035099B (en) 2020-09-01 2020-09-01 Vectorization representation method and device for nodes in abstract syntax tree

Country Status (1)

Country Link
CN (1) CN112035099B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113797545A (en) * 2021-08-25 2021-12-17 广州三七网络科技有限公司 Game script processing method and device, computer equipment and storage medium
CN114347039B (en) * 2022-02-14 2023-09-22 北京航空航天大学杭州创新研究院 Robot look-ahead control method and related device
CN117171053B (en) * 2023-11-01 2024-02-20 睿思芯科(深圳)技术有限公司 Test method, system and related equipment for vectorized programming

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108369500A (en) * 2015-12-14 2018-08-03 数据仓库投资有限公司 Extended field is specialized
CN107516041A (en) * 2017-08-17 2017-12-26 北京安普诺信息技术有限公司 WebShell detection methods and its system based on deep neural network
CN107729326A (en) * 2017-09-25 2018-02-23 沈阳航空航天大学 Neural machine translation method based on Multi BiRNN codings
CN109582352A (en) * 2018-10-19 2019-04-05 北京硅心科技有限公司 A kind of code completion method and system based on double AST sequences
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN111090461A (en) * 2019-11-18 2020-05-01 中山大学 Code annotation generation method based on machine translation model
CN111562920A (en) * 2020-06-08 2020-08-21 腾讯科技(深圳)有限公司 Method and device for determining similarity of small program codes, server and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵彦博; 张星; 周毅楷; 邹德昊. 一种优化GCC抽象语法树的方法 [A method for optimizing the GCC abstract syntax tree]. 电子技术与软件工程 [Electronic Technology & Software Engineering], (06), pp. 239-241. *

Also Published As

Publication number Publication date
CN112035099A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112035099B (en) Vectorization representation method and device for nodes in abstract syntax tree
Gaddy et al. What's going on in neural constituency parsers? an analysis
CN107516041B (en) WebShell detection method and system based on deep neural network
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and transform encoder
CN111680494B (en) Similar text generation method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113901799B (en) Model training method, text prediction method, model training device, text prediction device, electronic equipment and medium
CN113190849A (en) Webshell script detection method and device, electronic equipment and storage medium
CN112579469A (en) Source code defect detection method and device
CN112035165A (en) Code clone detection method and system based on homogeneous network
CN114489669A (en) Python language code fragment generation method based on graph learning
CN112580346A (en) Event extraction method and device, computer equipment and storage medium
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN115329766B (en) Named entity identification method based on dynamic word information fusion
CN111813923A (en) Text summarization method, electronic device and storage medium
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN110737469A (en) Source code similarity evaluation method based on semantic information on functional granularities
CN113065322B (en) Code segment annotation generation method and system and readable storage medium
CN116629211B (en) Writing method and system based on artificial intelligence
Wakchaure et al. A scheme of answer selection in community question answering using machine learning techniques
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium
JP2019144844A (en) Morphological analysis learning device, morphological analysis device, method and program
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114691196A (en) Code defect detection method and device for dynamic language and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant