CN113238797A - Code feature extraction method and system based on hierarchical contrastive learning - Google Patents

Code feature extraction method and system based on hierarchical contrastive learning

Info

Publication number
CN113238797A
CN113238797A (application number CN202110411676.XA)
Authority
CN
China
Prior art keywords
matrix
node
code
ast tree
ast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110411676.XA
Other languages
Chinese (zh)
Inventor
吕晨
王潇
高学剑
吴琼
马正
高曰秀
李季
吕蕾
刘弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202110411676.XA
Publication of CN113238797A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/73 Program documentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a code feature extraction method and system based on hierarchical contrastive learning, comprising the following steps: acquiring code to be processed; parsing the code segment to be processed to generate an AST tree; embedding the AST tree to obtain a feature matrix X, and constructing an adjacency matrix A of the AST tree; establishing a node label according to the level of each node in the AST tree; updating the feature matrix X based on the node labels to obtain a new feature matrix X'; and inputting the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed. The syntactic information of the program is extracted through the AST, the parsed AST nodes are classified according to labels given by their levels to establish a single-label multi-class relation, and the structural information of the program is fully mined, so that the representations generated by the model are more comprehensive and more accurate.

Description

Code feature extraction method and system based on hierarchical contrastive learning
Technical Field
The invention relates to the technical field of code feature extraction, and in particular to a code feature extraction method and system based on hierarchical contrastive learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In existing methods, program syntactic features are usually represented by parsing a code fragment and using the structural characteristics of its Abstract Syntax Tree (AST) to represent the syntactic structure of the source code. However, simply modeling the program's abstract syntax tree cannot comprehensively extract code structure information, and the structural characteristics of the generated AST are not deeply mined, so information is lost. In the graph embedding part, a graph convolution network is often used, enlarging the receptive field by stacking layers to acquire local information, but global information cannot be obtained, so feature extraction is incomplete. In addition, the problem of gradients shrinking or even vanishing in graph networks weakens the expressive power of the model.
Disclosure of Invention
To address the defects of the prior art, the invention provides a code feature extraction method and system based on hierarchical contrastive learning.
In a first aspect, the invention provides a code feature extraction method based on hierarchical contrastive learning.
The code feature extraction method based on hierarchical contrastive learning comprises the following steps:
acquiring code to be processed; parsing the code segment to be processed to generate an AST tree; embedding the AST tree to obtain a feature matrix X, and constructing an adjacency matrix A of the AST tree;
establishing a node label according to the level of each node in the AST tree; updating the feature matrix X based on the node labels to obtain a new feature matrix X';
inputting the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed; wherein the residual self-attention network model comprises a self-attention mechanism model and a graph convolutional neural network model connected in sequence.
In a second aspect, the invention provides a code feature extraction system based on hierarchical contrastive learning.
A code feature extraction system based on hierarchical contrastive learning comprises:
a parsing module configured to: acquire code to be processed; parse the code segment to be processed to generate an AST tree; embed the AST tree to obtain a feature matrix X, and construct an adjacency matrix A of the AST tree;
a matrix update module configured to: establish a node label according to the level of each node in the AST tree; and update the feature matrix X based on the node labels to obtain a new feature matrix X';
a feature extraction module configured to: input the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed; wherein the residual self-attention network model comprises a self-attention mechanism model and a graph convolutional neural network model connected in sequence.
In a third aspect, the invention further provides an electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method according to the first aspect.
In a fourth aspect, the invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, cause the processor to perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the syntactic information of the program is extracted through the AST, the analyzed AST is classified according to the labels given by the node levels, the single-label multi-classification relation is established, and the structural information of the program is fully mined, so that the expression capability of the generated model is more comprehensive and more accurate.
In the graph embedding part, a model combining a residual self-attention mechanism and a graph convolution network is provided to realize global and local feature extraction, global reference is realized while self features are enhanced, residual connection is added to the inside and the outside of a self-attention module of the model, the gradient descent problem is effectively solved, and the expression capability of the model is enhanced.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
fig. 2 is a diagram illustrating an AST tree visualization according to the first embodiment;
fig. 3 is a schematic diagram of a self-attention mechanism module according to a first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a code feature extraction method based on hierarchical contrastive learning.
As shown in fig. 1, the code feature extraction method based on hierarchical contrastive learning includes:
s101: acquiring a code to be processed; analyzing the code segment to be processed to generate an AST tree;
embedding the AST tree to obtain a characteristic matrix X, and constructing an adjacent matrix A of the AST tree;
s102: establishing a node label according to the hierarchy number of each node of the AST tree in the AST tree; updating the feature matrix X based on the node labels to obtain a new feature matrix X';
s103: inputting the new feature matrix X' and the adjacent matrix A into the trained residual self-attention network model to obtain the features of the code to be processed;
wherein the residual self-attention network model comprises: the self-attention mechanism model and the graph convolution neural network model are connected in sequence.
Further, the residual self-attention network model is abbreviated RSGN, short for Residual Self-Attention Graph Network.
Further, the step S101 of parsing the code segment to be processed to generate an AST tree specifically comprises:
parsing the code with JavaParser to generate the corresponding AST tree;
after the code is parsed with JavaParser, two kinds of information are obtained: first, the information represented by the AST tree nodes; second, the node-to-node pointing information, which stores the edges.
The information represented by an AST tree node comprises the specific content of the line of code corresponding to the node and the type of that line of code.
The parsed AST tree may be visualized using an online visualization tool, as shown in fig. 2.
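As an illustration of this parsing step, the sketch below uses Python's built-in ast module as a stand-in for JavaParser (which the patent actually applies to Java code) and collects the two kinds of information described above: node records and parent-to-child edges. The record format (id, node type, node dump) is an assumption for illustration only.

```python
import ast

def parse_to_ast(source):
    """Return (nodes, edges): node records and parent-to-child edges of the AST."""
    tree = ast.parse(source)
    ids = {}
    nodes = []
    for i, node in enumerate(ast.walk(tree)):
        ids[id(node)] = i
        # node information: an id, the node type, and a dump of its content
        nodes.append((i, type(node).__name__, ast.dump(node)))
    edges = []
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((ids[id(node)], ids[id(child)]))  # directed edge parent -> child
    return nodes, edges

nodes, edges = parse_to_ast("def add(a, b):\n    return a + b")
```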
Further, the step S101 of embedding the AST tree to obtain a feature matrix X specifically comprises:
using word embedding, the character string corresponding to each node of the AST tree is embedded to obtain a feature matrix X = {p_1, p_2, ..., p_N}, where p_N denotes the vector representation of the N-th node.
Further, the step S101 of constructing an adjacency matrix A of the AST tree specifically comprises:
constructing the adjacency matrix A according to whether a pointing relation exists between AST tree nodes;
the adjacency matrix A stores the pointing relations between nodes: if a pointing relation exists between two nodes, the corresponding element of A is 1, and otherwise it is 0.
The adjacency matrix A is an N x N matrix.
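A minimal sketch of building X and A from the node and edge lists produced above; the embedding function `embed` and the dimension are placeholders for whatever word-embedding model is used, since the patent does not name one.

```python
import numpy as np

def build_feature_and_adjacency(nodes, edges, embed, dim=128):
    """Build the feature matrix X and the N x N adjacency matrix A."""
    n = len(nodes)
    x = np.zeros((n, dim))
    for i, (_, node_type, text) in enumerate(nodes):
        x[i] = embed(text)          # p_i: word embedding of the node's string
    a = np.zeros((n, n))
    for src, dst in edges:
        a[src, dst] = 1.0           # 1 where a pointing relation exists, else 0
    return x, a
```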
Further, the step S102 of establishing a node label according to the level of each node in the AST tree, and updating the feature matrix X based on the node labels to obtain a new feature matrix X', specifically comprises:
on the basis that each node corresponds to a unique ID, each node is given a label corresponding to the depth of its position in the tree: the AST tree is traversed from the root node, the root node is marked 0, first-layer child nodes are marked 1, second-layer child nodes are marked 2, and so on, with M-th-layer child nodes marked M;
the feature matrix X is updated based on the node labels to obtain a new feature matrix X'.
The label of a node is its level, and the number of classes equals the number of levels of the tree.
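A breadth-first sketch of the labelling rule just described (root = 0, first-layer children = 1, and so on). How the labels are folded into X to form X' is not spelled out in the text, so that step is left open here.

```python
from collections import deque

def depth_labels(n_nodes, edges, root=0):
    """Label every node with its level: root = 0, first-layer children = 1, ..."""
    children = {i: [] for i in range(n_nodes)}
    for src, dst in edges:
        children[src].append(dst)
    labels = [0] * n_nodes
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        labels[node] = depth                  # the node's level is its class label
        for child in children[node]:
            queue.append((child, depth + 1))
    return labels
```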
Further, the step S103 of inputting the new feature matrix X' and the adjacency matrix A into the trained residual self-attention network model to obtain the features of the code to be processed specifically comprises:
inputting the new feature matrix X' into the self-attention mechanism model of the trained residual self-attention network model to obtain an output matrix Z;
jointly inputting the output matrix Z and the adjacency matrix A into the graph convolutional neural network of the trained residual self-attention network model, and outputting the features of the code to be processed.
Further, as shown in fig. 3, the network structure of the self-attention mechanism model includes:
a first convolution layer, a second convolution layer and a third convolution layer which are arranged in parallel;
the input ends of the first convolution layer, the second convolution layer and the third convolution layer are all used for inputting the new feature matrix X';
the output end of the first convolution layer is connected with the input end of the first multiplier;
the output end of the second convolution layer is also connected with the input end of the first multiplier;
the output end of the first multiplier is connected with the input end of the first adder through a softmax function layer;
on the first pass, the input end of the first adder is connected only to the output end of the softmax function layer;
on subsequent passes, the input end of the first adder additionally receives the output value of the softmax function layer from the previous pass;
the output end of the first adder is connected with the input end of the second multiplier;
the input end of the second multiplier is also connected with the output end of the third convolution layer;
the output end of the second multiplier is also connected with the input end of the fourth convolution layer;
the output end of the fourth convolution layer is also connected with the input end of the second adder;
the input end of the second adder is used for inputting a new feature matrix X';
and the output end of the second adder is used for outputting the matrix Z.
Further, the working principle of the self-attention mechanism model comprises:
the feature matrix X' is taken as the input of the first, second and third convolution layers, each of which applies a linear mapping to X' to compute a query, a key and a value, namely Q, K and V:
Q = GCN(X') = X'W_Q
where W_Q is a learnable weight matrix, and K and V are computed in the same way;
then, calculating the similarity between Q and K through the matrix multiplication operation of a first multiplier;
calculating scores through a softmax function layer to obtain weights between 0 and 1, taking these weights as the attention coefficients, and storing the output of the current softmax operation as attn;
on subsequent passes, the output h_attn of the previous softmax function layer is added to attn through the first adder, and the resulting sum is matrix-multiplied with V through the second multiplier;
performing convolution operation on the result obtained by the matrix multiplication operation through a fourth convolution layer to extract global features;
finally, a residual operation adds the convolution result to the original input X' to obtain the output of the self-attention module, which is then input into the graph convolutional neural network.
A residual connection is added inside the self-attention module to counter gradient degradation in the network. Thus, residual connections are added both inside and outside the self-attention mechanism, improving the capacity of the network.
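The following PyTorch sketch mirrors the module described above: 1x1 convolutions stand in for the per-node linear mappings to Q, K and V, the previous pass's softmax output is added back in as the internal residual, and X' is added to the output as the external residual. The layer sizes, the 1x1-kernel choice and detaching the stored attention map are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSelfAttention(nn.Module):
    """Self-attention with an internal residual (previous attention map added
    back in) and an external residual (X' added to the output)."""

    def __init__(self, dim):
        super().__init__()
        self.q_conv = nn.Conv1d(dim, dim, kernel_size=1)    # first convolution layer
        self.k_conv = nn.Conv1d(dim, dim, kernel_size=1)    # second convolution layer
        self.v_conv = nn.Conv1d(dim, dim, kernel_size=1)    # third convolution layer
        self.out_conv = nn.Conv1d(dim, dim, kernel_size=1)  # fourth convolution layer
        self.prev_attn = None                               # softmax output of the previous pass

    def forward(self, x):                                   # x: (batch, dim, N) node features X'
        q, k, v = self.q_conv(x), self.k_conv(x), self.v_conv(x)  # linear maps -> Q, K, V
        scores = torch.bmm(q.transpose(1, 2), k)            # similarity of Q and K (first multiplier)
        attn = F.softmax(scores, dim=-1)                    # weights in [0, 1]
        if self.prev_attn is not None and self.prev_attn.shape == attn.shape:
            attn = attn + self.prev_attn                    # internal residual (first adder)
        self.prev_attn = attn.detach()                      # store attn for the next pass
        out = torch.bmm(v, attn.transpose(1, 2))            # weight V by attention (second multiplier)
        out = self.out_conv(out)                            # fourth conv layer: global features
        return out + x                                      # external residual with X' (second adder)
```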
Further, the network structure of the graph convolutional neural network model comprises:
a first graph convolution layer, a first hidden layer, a second graph convolution layer and a second hidden layer, connected in sequence.
Further, the working principle of the graph convolutional neural network model comprises:
the output matrix Z of the self-attention mechanism model and the N x N matrix A formed by the pointing relations of the nodes are taken as the input of the GCN; the two graph convolution layers of the GCN perform graph embedding on the AST tree to extract features, and the feature vector of the last hidden layer is extracted as the feature vector of each abstract syntax tree node after hierarchical contrastive learning.
The GCN is a neural network layer that propagates information between layers as:
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))
where Ã = A + I is the sum of the adjacency matrix A and the identity matrix I, which introduces self-loops; D̃ is the degree matrix of Ã; H^(l) is the activation matrix of the l-th layer, with H^(0) = X being the initial vector representation at the input layer; σ is a nonlinear activation function; and W^(l) is the weight matrix of the l-th graph convolution layer.
Two graph convolution layers are constructed, using ReLU and Softmax as their activation functions respectively, so the overall forward propagation formula is:
Z = softmax(Â ReLU(Â X W^(0)) W^(1)), where Â = D̃^(-1/2) Ã D̃^(-1/2).
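A compact sketch of this two-layer propagation, written directly from the formulas above; the use of dense tensors and the hidden size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalized_adjacency(a):
    """Compute A_hat = D̃^(-1/2) (A + I) D̃^(-1/2) from the propagation rule above."""
    a_tilde = a + torch.eye(a.size(0))
    d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

class TwoLayerGCN(nn.Module):
    """Two graph-convolution layers with ReLU and Softmax activations."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)   # W^(0)
        self.w1 = nn.Linear(hidden_dim, out_dim, bias=False)  # W^(1)

    def forward(self, h, a):                                  # h: (N, in_dim), a: (N, N)
        a_hat = normalized_adjacency(a)
        h = F.relu(a_hat @ self.w0(h))                        # first graph convolution + ReLU
        return F.softmax(a_hat @ self.w1(h), dim=-1)          # second graph convolution + Softmax
```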
further, the training step of the trained residual self-attention network model comprises:
constructing a residual self-attention network model;
constructing a training set, wherein the training set is a known code of a known extraction characteristic;
analyzing the code segment of the known code to generate an AST tree;
embedding the AST tree to obtain a characteristic matrix X, and constructing an adjacent matrix A of the AST tree;
establishing a node label according to the hierarchy number of each node of the AST tree in the AST tree; updating the feature matrix X based on the node labels to obtain a new feature matrix X';
taking the new feature matrix X' as an input value of the self-attention mechanism model;
taking the output value of the self-attention mechanism model and an adjacent matrix A of the AST tree as the input value of the graph convolution neural network model;
and (3) taking the known extracted features as output values of the graph convolution neural network, training the graph convolution neural network, and stopping training when the loss function reaches the minimum value or the iteration times meet the set requirements to obtain a trained residual self-attention network model.
Wherein the loss functions are:
L_CE = -Σ_{c=1}^{M} y_ic log(p_ic)
L_triplet = [ ||x_a - x_p|| - ||x_a - x_n|| + α ]_+
where M represents the number of classes; y_ic is a 0/1 variable that is 1 if class c is the class of sample i and 0 otherwise; p_ic is the predicted probability that sample i belongs to class c; <a, p, n> is a triplet in which a is the anchor sample, p is a positive sample and n is a negative sample; || · || is the Euclidean distance; α is the minimum margin required between the distance from x_a to x_n and the distance from x_a to x_p; [ · ]_+ takes the enclosed value as the loss when it is greater than zero, and zero otherwise; and θ is the weight of the triplet loss.
It will be appreciated that two loss functions are used to better distinguish details of the input: one is the cross-entropy loss for multi-class classification, and the other is the triplet loss. The triplet loss distinguishes fine details well; in particular, when two inputs are similar, it learns better representations of the two slightly different input vectors, so it performs well in classification tasks.
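A sketch of the two losses combined as described; the margin value, the weight theta and the simple additive combination (cross-entropy plus theta times the triplet term) are assumptions read off the symbol definitions above, not formulas quoted from the patent.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, anchor, positive, negative, alpha=0.2, theta=0.5):
    """Cross-entropy over the level classes plus a margin-based triplet loss."""
    ce = F.cross_entropy(logits, labels)                        # multi-class cross-entropy
    d_pos = F.pairwise_distance(anchor, positive)               # ||x_a - x_p||
    d_neg = F.pairwise_distance(anchor, negative)               # ||x_a - x_n||
    triplet = torch.clamp(d_pos - d_neg + alpha, min=0).mean()  # [ . ]_+ hinge with margin alpha
    return ce + theta * triplet                                 # theta weights the triplet term
```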
The residual self-attention network model combines the advantage of the graph convolutional network, which obtains local information by stacking layers, with the ability of the self-attention mechanism to capture global information, extracting features both locally and globally. To address the problem of gradients shrinking or even vanishing in graph networks, the model adds residual connections both outside and inside the self-attention module, which effectively prevents vanishing gradients and enhances the expressive power of the network.
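Putting the sketches together, a hypothetical training step could look like the following; the random stand-in data, the layer sizes and the naive choice of triplet are illustrative assumptions only, and ResidualSelfAttention, TwoLayerGCN and combined_loss refer to the sketches given earlier.

```python
import torch

n_nodes, dim, n_levels = 20, 128, 6
x_prime = torch.randn(1, dim, n_nodes)                      # new feature matrix X' (stand-in)
adj = (torch.rand(n_nodes, n_nodes) > 0.8).float()          # adjacency matrix A (stand-in)
level_labels = torch.randint(0, n_levels, (n_nodes,))       # node level labels (stand-in)

attention = ResidualSelfAttention(dim)
gcn = TwoLayerGCN(dim, 64, n_levels)
params = list(attention.parameters()) + list(gcn.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(100):
    z = attention(x_prime)                                  # output matrix Z
    out = gcn(z.squeeze(0).t(), adj)                        # graph embedding over Z and A
    # For brevity the GCN's softmax output is used directly as the classifier
    # scores, and an arbitrary (anchor, positive, negative) triple of node
    # vectors is chosen; a real pipeline would select triplets by label.
    loss = combined_loss(out, level_labels, out[0:1], out[1:2], out[2:3])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```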
Example two
The embodiment provides a code feature extraction system based on hierarchical contrastive learning.
A code feature extraction system based on hierarchical contrastive learning comprises:
a parsing module configured to: acquire code to be processed; parse the code segment to be processed to generate an AST tree; embed the AST tree to obtain a feature matrix X, and construct an adjacency matrix A of the AST tree;
a matrix update module configured to: establish a node label according to the level of each node in the AST tree; and update the feature matrix X based on the node labels to obtain a new feature matrix X';
a feature extraction module configured to: input the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed; wherein the residual self-attention network model comprises a self-attention mechanism model and a graph convolutional neural network model connected in sequence.
It should be noted here that the parsing module, the matrix update module and the feature extraction module correspond to steps S101 to S103 in the first embodiment, and these modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A code feature extraction method based on hierarchical contrastive learning, characterized by comprising the following steps:
acquiring code to be processed; parsing the code segment to be processed to generate an AST tree; embedding the AST tree to obtain a feature matrix X, and constructing an adjacency matrix A of the AST tree;
establishing a node label according to the level of each node in the AST tree; updating the feature matrix X based on the node labels to obtain a new feature matrix X';
inputting the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed; wherein the residual self-attention network model comprises a self-attention mechanism model and a graph convolutional neural network model connected in sequence.
2. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
parsing the code segment to be processed to generate an AST tree specifically comprises:
parsing the code with JavaParser to generate the corresponding AST tree;
after the code is parsed with JavaParser, two kinds of information are obtained: first, the information represented by the AST tree nodes; second, the node-to-node pointing information, which stores the edges.
3. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
embedding the AST tree to obtain a feature matrix X specifically comprises:
using word embedding, the character string corresponding to each node of the AST tree is embedded to obtain a feature matrix X = {p_1, p_2, ..., p_N}, where p_N denotes the vector representation of the N-th node.
4. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
constructing an adjacency matrix A of the AST tree specifically comprises:
constructing the adjacency matrix A according to whether a pointing relation exists between AST tree nodes;
the adjacency matrix A stores the pointing relations between nodes: if a pointing relation exists between two nodes, the corresponding element of A is 1, and otherwise it is 0.
5. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
establishing a node label according to the level of each node in the AST tree and updating the feature matrix X based on the node labels to obtain a new feature matrix X' specifically comprises:
on the basis that each node corresponds to a unique ID, each node is given a label corresponding to the depth of its position in the tree: the AST tree is traversed from the root node, the root node is marked 0, first-layer child nodes are marked 1, second-layer child nodes are marked 2, and so on, with M-th-layer child nodes marked M;
the feature matrix X is updated based on the node labels to obtain a new feature matrix X'.
6. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
inputting the new feature matrix X' and the adjacency matrix A into the trained residual self-attention network model to obtain the features of the code to be processed specifically comprises:
inputting the new feature matrix X' into the self-attention mechanism model of the trained residual self-attention network model to obtain an output matrix Z;
jointly inputting the output matrix Z and the adjacency matrix A into the graph convolutional neural network of the trained residual self-attention network model, and outputting the features of the code to be processed.
7. The method of claim 1, wherein, in the hierarchical contrastive learning-based code feature extraction method,
the working principle of the self-attention mechanism model comprises the following steps:
the feature matrix X' is taken as the input of the first, second and third convolution layers, each of which applies a linear mapping to X' to compute a query, a key and a value, namely Q, K and V:
Q = GCN(X') = X'W_Q
where W_Q is a learnable weight matrix, and K and V are computed in the same way;
then, calculating the similarity between Q and K through the matrix multiplication operation of a first multiplier;
calculating scores through a softmax function layer to obtain weights between 0 and 1, taking these weights as the attention coefficients, and storing the output of the current softmax operation as attn;
on subsequent passes, the output h_attn of the previous softmax function layer is added to attn through the first adder, and the resulting sum is matrix-multiplied with V through the second multiplier;
performing a convolution operation on the result of the matrix multiplication through the fourth convolution layer to extract global features;
finally, a residual operation adds the convolution result to the original input X' to obtain the output of the self-attention module, which is input into the graph convolutional neural network;
or, alternatively,
the working principle of the graph convolutional neural network model comprises the following steps:
the output matrix Z of the self-attention mechanism model and the matrix A formed by the pointing relations of the nodes are taken as the input of the GCN; the two graph convolution layers of the GCN perform graph embedding on the AST tree to extract features, and the feature vector of the last hidden layer is extracted as the feature vector of each abstract syntax tree node after hierarchical contrastive learning.
8. A code feature extraction system based on hierarchical contrastive learning, characterized by comprising:
a parsing module configured to: acquire code to be processed; parse the code segment to be processed to generate an AST tree; embed the AST tree to obtain a feature matrix X, and construct an adjacency matrix A of the AST tree;
a matrix update module configured to: establish a node label according to the level of each node in the AST tree; and update the feature matrix X based on the node labels to obtain a new feature matrix X';
a feature extraction module configured to: input the new feature matrix X' and the adjacency matrix A into a trained residual self-attention network model to obtain the features of the code to be processed; wherein the residual self-attention network model comprises a self-attention mechanism model and a graph convolutional neural network model connected in sequence.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
CN202110411676.XA 2021-04-16 2021-04-16 Code feature extraction method and system based on hierarchical contrastive learning Pending CN113238797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110411676.XA CN113238797A (en) 2021-04-16 2021-04-16 Code feature extraction method and system based on hierarchical comparison learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110411676.XA CN113238797A (en) 2021-04-16 2021-04-16 Code feature extraction method and system based on hierarchical comparison learning

Publications (1)

Publication Number Publication Date
CN113238797A true CN113238797A (en) 2021-08-10

Family

ID=77128322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110411676.XA Pending CN113238797A (en) Code feature extraction method and system based on hierarchical contrastive learning

Country Status (1)

Country Link
CN (1) CN113238797A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067349A (en) * 2022-01-12 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Target object processing method and device
CN115268994A (en) * 2022-07-26 2022-11-01 中国海洋大学 Code feature extraction method based on TBCNN and multi-head self-attention mechanism
CN115268994B (en) * 2022-07-26 2023-06-09 中国海洋大学 Code feature extraction method based on TBCNN and multi-head self-attention mechanism
CN117573084A (en) * 2023-08-02 2024-02-20 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573084B (en) * 2023-08-02 2024-04-12 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree

Similar Documents

Publication Publication Date Title
CN113238797A (en) Code feature extraction method and system based on hierarchical contrastive learning
EP4009219A1 (en) Analysis of natural language text in document using hierarchical graph
CN106997474A (en) A kind of node of graph multi-tag sorting technique based on deep learning
CN112364352B (en) Method and system for detecting and recommending interpretable software loopholes
CN108664512B (en) Text object classification method and device
CN113220876B (en) Multi-label classification method and system for English text
CN113591457A (en) Text error correction method, device, equipment and storage medium
US20180365594A1 (en) Systems and methods for generative learning
CN110413319A (en) A kind of code function taste detection method based on deep semantic
CN113641819B (en) Argumentation mining system and method based on multitasking sparse sharing learning
CN112528634A (en) Text error correction model training and recognition method, device, equipment and storage medium
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
WO2020088338A1 (en) Method and apparatus for building recognition model
Okhotin Input-driven languages are linear conjunctive
CN113722477B (en) Internet citizen emotion recognition method and system based on multitask learning and electronic equipment
CN115017907A (en) Chinese agricultural named entity recognition method based on domain dictionary
CN109901869A (en) A kind of computer program classification method based on bag of words
CN114579763A (en) Character-level confrontation sample generation method for Chinese text classification task
CN113378009A (en) Binary neural network quantitative analysis method based on binary decision diagram
CN113515591A (en) Text bad information identification method and device, electronic equipment and storage medium
de Souto et al. Equivalence between ram-based neural networks and probabilistic automata
CN117349186B (en) Program language defect positioning method, system and medium based on semantic flowsheet
CN114911909B (en) Address matching method and device combining deep convolutional network and attention mechanism
JP2023542146A (en) Method for performing address matching and related electronic devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination