CN115129364A - Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network - Google Patents

Publication number
CN115129364A
Authority
CN
China
Prior art keywords
graph
neural network
source code
code
syntax tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210782999.4A
Other languages
Chinese (zh)
Other versions
CN115129364B (en)
Inventor
张磊
郭迪骁
刘亮
陶锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210782999.4A (patent CN115129364B)
Publication of CN115129364A
Application granted
Publication of CN115129364B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, relating to cyberspace security technology. The method comprises: preprocessing source code; constructing an abstract syntax tree of the source code, performing feature selection on the token values of the abstract syntax tree nodes, and adding edges of different types to construct a code feature graph; training a graph matching neural network with the code feature graphs and generating a graph embedding feature vector of the source code; and using a twin neural network to compare two graph embedding feature vectors output by the graph matching neural network and identify whether the two source codes belong to the same author. The method combines abstract syntax tree and graph neural network techniques: it generates the abstract syntax tree of the source code, constructs a source code feature graph by adding edge information and screening token values, generates a graph embedding feature vector of the feature graph using a graph neural network with an attention mechanism, and identifies the author of the source code with the twin neural network.

Description

Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network
Technical Field
The invention relates to the technical field of cyberspace security, and in particular to a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network.
Background
Source code authorship attribution identifies the author of a given piece of code. As the volume of malware grows and mutation techniques evolve, malware authors are creating large numbers of malware variants. To better address this problem, a method for verifying the identity of a malicious code author is necessary. Source code authorship attribution mainly comprises author feature extraction and identity recognition. The prior art generally applies plain-text natural language processing, extracting an author's programming features with an N-gram language model; this approach is time-consuming, the extracted code features are not robust, and the recognition accuracy is low. To extract the identity features of source code accurately and efficiently, some methods introduce a hand-crafted feature set consisting of layout, lexical and syntactic features. These features can resist code formatting and obfuscation techniques, but the lexical and syntactic features are extracted separately even though the two are closely related in source code. Separate extraction can yield wrong or useless features and reduce recognition accuracy; for example, common keywords such as the function print appear in all samples, and extracting them does not improve recognition accuracy. The number of features extracted in this way is also large, so the approach does not scale well to large author data sets. Another class of methods uses neural networks for feature extraction: the keywords of the source code are converted into word frequency vectors, and a neural network then learns and extracts features. However, these methods process the source code with techniques such as term frequency-inverse document frequency (TF-IDF) or word-to-vector conversion (Word2Vec), and ignore the syntactic features of control flow and data flow. Moreover, a traditional deep neural network cannot handle the non-Euclidean structure of source code well: such a structure has arbitrary size and complex topology, lacks the spatial locality of an image, and has no fixed node ordering. Traditional deep learning operations such as convolution cannot be applied to it directly, so it can be trained on only after being converted into a Euclidean structure such as a word frequency vector, and some features are lost in the conversion, which makes the extracted source code identity features inaccurate.
Disclosure of Invention
The invention aims to provide a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, to solve the following problems of prior-art methods for recognizing source code identity features: wrong or useless features are extracted because closely related features are extracted separately, the number of extracted features is large, and the extracted source code identity features are inaccurate.
The invention solves the problems through the following technical scheme:
A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network comprises the following steps:
step S10, preprocessing the source code;
step S20, constructing an abstract syntax tree of the source code, and constructing a code feature graph by performing feature selection on the token values of the abstract syntax tree nodes and adding edges of different types;
step S30, training a graph matching neural network with the code feature graphs, and generating a graph embedding feature vector of the source code;
step S40, using a twin neural network to compare the two graph embedding feature vectors output by the graph matching neural network, and identifying whether the two source codes belong to the same author.
The step S10 includes deleting comments in the source code and internal functions called by the source code, and normalizing line breaks, tab characters, and spaces in the source code;
the step S20 includes:
step S21, modifying the AST generation module to add variable name information, obtaining a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate an abstract syntax tree of the source code, where AST stands for Abstract Syntax Tree;
step S22, screening the features of the abstract syntax tree using the term frequency-inverse document frequency (TF-IDF) algorithm;
step S23, adding edges representing control flow and data flow to preserve the user programming features of the source code;
step S24, adding reverse edges to all the edges to generate a source code feature graph, converting the edges and nodes into embedding vectors through a hash algorithm, and initializing the embedding vectors of the nodes and edges of the source code feature graph.
The edges Parent, Child, Sibling, Token, Nextuse, If, While and For, which represent control flow and data flow, are added to preserve the user programming features of the source code. The control flow edges capture the user's preference for the If, While and For structures, since users with different programming habits use different structure types with different frequencies. The data flow edges preserve the information of key tokens and the flow direction of variables, mainly the number of function parameters, the parameter types, and how variables are passed. Reverse edges are added to all edges to generate the source code feature graph.
The step S30 specifically includes: the graph matching neural network performs cross-graph learning and updating on input pairs of labeled code feature graphs in the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise; finally, low-dimensional graph embedding feature vectors of the source code are generated through graph pooling.
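The labeled pair format (G1, G2, label) can be illustrated with a minimal Python sketch. The function name `make_labeled_pairs`, the author names, and the string placeholders standing in for code feature graphs are all assumptions made for illustration; the source does not state how pairs are sampled, so this sketch simply enumerates every pair.

```python
from itertools import combinations

def make_labeled_pairs(graphs_by_author):
    """Return (G1, G2, label) triples: label 1 for the same author, -1 otherwise."""
    # Flatten the mapping into (author, graph) samples, keeping insertion order.
    samples = [(author, graph)
               for author, graphs in graphs_by_author.items()
               for graph in graphs]
    pairs = []
    for (author_1, g1), (author_2, g2) in combinations(samples, 2):
        pairs.append((g1, g2, 1 if author_1 == author_2 else -1))
    return pairs

# Hypothetical placeholders; real inputs would be code feature graphs.
demo = {"alice": ["A1", "A2"], "bob": ["B1"]}
pairs = make_labeled_pairs(demo)
```

Enumerating all pairs yields far more negative pairs than positive ones as the number of authors grows, so a practical pipeline would probably balance or subsample them.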
The fingerprint identity recognition system based on the abstract syntax tree and the graph neural network comprises a source code feature graph generation module, a graph matching neural network and a twin neural network, wherein:
the source code feature graph generation module is used for preprocessing the source code, generating the abstract syntax tree of the code with the proposed data-enhanced AST, and constructing the code feature graph of the source code by adding bidirectional feature edges and screening feature tokens;
the graph matching neural network is used for learning and updating the model parameters from the input labeled code feature graphs with an added attention mechanism and, after learning is finished, performing graph pooling on the input graph structure and outputting a one-dimensional code feature vector;
the twin neural network is used for generating two function embedding vectors from the two input code feature vectors, calculating the similarity between them by cosine distance, and judging whether the two source codes belong to the same author.
The graph matching neural network consists of an encoder, propagation layers and an aggregation layer, wherein the encoder maps the node and edge features to initial node and edge vectors using a single multilayer perceptron (MLP), the propagation layers map a set of node representations to new node representations through multiple rounds of learning with the attention mechanism, and the aggregation layer takes the set of graph node representations as input and calculates the feature graph embedding.
The twin neural network consists of two identical base networks and a matcher, wherein the base networks abstract high-level feature vectors from the input feature vectors and the matcher calculates the similarity score of the two high-level feature vectors; each base network consists of four fully connected layers, and the matcher consists of a subtraction layer, a fully connected layer and a softmax layer.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The method combines abstract syntax tree and graph neural network techniques. It generates the abstract syntax tree of the source code using the data-enhanced AST and constructs a source code feature graph by adding edge information and screening tokens, which greatly reduces the number of features that are useless for author identification and adds data flow and control flow features. It generates the graph embedding feature vector of the feature graph using a graph neural network with an attention mechanism, which makes training more targeted by focusing on the programming features that are most effective for author identification, and it identifies the author of the source code through a twin neural network.
(2) The method extracts not only the syntactic and semantic features of the source code but also data flow and control flow features; with the same authors and code samples, its recognition accuracy is superior to that of other methods.
(3) The feature extraction method for constructing the code feature graph can be extended to other programming languages and has good extensibility.
(4) The invention achieves de-anonymization of source code and can be applied to fields such as tracing the authors of malicious code, copyright disputes, and plagiarism detection.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a block diagram of a source code feature map generation module;
FIG. 3 is a diagram of an abstract syntax tree derived from source code;
FIG. 4 is a block diagram of a graph matching neural network;
FIG. 5 is a block diagram of a twin neural network;
FIG. 6 is a flow chart of a code author identification process.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
With reference to fig. 1, a fingerprint identity recognition method based on an abstract syntax tree and a graph neural network includes:
step S10: preprocessing the source code to eliminate the influence of different integrated development environments, converting the source code into a tree structure based on an abstract syntax tree, and constructing a code feature graph by performing feature selection on the token values of the tree nodes and adding different types of edges;
in this embodiment, all archived data sets of the Google Code Jam global programming competition from 2008 to 2020 are collected and used, including key information such as the contest problems, the participants, and the participants' submitted code and code types. Google Code Jam (GCJ) is an international programming competition sponsored by Google; because its participants include programmers of different education levels from almost all countries and regions, it simulates the real scenario of de-anonymization well. Step S10 specifically includes:
step S11: in this embodiment, two preprocessing modes are applied to the source code. One extracts the layout features of the source code; the other normalizes the source code by deleting its comments and the internal functions it calls and by normalizing the line breaks, tab characters, spaces and the like in the code. Considering that different programmers in real programming environments use different integrated development environments, and that differing code layouts and comments are also part of a programmer's identity features, a method for extracting code layout features is designed. The extracted code layout features are shown in table 1:
TABLE 1 Layout features
[Table 1 is an image in the original publication and is not reproduced in this text.]
Experimental results show that, for programmer identification at the three scales of 50, 100 and 1000 programmers, combining the layout features with the AST-based graph pooling features gives essentially the same recognition accuracy as using the AST-based graph pooling features alone, while increasing the required training time and training cost. This embodiment therefore chooses to delete the comments in the source code and the internal functions called by the source code in advance and to normalize the line breaks, tab characters, spaces and the like in the code, which eliminates the influence of different integrated development environments, reduces the number of features and improves computational efficiency;
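The normalization chosen above can be sketched as follows, assuming for illustration that the input is Python source. The function name `normalize_source` and the tab width of 4 are hypothetical, and the sketch covers only comment removal and whitespace normalization; deleting the internal functions called by the source code is not attempted. The standard `tokenize` module is used to locate comments so that a `#` inside a string literal is left alone.

```python
import io
import tokenize

def normalize_source(source, tab_width=4):
    """Drop comments, expand tabs, strip trailing spaces and blank lines."""
    lines = source.split("\n")
    # A comment always runs to the end of its line, so cut the line there.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start
            lines[row - 1] = lines[row - 1][:col]
    cleaned = []
    for line in lines:
        line = line.expandtabs(tab_width).rstrip()
        if line:                      # blank lines carry only layout information
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"

demo = "x = 1  # set x\n\n# note\nif x:\n\ty = '# not a comment'   \n"
normalized = normalize_source(demo)
```

Because the cleanup is line-based, blank lines and tabs inside multi-line string literals would also be altered; a production version would need to protect those spans.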
step S12: modifying the AST generation module to add variable name information, obtaining a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate an abstract syntax tree of the source code;
because programmer fingerprint features are to be extracted, features such as the variable names, function names, and number of function parameters in the source code need to be captured. This embodiment therefore adopts a new method for generating the AST, referred to as the data-enhanced AST (data-augmented AST). A conventional AST generation module only generates a corresponding class for a variable, a constant, and so on; for example, in the ast module of Python, only the variable class Name is returned for a variable a, and the variable name a serves only as a value of that class. As shown in fig. 3, a simple input source code is on the left and the converted abstract syntax tree structure is on the right. The source code on the left defines an add function, which calculates and returns the value of parameter a plus parameter b, and then calls the add function with the arguments 2 and 3. The root node of the AST on the right is a Module. The left subtree of the root node is the function definition part, comprising the function definition and return nodes, with the parameters a and b as child nodes; the right subtree of the root node is the function call part, comprising the Call and Funcname nodes and the input data nodes Num(2) and Num(3).
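The idea of attaching identifier information to AST nodes can be illustrated with Python's standard `ast` module on the add example described above. This is only a sketch of the concept, not the modified AST generation module of the embodiment: the function name `enhanced_labels` and the `Kind:value` label format are assumptions. Note also that current Python versions represent the literals 2 and 3 as `Constant` nodes rather than the `Num` nodes shown in fig. 3.

```python
import ast

def enhanced_labels(source):
    """Label each AST node with its class name plus its identifier, if any."""
    labels = []
    for node in ast.walk(ast.parse(source)):    # breadth-first, root first
        kind = type(node).__name__
        if isinstance(node, ast.Name):
            labels.append(f"{kind}:{node.id}")          # variable name
        elif isinstance(node, ast.FunctionDef):
            labels.append(f"{kind}:{node.name}")        # function name
        elif isinstance(node, ast.arg):
            labels.append(f"{kind}:{node.arg}")         # parameter name
        elif isinstance(node, ast.Constant):
            labels.append(f"{kind}:{node.value!r}")     # literal value
        else:
            labels.append(kind)
    return labels

demo = "def add(a, b):\n    return a + b\n\nadd(2, 3)\n"
labels = enhanced_labels(demo)
```

A plain class-name labeling would map both `a` and `b` to `Name`; keeping the identifier is what lets naming habits survive into the feature graph.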
Step S13: the abstract syntax tree (AST) structure generated by the invention contains all user-defined variable names, so with a large sample data set the number of tokens becomes excessive: a data set of 4000 code samples from 100 authors produces approximately 20000 token values. To prevent overfitting and reduce training cost, the TF-IDF algorithm is used to screen features, and after multiple experiments the number of retained token values is finally set to 8000. TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
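A minimal sketch of TF-IDF token screening is given below. The source does not say how per-document scores are combined into a single ranking, so ranking each token by its highest TF-IDF score over all documents is an assumption, as are the function name `select_tokens` and the toy token lists.

```python
import math
from collections import Counter

def select_tokens(docs, k):
    """Keep the k tokens with the highest TF-IDF score in any document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    best = {}
    for doc in docs:
        if not doc:
            continue
        for tok, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[tok])
            best[tok] = max(best.get(tok, 0.0), tf * idf)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(best, key=lambda tok: (-best[tok], tok))[:k]

docs = [["print", "foo", "foo"], ["print", "bar"], ["print", "baz", "baz", "baz"]]
top_two = select_tokens(docs, 2)
```

In this toy corpus `print` occurs in every document, so its inverse document frequency is zero and it ranks last, which mirrors the remark in the Background that such common keywords do not help recognition.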
Step S14: the edges Parent, Child, Sibling, Token, Nextuse, If, While and For, which represent control flow and data flow, are added to preserve the user programming features of the source code. For the control flow edges, this embodiment selects only the three control structures If, While and For, which are common to C/C++, Python and Java, so that programmer fingerprint recognition can work across programming languages. In addition, the control flow edges capture the user's preferences among these structures when writing code; users with different programming habits use different structure types with different frequencies, which is important for programmer fingerprint features. The data flow edges preserve the information of key tokens and the flow direction of variables, mainly features such as the number of function parameters, the parameter types, and how variables are passed. The Parent edge connects a non-root node to its parent node; the Child edge connects a non-leaf node to its child nodes; the Sibling edge connects a node to its sibling node, in left-to-right order; the Token edge connects a terminal node to another terminal node; the Nextuse edge connects a variable node to the node of its next occurrence. In this embodiment, reverse edges are added to all edges and their effect is tested: on the data set of 100 programmers used in step S11, feature graphs with and without reverse edges are constructed and each is used to train a graph matching neural network (GMN).
Experiments show that with reverse edges the recognition accuracy and the loss value stabilize at around epoch 20, whereas without reverse edges they stabilize at around epoch 30. The result shows that reverse edges speed up the convergence of the graph matching model: on the same data set, the model using reverse edges needs fewer epochs to reach its highest accuracy, which improves training efficiency. In this embodiment, the source code feature graph is generated through the above steps. As shown in fig. 2, for the If structure in the abstract syntax tree, a Condtrue feature edge is added between the condition Condition and the Then branch executed when the condition is satisfied, together with a reverse ForNext feature edge, and a Condfalse feature edge is added between the condition Condition and the Else branch executed when the condition is not satisfied; for the While loop structure in the abstract syntax tree, a WhileExec feature edge and a reverse WhileNext feature edge are added between the loop condition Condition and the loop Body; for the For loop structure in the abstract syntax tree, a ForExec feature edge and a reverse ForNext feature edge are added between the loop control ForControl and the loop Body; for sequentially executed structures in the abstract syntax tree, a NextStmt feature edge is added between the sequentially executed Statement nodes.
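The construction of structural and data flow edges with their reverse edges can be sketched on a Python AST as follows. This is a simplified illustration under stated assumptions: the function name `build_feature_graph`, the `Rev` prefix for reverse edge types, and the triple format `(source, target, type)` are hypothetical, and the Token and control flow edges (Condtrue, WhileExec, ForExec, NextStmt) are omitted for brevity.

```python
import ast

def build_feature_graph(source):
    """Return (labels, edges); each edge is a (src, dst, type) triple."""
    labels, edges, names = [], [], []

    def visit(node):
        idx = len(labels)
        labels.append(type(node).__name__)
        if isinstance(node, ast.Name):
            names.append((node.lineno, node.col_offset, node.id, idx))
        prev = None
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.expr_context):
                continue                      # Load/Store markers add no structure
            c = visit(child)
            edges.append((idx, c, "Child"))
            edges.append((c, idx, "Parent"))
            if prev is not None:
                edges.append((prev, c, "Sibling"))   # left-to-right order
            prev = c
        return idx

    visit(ast.parse(source))

    # Nextuse: link each variable occurrence to its next occurrence in source order.
    last_seen = {}
    for _, _, name, idx in sorted(names):
        if name in last_seen:
            edges.append((last_seen[name], idx, "Nextuse"))
        last_seen[name] = idx

    # Add one reverse edge per forward edge, with its own edge type.
    reverse = [(dst, src, "Rev" + kind) for src, dst, kind in edges]
    return labels, edges + reverse

demo = "x = 1\ny = x + 2\nwhile x < y:\n    x = x + 1\n"
labels, edges = build_feature_graph(demo)
```

Giving each reverse edge a distinct type doubles the number of edge types as well as the number of edges, which matches the description of reverse edges in Example 2.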
Step S15: converting the edges and nodes into embedding vectors through hashing, and initializing the embedding vectors of the nodes and edges of the source code feature graph.
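The source does not name the hash algorithm or the vector dimension, so the following is only a sketch of one way to derive a deterministic initial vector from a node or edge label. The choice of SHAKE-256, the dimension of 8, the scaling to the range [-1, 1], and the function name `hash_embedding` are all assumptions.

```python
import hashlib

def hash_embedding(label, dim=8):
    """Map a node or edge label to a fixed-length vector with values in [-1, 1]."""
    # SHAKE-256 can emit any number of bytes, so one byte feeds each dimension.
    raw = hashlib.shake_256(label.encode("utf-8")).digest(dim)
    return [byte / 127.5 - 1.0 for byte in raw]

while_vec = hash_embedding("While")
for_vec = hash_embedding("For")
```

Hashing avoids maintaining a vocabulary table: the same label always yields the same starting vector, and unseen labels need no special handling.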
Step S20: training a graph matching neural network with an attention mechanism on the corpus of source code feature graphs generated in step S10. The graph matching network performs cross-graph learning and updating on input labeled sample pairs in the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise. The attention mechanism enables the graph matching neural network to find the most important node and edge features in the programmer feature graphs during cross-graph training. The attention value is calculated as follows:
[The two attention formulas and their inline symbols are images in the original publication and are not reproduced in this text.]
The quantities in the formulas are: the current hidden state of node i at time step t; the GRU unit; the vector similarity function, which is cosine similarity in this embodiment; the attention weight; the difference between the hidden state of a node of the first graph and that of its nearest neighbor in the other graph; the node set of the first graph and the node set of the second graph; the current hidden state of node j at time step t; and the hidden state at time step t of the nearest-neighbor node in the second graph that corresponds to node i in the first graph. Here i and j denote nodes, and the pair (i, j) denotes an edge.
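Because the formula images are not available, the following Python sketch assumes the cross-graph attention of the published graph matching network formulation: the attention weight of node j in the second graph for node i in the first graph is a softmax over similarity scores, and the cross-graph matching vector is the attention-weighted sum of the differences between the two hidden states. The function names are hypothetical, and the GRU update that consumes the matching vector is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity, the similarity function named in this embodiment."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cross_graph_attention(h1, h2):
    """For each node i of graph 1 return (weights over graph 2, matching vector)."""
    results = []
    for hi in h1:
        scores = [cosine(hi, hj) for hj in h2]
        top = max(scores)                           # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]         # softmax attention weights
        # Matching vector: sum over j of weight_j * (h_i - h_j).
        diff = [sum(w * (hi[d] - hj[d]) for w, hj in zip(weights, h2))
                for d in range(len(hi))]
        results.append((weights, diff))
    return results

weights, diff = cross_graph_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])[0]
```

When a node has an identical counterpart in the other graph and no competing nodes, the matching vector is zero, so the vector measures how far a node is from its best match across the two graphs.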
Finally, low-dimensional graph embedding feature vectors of the source code are generated through global graph pooling. As shown in FIG. 4, a pair of feature graphs is input, and a multilayer perceptron (MLP) maps the original node features and edge features to initial feature vectors. The propagation layers then map a set of node feature vectors to a new set of node feature vectors. When updating a node feature vector, the propagation layer uses not only the information of adjacent nodes within each graph but also a cross-graph matching vector, which describes how well a node in the current graph matches the nodes in the other graph. An attention mechanism is added: the attention weight (an image in the original publication) represents the matching degree between a node in graph i and the corresponding node in graph j and serves as the update weight of the node vector. After a certain number of propagation rounds, the aggregation layer takes the node set of the feature graph as input and calculates the graph-level embedding feature vector through graph pooling. In this embodiment, the graph matching neural network (GMN) and the gated graph neural network (GGNN) are compared and the graph matching neural network is finally selected; the comparison results are shown in table 2:
TABLE 2 Accuracy of different graph neural networks
[Table 2 is an image in the original publication and is not reproduced in this text.]
Step S30: inputting two graphs output by the graph matching network into a twin neural network designed in the embodiment, learning the two input feature vectors again by the twin neural network, and finally identifying and judging whether two source codes belong to the same author or not by calculating cosine distances, as shown in fig. 5;
the graph matching neural network in the embodiment is composed of an encoder, a propagation layer and an aggregation layer. As shown in fig. 4, the graph matching neural network is to compute and aggregate graph nodes to output graph-embedded feature vectors.
In this embodiment, the twin neural network is composed of three parts, two identical basic networks and a matcher, and as shown in fig. 5, the two identical basic networks are responsible for abstracting a high-level feature vector from input feature vectors, and the matcher is responsible for calculating a similarity score of the two high-level feature vectors.
Example 2:
With reference to fig. 1, the fingerprint identity recognition system based on the abstract syntax tree and the graph neural network includes a source code feature graph generation module, a graph matching neural network generator and a twin neural network discriminator, wherein:
the source code feature graph generation module is used for preprocessing the source code, generating an abstract syntax tree, and constructing a code feature graph based on the abstract syntax tree. First, the comments in the source code are removed and the tab characters, line breaks, spaces and the like in the code are normalized, which prevents different development environments from affecting author identification, reduces the number of features and improves computational efficiency. Second, the preprocessed source code is parsed with the proposed data-enhanced AST, which extracts information such as the variable names in the source code, and an abstract syntax tree is generated. Based on the abstract syntax tree, the token values are screened with TF-IDF to reduce the feature dimensionality, and the code feature graph is then constructed by adding 8 different types of edges. The Parent edge connects a non-root node to its parent node; the Child edge connects a non-leaf node to its child nodes; the Sibling edge connects a node to its sibling node, in left-to-right order; the Token edge connects a terminal node to another terminal node; the Nextuse edge connects a variable node to the node of its next occurrence. The data flow edges focus on extracting the programmer's functions, variable names, key tokens and data flow direction. To compensate for the fact that the AST contains only syntactic features, 3 edge types for control structures found in most programming languages are added to extract the control flow information of the user's code, namely If edges, While edges and For edges; the edge construction is shown in fig. 2. In programmer fingerprint recognition, the different control flow edges capture the user's preferences among these structures, since users with different programming habits use different structure types with different frequencies. Finally, a reverse edge is introduced for every edge of every type, which doubles the types and number of edges and increases the information entropy; the reverse edges help information propagate faster in the graph neural network.
The graph matching neural network generator is used for taking the feature graph generated by the source code feature graph generation module as input, learning the characteristics of the programmer's fingerprint, and outputting graph embedding feature vectors;
and the twin neural network discriminator is used for taking two feature vectors generated by the graph matching neural network generator (for example, the feature vector of anonymous code and the feature vector of a known programmer) as input, and outputting a similarity score of the features through the base network (a deep neural network with 4 fully connected layers) and the discriminator. The similarity score of the two feature vectors indicates whether the anonymous source code is likely to have been written by the known programmer. Since this architecture does not depend on the number of classes in the dataset, it can easily be extended to new programmers.
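The cross-graph learning performed by the graph matching neural network generator can be sketched as a single attention step between the node vectors of two graphs. The text does not give the exact formulas, so the following is only an illustration under stated assumptions: dot-product similarity as the matching degree, softmax attention weights, and mean pooling for the graph embedding. The function names `cross_graph_messages` and `mean_pool` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_graph_messages(h1, h2):
    """For each node vector in h1, attend over the nodes of h2.

    Returns mu_i = h_i - sum_j a_ij * h_j, where a_ij is the softmax of the
    (assumed) dot-product matching degree between node i and node j.
    """
    messages = []
    for hi in h1:
        weights = softmax([sum(a * b for a, b in zip(hi, hj)) for hj in h2])
        attended = [sum(w * hj[k] for w, hj in zip(weights, h2))
                    for k in range(len(hi))]
        messages.append([a - b for a, b in zip(hi, attended)])
    return messages

def mean_pool(node_vectors):
    """Assumed graph pooling: average the node vectors into one embedding."""
    n = len(node_vectors)
    return [sum(v[k] for v in node_vectors) / n
            for k in range(len(node_vectors[0]))]
```

When two graphs contain identical nodes, the messages shrink towards zero, which is what lets the network represent how well a pair of graphs matches.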
The graph matching neural network generator and the twin neural network discriminator need to be trained in advance; their network structures are shown in fig. 4 and fig. 5, respectively. The training platform is the Python machine learning library PyTorch, and the parameters are set as follows: the training batch size is set to 64, and the number of training rounds (epochs) i is set to 200. For the graph matching neural network generator, the number of graph matching network layers is set to 4, the graph pooling embedding vector dimension is 400, the learning rate is 0.001, and the Adam (adaptive moment estimation) optimizer is used. For the twin neural network discriminator, the base network is a deep neural network (DNN) with 4 hidden layers of 400, 300, 200 and 100 neurons respectively, the dropout rate is set to 0.2, and the ReLU function is used as the activation function.
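For reference, the training parameters listed above can be collected in one configuration object. This is a sketch only: the class name `TrainingConfig`, its field names and the helper `base_layer_shapes` are hypothetical, and feeding the 400-dimensional graph embedding directly into the first hidden layer is an assumption drawn from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class TrainingConfig:
    # Shared training settings
    batch_size: int = 64
    epochs: int = 200
    # Graph matching neural network generator
    gmn_layers: int = 4
    graph_embedding_dim: int = 400
    learning_rate: float = 0.001
    optimizer: str = "adam"
    # Twin neural network discriminator (base network)
    base_hidden_sizes: Tuple[int, ...] = (400, 300, 200, 100)
    dropout_rate: float = 0.2
    activation: str = "relu"

def base_layer_shapes(cfg: TrainingConfig) -> List[Tuple[int, int]]:
    """(input, output) sizes of each fully connected layer in the base network."""
    dims = (cfg.graph_embedding_dim,) + cfg.base_hidden_sizes
    return list(zip(dims, dims[1:]))
```

With the default values, `base_layer_shapes(TrainingConfig())` gives four fully connected layers that reduce the 400-dimensional embedding to 100 dimensions.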
When the twin neural network discriminator judges a pair of input feature vectors, as shown in fig. 6, a pair of source code feature vectors generated by the graph matching neural network generator is taken as input, and a new pair of feature vectors is generated through the base sub-networks of the twin neural network. The discriminator then calculates the cosine distance of the two vectors to obtain a similarity score and judges whether the score is greater than a preset threshold: if so, the two source codes belong to the same author; if not, they belong to different authors.
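The final decision step can be sketched in a few lines of Python. The text treats the cosine measure as a similarity score (higher means more alike), and the sketch follows that reading. The function names and the threshold value 0.5 are hypothetical; the text only states that the threshold is preset.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def same_author(u, v, threshold=0.5):
    """True if the similarity score exceeds the (assumed) preset threshold."""
    return cosine_similarity(u, v) > threshold
```

In use, `u` and `v` would be the pair of feature vectors produced by the base sub-networks for the anonymous code and for code of a known programmer.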
The system processes the source code with the proposed feature graph construction method to generate the feature graph structure of the source code. The method is extensible across programming languages and extracts features comprehensively: it covers not only the traditional syntactic features but also data flow, control flow and other features. A graph neural network and a twin neural network are then used to learn the feature graph and to analyze and judge the generated graph feature vectors so as to identify the author. Compared with neural networks such as CNNs and RNNs, the graph neural network is better suited to the non-Euclidean tree and graph structures of the source code abstract syntax tree and feature graph, and can learn and extract features more effectively. In addition, combining it with the deep neural network greatly reduces the feature dimensionality and improves the speed and accuracy of source code author identification.
Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.

Claims (7)

1. A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network is characterized by comprising the following steps:
step S10, preprocessing the source code;
step S20, constructing an abstract syntax tree of the source code, and constructing a code feature graph by performing feature selection on the token values of the abstract syntax tree nodes and adding edges of different types;
step S30, training a graph matching neural network with the code feature graphs, and generating graph embedding feature vectors of the source code;
and step S40, calculating and judging two graph embedding feature vectors output by the graph matching neural network by adopting the twin neural network, and identifying whether the two source codes belong to the same author.
2. The fingerprint identity recognition method based on an abstract syntax tree and a graph neural network of claim 1, characterized in that said step S10 includes deleting comments in the source code and internal functions called by the source code, and normalizing line feed characters, tabs and spaces in the source code.
3. The method for fingerprint identification based on abstract syntax tree and graph neural network of claim 1, wherein the step S20 comprises:
step S21, modifying the AST generating module, adding variable name information to obtain data enhanced AST, analyzing the source code by using the data enhanced AST, and generating the abstract syntax tree of the source code;
step S22, screening the features of the abstract syntax tree by using the term frequency-inverse document frequency (TF-IDF) algorithm;
step S23, adding edges representing control flow and data flow to reserve the user programming characteristics of the source code;
and step S24, adding reverse edges to all the edges to generate a source code feature graph, converting the edges and the nodes into embedded vectors through a hash algorithm, and initializing the nodes of the source code feature graph and the embedded vectors of the edges.
4. The method for fingerprint identification based on abstract syntax tree and graph neural network of claim 1, wherein the step S30 specifically comprises:
the graph matching neural network performs cross-graph learning and updating on input labeled code feature graph pairs in the format (G1, G2, label), wherein the label is 1 if the two code feature graphs belong to the same author and -1 otherwise; finally, low-dimensional graph embedding feature vectors of the source code are generated through graph pooling.
5. The fingerprint identity recognition system based on the abstract syntax tree and the graph neural network is characterized by comprising a source code feature graph generation module, a graph matching neural network and a twin neural network, wherein:
the source code feature graph generation module is used for preprocessing a source code, generating a code abstract syntax tree by using the data-enhanced AST (abstract syntax tree), and constructing a code feature graph of the source code by adding bidirectional feature edges and screening feature token values;
the graph matching neural network is used for updating and learning the parameters of the model according to the input labeled code feature graphs, adding an attention mechanism, calculating the matching degree of corresponding nodes in a pair of graphs as the weight of node updating, enabling the graph neural network to automatically learn the node feature information, performing a graph pooling embedding operation on the input graph structure after learning is finished, and outputting a one-dimensional code feature vector;
and the twin neural network is used for generating two function embedding vectors according to the two input code feature vectors, calculating the similarity between the two code feature vectors through the cosine distance, and judging whether the two source codes belong to the same author.
6. The fingerprint identity recognition system based on an abstract syntax tree and a graph neural network of claim 5, characterized in that the graph matching neural network is composed of three parts, an encoder, a propagation layer and an aggregation layer, wherein the encoder maps edge and node features to initial node and edge vectors using separate multi-layer perceptrons (MLPs), the propagation layer maps a set of nodes to new node representations through multiple rounds of learning and an attention mechanism, and the aggregation layer computes the feature graph embedding using the set of graph nodes as input.
7. The system of claim 5, wherein the twin neural network comprises two identical base networks and a matcher, the two identical base networks are responsible for abstracting high-level feature vectors from the input feature vectors, and the matcher is responsible for calculating a similarity score of the two high-level feature vectors; each base network consists of four fully connected layers, and the matcher consists of a subtraction layer, a fully connected layer and a softmax normalization layer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210782999.4A CN115129364B (en) 2022-07-05 2022-07-05 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Publications (2)

Publication Number Publication Date
CN115129364A (en) 2022-09-30
CN115129364B (en) 2023-04-18

Family

ID=83380950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210782999.4A Active CN115129364B (en) 2022-07-05 2022-07-05 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Country Status (1)

Country Link
CN (1) CN115129364B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975392A (en) * 2016-04-29 2016-09-28 国家计算机网络与信息安全管理中心 Duplicated code detection method and device based on abstract syntax tree
CN107169358A (en) * 2017-05-24 2017-09-15 中国人民解放军信息工程大学 Code homology detection method and its device based on code fingerprint
CN108446540A (en) * 2018-03-19 2018-08-24 中山大学 Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN109445834A (en) * 2018-10-30 2019-03-08 北京计算机技术及应用研究所 Fast comparison method for program code similarity based on abstract syntax tree
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 Code comment generation method based on program analysis and recurrent neural network
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 A kind of detection method and device of code similarity
CN110990273A (en) * 2019-11-29 2020-04-10 中国银行股份有限公司 Clone code detection method and device
US20210004210A1 (en) * 2019-07-01 2021-01-07 X Development Llc Learning and using programming styles
CN112286575A (en) * 2020-10-20 2021-01-29 杭州云象网络技术有限公司 Intelligent contract similarity detection method and system based on graph matching model
CN113157917A (en) * 2021-03-15 2021-07-23 西北大学 OpenCL-based optimized classification model establishing and optimized classification method and system
CN113312268A (en) * 2021-07-29 2021-08-27 北京航空航天大学 Intelligent contract code similarity detection method
CN113961241A (en) * 2021-11-02 2022-01-21 南京大学 Code clone detection method based on GAT (graph attention network) graph neural network model
CN114547619A (en) * 2022-01-11 2022-05-27 扬州大学 Vulnerability repairing system and method based on tree

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10528731B1 (en) * 2017-09-21 2020-01-07 Area 1 Security, Inc. Detecting malicious program code using similarity of hashed parsed trees
US10656940B1 (en) * 2019-02-04 2020-05-19 Architecture Technology Corporation Systems, devices, and methods for source code generation from binary files
CN112394973B (en) * 2020-11-23 2024-03-12 山东理工大学 Multi-language code plagiarism detection method based on pseudo-twin network
CN113360915B (en) * 2021-06-09 2023-09-26 扬州大学 Intelligent contract multi-vulnerability detection method and system based on source code diagram representation learning
CN114327483B (en) * 2021-12-31 2024-10-18 华中科技大学 Graph tensor neural network model building method and source code semantic recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
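WENHAN WANG, GE LI, BO MA, XIN XIA, ZHI JIN: "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree", 《HTTPS://ARXIV.ORG/ABS/2002.08653V1》 *
WU PENG (吴鹏): "Research on Homology Determination Technology for Polymorphic Software Code" (多形态软件代码同源判定技术研究), 《China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology (Monthly), 2022, No. 01》 *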

Also Published As

Publication number Publication date
CN115129364B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant