CN115129364B - Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network - Google Patents

Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Info

Publication number
CN115129364B
Authority
CN
China
Prior art keywords
graph
neural network
source code
syntax tree
abstract syntax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210782999.4A
Other languages
Chinese (zh)
Other versions
CN115129364A (en)
Inventor
张磊
郭迪骁
刘亮
陶锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210782999.4A
Publication of CN115129364A
Application granted
Publication of CN115129364B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, relating to cyberspace security technology. The method comprises: preprocessing source code; constructing an abstract syntax tree of the source code, performing feature selection on the token values of the tree nodes, and adding edges of different types to construct a code feature graph; training a graph matching neural network on the code feature graphs to generate a graph embedding feature vector of the source code; and using a twin (Siamese) neural network to compare two graph embedding feature vectors output by the graph matching neural network and determine whether the two source codes belong to the same author. The method combines the abstract syntax tree with graph neural network technology: it generates the abstract syntax tree of the source code, constructs a source code feature graph by adding edge information and screening token values, generates a graph embedding feature vector of the feature graph with a graph neural network that includes an attention mechanism, and identifies the author of the source code through the twin neural network.

Description

Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network
Technical Field
The invention relates to the technical field of cyberspace security, and in particular to a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network.
Background
Source code authorship attribution identifies the author of a given piece of code. As the volume of malware grows and mutation techniques evolve, malware authors are creating large numbers of variants. To address this problem, a method for verifying the identity of a malicious code author is needed. Source code authorship attribution mainly comprises author feature extraction and identity recognition. The prior art generally applies plain-text natural language processing, extracting an author's programming features with an N-gram language model; this approach is time-consuming, the extracted code features are not robust, and the recognition accuracy is low. To extract identity features more accurately and efficiently, some methods introduce a hand-crafted feature set consisting of layout, lexical and syntactic features. These features can resist code formatting and obfuscation, but the lexical and syntactic features are extracted separately even though they are closely related in source code, so wrong or useless features may be extracted and recognition accuracy drops. For example, common keywords such as the function print appear in all samples, and extracting them does not improve accuracy. The number of features extracted in this way is also large, so the approach does not scale well to large author data sets.
Other methods process source code with term frequency-inverse document frequency (TF-IDF), Word2Vec or similar techniques, which ignore the syntactic features of control flow and data flow. Traditional deep neural networks also cannot handle the non-Euclidean structure of source code well: the structure is of arbitrary size, its topology is complex, it lacks the spatial locality of an image, and it has no fixed node order. Operations such as convolution from traditional deep learning therefore cannot be applied to source code directly. The source code can be trained on only after it is converted into a Euclidean structure such as a word frequency vector, and some features are lost in the conversion, which makes the extracted identity features of the source code inaccurate.
Disclosure of Invention
The invention aims to provide a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, to solve the following problems of prior-art methods for recognizing source code identity features: wrong or useless features extracted because closely related features are separated, a large number of extracted features, and inaccurate source code identity features.
The invention solves the problems through the following technical scheme:
A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network comprises the following steps:
S10, preprocessing the source code;
S20, constructing an abstract syntax tree of the source code, and constructing a code feature graph by performing feature selection on the token values of the tree nodes and adding edges of different types;
S30, training a graph matching neural network on the code feature graphs to generate a graph embedding feature vector of the source code;
S40, using a twin neural network to compare two graph embedding feature vectors output by the graph matching neural network, and identifying whether the two source codes belong to the same author.
Step S10 comprises deleting the comments in the source code and the internal functions called by the source code, and normalizing the line feed characters, tab characters and spaces in the source code.
Step S20 comprises:
S21, modifying the AST (abstract syntax tree) generation module to add variable name information, obtaining a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate the abstract syntax tree of the source code;
S22, screening the features of the abstract syntax tree with the term frequency-inverse document frequency (TF-IDF) algorithm;
S23, adding edges that represent control flow and data flow, to preserve the user programming features of the source code;
S24, adding reverse edges to all edges to generate the source code feature graph, converting the edges and nodes into embedding vectors through a hash algorithm, and thereby initializing the node and edge embedding vectors of the source code feature graph.
In step S23, the edges Parent, Child, Sibling, Token, NextUse, If, While and For are added to represent control flow and data flow and to preserve the user programming features of the source code. The control flow edges capture the user's preference among these control structures, since users with different programming habits use different structure types with different frequencies. The data flow edges retain key token information and the flow direction of variables, mainly the number of function parameters, the parameter types and the passing of variables. Reverse edges are then added to all edges to generate the source code feature graph.
Step S30 specifically comprises: the graph matching neural network performs cross-graph learning and updating on input pairs of labeled code feature graphs. Each labeled pair has the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise. Finally, a low-dimensional graph embedding feature vector of the source code is generated through graph pooling.
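The (G1, G2, label) pair format can be illustrated with a minimal Python sketch. The function name `make_labeled_pairs`, the representation of a sample as an (author, graph) tuple, and the choice to enumerate every unordered pair are assumptions for illustration; the patent does not state how pairs are sampled.

```python
from itertools import combinations
from typing import Any, List, Tuple


def make_labeled_pairs(samples: List[Tuple[str, Any]]) -> List[Tuple[Any, Any, int]]:
    """Turn (author, graph) samples into (G1, G2, label) training triples.

    label is 1 when both graphs come from the same author and -1 otherwise.
    Enumerating all unordered pairs is an assumption of this sketch.
    """
    pairs = []
    for (author_a, graph_a), (author_b, graph_b) in combinations(samples, 2):
        label = 1 if author_a == author_b else -1
        pairs.append((graph_a, graph_b, label))
    return pairs
```

For a large corpus, enumerating all pairs grows quadratically, so a real pipeline would probably sample pairs instead.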
A fingerprint identity recognition system based on an abstract syntax tree and a graph neural network comprises a source code feature graph generation module, a graph matching neural network and a twin neural network, wherein:
the source code feature graph generation module preprocesses the source code, generates the abstract syntax tree of the code with the proposed data-enhanced AST, and constructs the code feature graph of the source code by adding bidirectional feature edges and screening feature tokens;
the graph matching neural network learns and updates the model parameters from the input labeled code feature graphs with an added attention mechanism, and after learning performs a graph pooling embedding operation on the input graph structure to output a one-dimensional code feature vector;
the twin neural network generates two function embedding vectors from the two input code feature vectors, calculates the similarity between them using the cosine distance, and judges whether the two source codes belong to the same author.
The graph matching neural network consists of an encoder, a propagation layer and an aggregation layer. The encoder maps the edge and node features to initial node and edge vectors with a single multilayer perceptron (MLP); the propagation layer maps a set of node representations to new node representations through multiple rounds of learning and the attention mechanism; the aggregation layer takes the set of graph nodes as input and computes the feature graph embedding.
The twin neural network consists of two identical base networks and a matcher. The two base networks abstract high-level feature vectors from the input feature vectors, and the matcher calculates the similarity score of the two high-level feature vectors. Each base network consists of four fully connected layers, and the matcher consists of a subtraction layer, a fully connected layer and a softmax layer.
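The matcher structure described above (subtraction layer, fully connected layer, softmax layer) can be sketched as a forward pass in plain Python. The function name `matcher_forward`, the two-class output, and the weight and bias values passed in are hypothetical; the patent does not disclose the layer sizes of the matcher or its trained parameters.

```python
import math
from typing import List, Sequence


def matcher_forward(u: Sequence[float], v: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Subtraction layer -> fully connected layer -> softmax.

    weights has one row per output class (hypothetical values supplied by the caller).
    """
    # Subtraction layer: element-wise difference of the two high-level vectors.
    diff = [a - b for a, b in zip(u, v)]
    # Fully connected layer: one logit per output class.
    logits = [sum(w * d for w, d in zip(row, diff)) + b
              for row, b in zip(weights, bias)]
    # Softmax layer, shifted by the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With identical inputs the difference vector is zero, so the output depends only on the bias; this makes a convenient sanity check.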
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The method combines the abstract syntax tree with graph neural network technology. It generates the abstract syntax tree of the source code with the data-enhanced AST and constructs the source code feature graph by adding edge information and screening tokens, which greatly reduces the number of features that are useless for author identification and adds data flow and control flow features. It generates the graph embedding feature vector of the feature graph with a graph neural network that has an attention mechanism, which makes training more targeted by focusing on the programming features that are most effective for author identification, and it identifies the author of the source code through the twin neural network.
(2) The method extracts not only the syntactic and semantic features of the source code but also its data flow and control flow features. With the same authors and code samples, its recognition accuracy is superior to that of other methods.
(3) The feature extraction method used to construct the code feature graph can be extended to other programming languages and therefore has good extensibility.
(4) The invention achieves de-anonymization of source code and can be applied to fields such as tracing the authors of malicious code, copyright disputes and plagiarism detection.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a block diagram of a source code feature map generation module;
FIG. 3 is a diagram of an abstract syntax tree derived from source code;
FIG. 4 is a block diagram of a graph matching neural network;
FIG. 5 is a block diagram of a twin neural network;
FIG. 6 is a flow chart of a code author identification process.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
Referring to FIG. 1, a fingerprint identity recognition method based on an abstract syntax tree and a graph neural network comprises:
Step S10: preprocessing the source code to eliminate the influence of different integrated development environments, converting the source code into a tree structure based on the abstract syntax tree, and constructing a code feature graph by performing feature selection on the token values of the tree nodes and adding edges of different types.
This embodiment collects all data sets archived by Google Code Jam (GCJ), an international programming competition sponsored by Google, for the years 2008 to 2020, including the contest problems, the participants, the code they submitted and the code types. GCJ is used because its participants include programmers of different education levels from almost all countries and regions, so the data simulates real-world de-anonymization scenarios well. Step S10 specifically comprises:
Step S11: in this embodiment, two preprocessing modes were applied to the source code. One extracts the layout features of the source code. The other normalizes the source code: it deletes the comments and the called internal functions, and normalizes the line feed characters, tab characters, spaces and the like. Because programmers in real programming environments use different integrated development environments, and because code layout and comments are also part of a programmer's identity features, a method for extracting code layout features was designed. The extracted layout features are shown in Table 1:
Table 1. Layout features
[Table 1 is an image in the original publication and is not reproduced in this text.]
Experimental results show that, for programmer identification at three scales (50, 100 and 1000 programmers), combining the layout features with the AST-based graph pooling features gives essentially the same recognition accuracy as using the AST-based graph pooling features alone, while increasing the training time and training cost. This embodiment therefore deletes the comments in the source code and the internal functions it calls in advance, and normalizes the line feed characters, tab characters, spaces and the like, which eliminates the influence of different integrated development environments, reduces the number of features and improves computational efficiency.
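The normalization chosen in step S11 can be sketched as follows for C-style source code. This is a minimal sketch under stated assumptions: the function name `normalize_source` is hypothetical, the regular expressions do not account for comment markers inside string literals, and the removal of called internal functions is omitted because the patent does not describe how it is done.

```python
import re

# Assumed C-style comment syntax (/* ... */ and // ...); not taken from the patent.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalize_source(code: str) -> str:
    """Delete comments and normalize line feeds, tabs and spaces."""
    # Unify line feed characters from different editors and platforms.
    code = code.replace("\r\n", "\n").replace("\r", "\n")
    # Delete block comments first, then line comments.
    code = _BLOCK_COMMENT.sub("", code)
    code = _LINE_COMMENT.sub("", code)
    lines = []
    for line in code.split("\n"):
        # Tabs become spaces, runs of spaces collapse, edges are trimmed.
        line = re.sub(r" +", " ", line.replace("\t", " ")).strip()
        if line:  # drop lines that are empty after normalization
            lines.append(line)
    return "\n".join(lines)
```

Collapsing indentation discards layout information on purpose, which matches the embodiment's decision to drop layout features.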
Step S12: modifying the AST generation module to add variable name information, obtaining a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate the abstract syntax tree of the source code.
Because programmer fingerprint features must be extracted, features such as the variable names, function names and number of function parameters in the source code need to be captured. This embodiment therefore adopts a new AST generation method, referred to as the data-enhanced AST (data-augmented AST). A conventional AST generation module only produces a corresponding class for a variable, a constant and so on; for example, the AST module of Python returns only the variable class Name for a variable a, and the variable name a serves only as a value of that class. As shown in FIG. 3, the left side is a simple input source code and the right side is the converted abstract syntax tree. The source code on the left defines an add function that computes and returns the sum of parameter a and parameter b, and then calls add with the arguments 2 and 3. The root node of the AST on the right is Module. The left subtree of the root is the function definition, which comprises the function, entries and return nodes, with the parameters a and b as child nodes. The right subtree of the root is the function call, which comprises the call, funcname, and input data Num(2) and Num(3) nodes.
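One way to picture the data-enhanced AST is to attach identifier and constant values to the node labels produced by Python's standard `ast` module. The sketch below is an assumption about how such labels could look, not the patent's modified generation module; the names `node_label` and `enhanced_ast` and the `Kind:value` label format are hypothetical. It assumes Python 3.8 or later, where literals are parsed as `ast.Constant`.

```python
import ast
from typing import List, Optional, Tuple


def node_label(node: ast.AST) -> str:
    """Node class name, extended with the identifier or constant it carries."""
    kind = type(node).__name__
    if isinstance(node, ast.FunctionDef):
        return f"{kind}:{node.name}"
    if isinstance(node, ast.arg):
        return f"{kind}:{node.arg}"
    if isinstance(node, ast.Name):
        return f"{kind}:{node.id}"
    if isinstance(node, ast.Constant):
        return f"{kind}:{node.value!r}"
    return kind


def enhanced_ast(source: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return node labels and (parent, child) index pairs in pre-order."""
    labels: List[str] = []
    edges: List[Tuple[int, int]] = []

    def visit(node: ast.AST, parent: Optional[int]) -> None:
        if isinstance(node, ast.expr_context):
            return  # Load/Store markers carry no author-specific information
        index = len(labels)
        labels.append(node_label(node))
        if parent is not None:
            edges.append((parent, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return labels, edges
```

Applied to the add example of FIG. 3, the labels keep the names add, a and b that a class-only AST would drop.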
Step S13: the abstract syntax tree generated by the invention contains all user-defined variable names, so for a large sample data set the number of tokens becomes excessive: a data set of 4000 code files from 100 authors yields approximately 20000 token values. To prevent overfitting and reduce training cost, the TF-IDF algorithm is used to screen the features, and after multiple experiments the number of retained token values was set to 8000. TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to the frequency with which it appears across the corpus.
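The screening in step S13 can be sketched with a plain TF-IDF computation. The function name `select_tokens`, the unsmoothed idf formula log(N / df), and the rule of ranking each token by its highest score in any document are assumptions of this sketch; the patent only states that TF-IDF is used and that 8000 token values are kept.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def select_tokens(docs: Sequence[Sequence[str]], k: int) -> List[str]:
    """Keep the k tokens with the highest TF-IDF score in any document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    best: Dict[str, float] = {}
    for doc in docs:
        counts = Counter(doc)
        for token, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[token])
            best[token] = max(best.get(token, 0.0), tf * idf)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]
```

A token that occurs in every document, such as print in the background discussion, gets an idf of zero and therefore ranks last.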
Step S14: the edges Parent, Child, Sibling, Token, NextUse, If, While and For, which represent control flow and data flow, are added to preserve the user programming features of the source code. For the control flow edges, this embodiment selects only the three control structures If, While and For, which are common to C/C++, Python and Java, so that programmer fingerprint recognition can work across programming languages. The control flow edges also capture the user's preferences among these structures when writing code; users with different programming habits use different structure types with different frequencies, which is important for characterizing a programmer's fingerprint. The data flow edges retain key token information and the flow direction of variables, mainly features such as the number of function parameters, the parameter types and the passing of variables. The Parent edge connects a non-root node to its parent node; the Child edge connects a non-leaf node to its child nodes; the Sibling edge connects a node to its sibling node, from left to right; the Token edge connects a terminal node to another terminal node; the NextUse edge connects a variable node to the node of its next occurrence. In this embodiment, reverse edges are added to all edges. To test their effect, feature graphs with and without reverse edges were constructed on the 100-programmer data set of step S11, and each was used to train the graph matching neural network (GMN).
Experiments show that the recognition accuracy and the loss value converge at about epoch 20 for the feature graphs with reverse edges, and at about epoch 30 for the feature graphs without them. The reverse edges therefore speed up convergence of the graph matching model: on the same data set, the model with reverse edges needs fewer epochs to reach its highest accuracy, which improves training efficiency. The source code feature graph is generated through the above steps, as shown in FIG. 2. For an If structure in the abstract syntax tree, a CondTrue feature edge is added between the condition Condition and the Then branch executed when the condition is satisfied, together with a reverse ForNext feature edge, and a CondFalse feature edge is added between Condition and the Else branch executed when the condition is not satisfied. For a While loop, a WhileExec feature edge and a reverse WhileNext feature edge are added between the loop condition Condition and the loop body Body. For a For loop, a ForExec feature edge and a reverse ForNext feature edge are added between the loop control ForControl and the loop body Body. For sequentially executed statements, a NextStmt feature edge is added between consecutive Statement nodes.
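The structural and data flow edges of step S14, together with the reverse edges, can be sketched on a small tree. The sketch covers only the Parent, Child, Sibling and NextUse edges; the function names, the `Rev` prefix for reverse edge types, and the assumption that node ids are numbered in source order are all hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str]  # (source node, target node, edge type)


def build_feature_edges(children: Dict[int, List[int]],
                        var_names: Dict[int, str]) -> List[Edge]:
    """Add Child, Parent, Sibling and NextUse edges to a tree.

    children maps a node id to its ordered child ids; var_names maps the ids of
    variable nodes to their names. Node ids are assumed to follow source order.
    """
    edges: List[Edge] = []
    for parent, kids in children.items():
        for kid in kids:
            edges.append((parent, kid, "Child"))
            edges.append((kid, parent, "Parent"))
        # Sibling edges run from left to right between adjacent children.
        for left, right in zip(kids, kids[1:]):
            edges.append((left, right, "Sibling"))
    # NextUse edges link each variable occurrence to its next occurrence.
    last_seen: Dict[str, int] = {}
    for node in sorted(var_names):
        name = var_names[node]
        if name in last_seen:
            edges.append((last_seen[name], node, "NextUse"))
        last_seen[name] = node
    return edges


def add_reverse_edges(edges: List[Edge]) -> List[Edge]:
    """Double the edge set by adding a reverse edge of a distinct type for each edge."""
    return edges + [(dst, src, "Rev" + kind) for src, dst, kind in edges]
```

Giving each reverse edge its own type keeps direction information available to the network, consistent with the statement that the reverse edges double both the types and the number of edges.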
Step S15: converting the edges and nodes into embedding vectors through hashing, thereby initializing the node and edge embedding vectors of the source code feature graph.
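The hashing in step S15 can be pictured as a deterministic mapping from a node or edge label to a fixed-length vector. The patent does not name the hash function or the vector size, so the use of SHAKE-256 from Python's `hashlib`, the function name `hash_embedding` and the default dimension of 8 are assumptions of this sketch.

```python
import hashlib
from typing import List


def hash_embedding(label: str, dim: int = 8) -> List[float]:
    """Map a node or edge label to a deterministic vector with values in [0, 1]."""
    # SHAKE-256 is an extendable-output hash, so it can emit exactly dim bytes.
    digest = hashlib.shake_256(label.encode("utf-8")).digest(dim)
    return [byte / 255.0 for byte in digest]
```

The same label always yields the same initial vector, so no vocabulary table has to be stored alongside the graphs.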
Step S20: the source code feature graph corpus generated in step S10 is used to train a graph matching neural network with an attention mechanism. The graph matching network performs cross-graph learning and updating on input labeled sample pairs. Each labeled pair of code feature graphs has the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise. The attention mechanism lets the graph neural network find the most important node and edge features of the programmer feature graph during cross-graph training. The attention value is calculated as follows:
[The two attention formulas and their symbols are images in the original publication and are not reproduced in this text.]
The quantities in the formulas are defined as follows. The first is the current hidden state of node i at time step t, and the hidden state of node j at time step t is defined in the same way. The update is performed by a GRU unit. The vector similarity function is cosine similarity in this embodiment. The cross-graph term represents the difference between the hidden state of a node in the first graph and that of its nearest-neighbor node in the other graph, and the attention weight is applied to this difference. Node i belongs to the node set of the first graph and node j belongs to the node set of the second graph. The last quantity is the hidden state, at time step t, of the nearest-neighbor node in the second graph that corresponds to node i in the first graph. In the formulas, i and j refer to nodes, and the pair (i, j) refers to an edge.
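The attention formulas survive only as image references in this text, so the following Python sketch assumes the commonly published graph matching network form rather than the patent's exact equations: the attention weights are a softmax over the similarities between one node of the first graph and all nodes of the second graph, and the cross-graph term is the node's hidden state minus the attention-weighted sum of the other graph's hidden states. Cosine similarity is used as the similarity function, as the text states; the function names are hypothetical and the GRU update itself is omitted.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_graph_match(h_i: Sequence[float],
                      others: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Attention weights over the other graph's nodes and the matching vector.

    h_i is the hidden state of one node in the first graph; others holds the
    hidden states of all nodes in the second graph (assumed non-empty).
    """
    scores = [cosine(h_i, h_j) for h_j in others]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted summary of the second graph, as seen from node i.
    attended = [sum(w * h_j[d] for w, h_j in zip(weights, others))
                for d in range(len(h_i))]
    # Matching vector: how far node i is from its soft nearest neighbor.
    mu = [a - b for a, b in zip(h_i, attended)]
    return weights, mu
```

In the assumed formulation, the matching vector is passed to the node update together with the messages from neighbors inside the same graph.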
Finally, a low-dimensional graph embedding feature vector of the source code is generated through global graph pooling. As shown in FIG. 4, a pair of feature graphs is input, and a multilayer perceptron (MLP) maps the original node features and edge features to initial feature vectors. The propagation layer then maps a set of node feature vectors to a new set of node feature vectors. When updating a node feature vector, the layer uses not only the information of adjacent nodes within the same graph but also a cross-graph matching vector, which describes how well a node in the current graph matches the nodes in the other graph. An attention mechanism is added: the attention weight (a symbol shown as an image in the original publication) represents the degree of matching between a node in graph i and the corresponding node in graph j, and also serves as the update weight of the node vector. After a certain number of propagation rounds, the aggregation layer takes the node set of the feature graph as input and computes the graph-level embedding feature vector through graph pooling. In this embodiment, the graph matching neural network (GMN) and the gated graph neural network (GGNN) were compared, and the graph matching neural network was finally selected; the comparison results are shown in Table 2:
Table 2. Accuracy of different graph neural networks
[Table 2 is an image in the original publication and is not reproduced in this text.]
Step S30: the two graph embedding feature vectors output by the graph matching network are input into the twin neural network designed in this embodiment. The twin neural network learns from the two input feature vectors again and finally determines, by calculating the cosine distance, whether the two source codes belong to the same author, as shown in FIG. 5.
The graph matching neural network in this embodiment consists of an encoder, a propagation layer and an aggregation layer. As shown in FIG. 4, it computes over and aggregates the graph nodes to output graph embedding feature vectors.
In this embodiment, the twin neural network consists of three parts: two identical base networks and a matcher. As shown in FIG. 5, the two base networks abstract high-level feature vectors from the input feature vectors, and the matcher calculates the similarity score of the two high-level feature vectors.
Example 2:
with reference to fig. 1, the fingerprint identification system based on abstract syntax tree and graph neural network includes a source code feature diagram generation module, a graph matching neural network generator and a twin neural network discriminator, wherein:
the source code feature map generation module is used for generating an abstract syntax tree by preprocessing a source code and constructing a code feature map based on the abstract syntax tree, firstly eliminating comments in one source code and then normalizing tab characters, line feed characters, spaces and the like in the code, so that the influence of different code development environments on the identification of the identity of an author is prevented, the feature number is reduced, and the calculation efficiency is improved; secondly, the preprocessed source code is processed by using the proposed data enhancement AST, information such as a source code variable name and the like is extracted, an abstract syntax tree is generated, based on the abstract syntax tree, a token value is screened by using TF-IDF, characteristic dimensionality is reduced, and then a code characteristic graph is constructed by adding 8 different types of edges, wherein a Parent edge is connected with a non-root node and a Parent node of the root edge; the Child edge connects the non-leaf node and its Child nodes; the sitting edge connects a node and a brother node thereof, and the sequence is from left to right; the Token edge is connected with the terminal node and the other terminal node; the Nextuse edge connects variable nodes and nodes appearing next time, information of functions, variable names, key tokens and data flow directions of programmers is extracted from the data flow edge in emphasis, in order to make up For the defect that AST only contains grammatical features, 3 edges of loop structure types suitable For most programming languages are added to extract control flow information of user codes, namely While edges, for edges and While edges, the edge construction is shown in figure 2, different control flow edges contain preferences of users on the loops in programmer fingerprint identification, loop types and frequencies used by users with different programming 
habits are different, and finally, for all types of edges, respective backward edges are introduced, so that the types and the number of the edges are doubled, the information entropy is increased, and the backward edges are beneficial to faster information propagation in the graph neural network.
The graph matching neural network generator is used for outputting graph embedding feature vectors and learning the characteristics of the fingerprints of the programmers according to the feature graph generated by the source code feature graph generating module as input;
and the twin neural network discriminator takes as input two feature vectors generated by the graph matching neural network generator (for example, the feature vector of an anonymous code sample and the feature vector of a known programmer) and outputs a similarity score for them through the base network (a deep neural network with 4 fully connected layers) and the discriminator. The score indicates whether the anonymous source code is likely to have been written by the known programmer. Because this architecture does not depend on the number of classes in the dataset, it can easily be extended to new programmers.
The graph matching neural network generator and the twin neural network discriminator must be trained in advance; their network structures are shown in fig. 4 and fig. 5, respectively. The training platform is the Python machine learning library PyTorch, with the following parameters: the batch size is 64 and the number of training rounds i is 200. For the graph matching neural network generator, the number of graph matching network layers is 4, the graph pooling embedding vector dimension is 400, the learning rate is 0.001, and the Adam (adaptive moment estimation) optimizer is used. For the twin neural network discriminator, the base network is a deep neural network (DNN) with 4 hidden layers of 400, 300, 200 and 100 neurons respectively, the dropout rate is 0.2, and ReLU is used as the activation function.
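As a rough sense of scale, the training parameters above can be collected in one place and used to count the trainable parameters of the base network. This is only a sketch: the dictionary keys and the `dense_parameter_count` helper are hypothetical, and the count assumes that the base network input has the same dimension as the graph pooling embedding (400) and that every fully connected layer has a bias term, neither of which the text states explicitly.

```python
# Hyperparameters as stated in the text; the key names are illustrative only.
config = {
    "batch_size": 64,
    "training_rounds": 200,
    "gmn_layers": 4,
    "graph_embedding_dim": 400,
    "learning_rate": 0.001,
    "optimizer": "Adam",
    "base_hidden_sizes": [400, 300, 200, 100],
    "dropout_rate": 0.2,
    "activation": "ReLU",
}


def dense_parameter_count(input_dim, hidden_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for fan_out in hidden_sizes:
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
        fan_in = fan_out
    return total


# Assumption: the base network consumes the 400-dimensional graph embedding.
base_params = dense_parameter_count(
    config["graph_embedding_dim"], config["base_hidden_sizes"]
)
print(base_params)  # 361000
```

Under these assumptions the base network has 160400 + 120300 + 60200 + 20100 = 361000 parameters, and because the two branches of a twin network share weights, this count applies once rather than twice.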
The process by which the twin neural network discriminator judges a pair of input feature vectors is shown in fig. 6. A pair of source code feature vectors generated by the graph matching neural network generator is taken as input, and a new pair of feature vectors is generated by the base sub-network of the twin neural network. The discriminator then calculates the cosine distance between the two vectors to obtain a similarity score and checks whether the score is greater than a preset threshold: if so, the two source codes belong to the same author; if not, they belong to different authors.
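The decision step can be sketched in a few lines. Because a larger score means "same author" in the text, the sketch treats the score as cosine similarity rather than a distance. The function names and the threshold value of 0.5 are assumptions made for illustration; the text only says the threshold is preset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def same_author(u, v, threshold=0.5):
    """Return True when the similarity score exceeds the preset threshold.

    The default threshold is a placeholder, not a value from the text.
    """
    return cosine_similarity(u, v) > threshold


# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(same_author([1.0, 0.0], [2.0, 0.0]))  # True
print(same_author([1.0, 0.0], [0.0, 1.0]))  # False
```

In practice the threshold would be tuned on held-out pairs, since it trades false matches against missed matches.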
The system processes the source code with the feature graph construction method proposed herein to generate a feature graph structure of the source code. The method is extensible across programming languages and extracts features comprehensively: it covers not only traditional syntactic features but also data flow, control flow and similar features. A graph neural network and a twin neural network then learn from the feature graph and analyze the generated graph feature vectors to identify the author. Compared with neural networks such as CNNs and RNNs, a graph neural network is better suited to the non-Euclidean tree and graph structures of a source code abstract syntax tree and feature graph, and can learn and extract features more effectively. In addition, combining it with a deep neural network greatly reduces the feature dimensionality and improves the speed and accuracy of source code author identification.
Although the invention has been described herein with reference to illustrative embodiments, these are merely preferred embodiments of the invention, and the invention is not limited to them; many other modifications and embodiments can be devised by those skilled in the art, and these will fall within the spirit and scope of the principles of this disclosure.

Claims (4)

1. A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network, characterized by comprising the following steps:
step S10, preprocessing the source code;
step S20, constructing an abstract syntax tree of the source code, wherein the abstract syntax tree structure contains all user-defined variable names, and constructing a code feature graph by performing feature selection on the token values of the abstract syntax tree nodes and adding different types of edges, specifically comprising the following steps:
S21, modifying an abstract syntax tree (AST) generation module to add variable name information and obtain a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate the abstract syntax tree of the source code;
S22, screening the features of the abstract syntax tree using the term frequency-inverse document frequency (TF-IDF) algorithm;
S23, adding edges representing control flow and data flow to preserve the user programming characteristics of the source code;
S24, adding reverse edges to all edges to generate a source code feature graph, converting the edges and nodes into embedding vectors through a hash algorithm, and initializing the embedding vectors of the nodes and edges of the source code feature graph;
step S30, training the graph matching neural network with the code feature graph and generating the graph embedding feature vector of the source code, specifically comprising the following steps:
the graph matching neural network performs cross-graph learning and updating on labeled input pairs of code feature graphs, wherein a labeled pair has the format (G1, G2, label) and the label is 1 if the two code feature graphs belong to the same author and -1 otherwise, and finally a low-dimensional graph embedding feature vector of the source code is generated through graph pooling;
and step S40, using a twin neural network to compute on and judge the two graph embedding feature vectors output by the graph matching neural network, and identifying whether the two source codes belong to the same author.
2. The fingerprint identity recognition method based on an abstract syntax tree and a graph neural network of claim 1, wherein step S10 comprises deleting comments in the source code and internal functions called by the source code, and normalizing line feeds, tabs and spaces in the source code.
3. A fingerprint identity recognition system based on an abstract syntax tree and a graph neural network, for implementing the fingerprint identity recognition method based on an abstract syntax tree and a graph neural network of claim 1, comprising a source code feature graph generation module, a graph matching neural network and a twin neural network, wherein:
the source code feature graph generation module is used for preprocessing the source code, generating a code abstract syntax tree using the data-enhanced AST, and constructing a code feature graph of the source code by adding bidirectional feature edges and screening feature token values;
the graph matching neural network is used for learning and updating the model parameters from the labeled input code feature graphs, with an added attention mechanism that calculates the matching degree of corresponding nodes in a pair of graphs as the weight for node updates, so that the graph neural network learns the node feature information automatically; after learning is finished, it performs a graph pooling embedding operation on the input graph structure and outputs a one-dimensional code feature vector;
the twin neural network is used for generating two function embedding vectors from the two input code feature vectors, calculating the similarity between the two code feature vectors by cosine distance, and judging whether the two source codes belong to the same author;
the graph matching neural network consists of an encoder, a propagation layer and an aggregation layer, wherein the encoder maps edge and node features to initial node and edge vectors using a single multi-layer perceptron (MLP), the propagation layer maps a set of nodes to new node representations through multiple rounds of learning and the attention mechanism, and the aggregation layer takes the set of graph nodes as input to compute the feature graph embedding.
4. The fingerprint identity recognition system based on an abstract syntax tree and a graph neural network of claim 3, wherein the twin neural network consists of two identical base networks and a matcher; the two identical base networks are responsible for abstracting high-level feature vectors from the input feature vectors, and the matcher is responsible for calculating the similarity score of the two high-level feature vectors; each base network consists of four fully connected layers, and the matcher consists of a subtraction layer, a fully connected layer and a softmax normalization layer.
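The matcher structure named in claim 4 (subtraction layer, fully connected layer, softmax layer) can be sketched as follows. This is an illustrative reading, not the patented implementation: the `matcher` and `softmax` function names, the two-way output and the weight values are all assumptions, and trained weights would replace the placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matcher(u, v, weights, biases):
    """Subtraction layer, one fully connected layer, then softmax."""
    diff = [a - b for a, b in zip(u, v)]  # subtraction layer
    logits = [
        sum(w * d for w, d in zip(row, diff)) + bias  # fully connected layer
        for row, bias in zip(weights, biases)
    ]
    return softmax(logits)  # normalized scores that sum to 1


# Placeholder weights for a 2-dimensional input and an assumed 2-way output.
weights = [[1.0, -1.0], [-1.0, 1.0]]
biases = [0.0, 0.0]
scores = matcher([0.2, 0.7], [0.2, 0.7], weights, biases)
```

With identical inputs the difference vector is zero, so the output depends only on the biases; here both are zero and the two scores are equal.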

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210782999.4A CN115129364B (en) 2022-07-05 2022-07-05 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210782999.4A CN115129364B (en) 2022-07-05 2022-07-05 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Publications (2)

Publication Number Publication Date
CN115129364A CN115129364A (en) 2022-09-30
CN115129364B true CN115129364B (en) 2023-04-18

Family

ID=83380950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210782999.4A Active CN115129364B (en) 2022-07-05 2022-07-05 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Country Status (1)

Country Link
CN (1) CN115129364B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10528731B1 (en) * 2017-09-21 2020-01-07 Area 1 Security, Inc. Detecting malicious program code using similarity of hashed parsed trees
US10656940B1 (en) * 2019-02-04 2020-05-19 Architecture Technology Corporation Systems, devices, and methods for source code generation from binary files
CN112286575A (en) * 2020-10-20 2021-01-29 杭州云象网络技术有限公司 Intelligent contract similarity detection method and system based on graph matching model
CN112394973A (en) * 2020-11-23 2021-02-23 山东理工大学 Multi-language code plagiarism detection method based on pseudo-twin network
CN113360915A (en) * 2021-06-09 2021-09-07 扬州大学 Intelligent contract multi-vulnerability detection method and system based on source code graph representation learning
CN113961241A (en) * 2021-11-02 2022-01-21 南京大学 Code clone detection method based on GAT (graph attention network) graph neural network model
CN114327483A (en) * 2021-12-31 2022-04-12 华中科技大学 Graph tensor neural network model establishing method and source code semantic identification method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975392A (en) * 2016-04-29 2016-09-28 国家计算机网络与信息安全管理中心 Duplicated code detection method and device based on abstract syntax tree
CN107169358B (en) * 2017-05-24 2019-10-08 中国人民解放军信息工程大学 Code homology detection method and its device based on code fingerprint
CN108446540B (en) * 2018-03-19 2022-02-25 中山大学 Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 A kind of detection method and device of code similarity
CN109445834B (en) * 2018-10-30 2021-04-30 北京计算机技术及应用研究所 Program code similarity rapid comparison method based on abstract syntax tree
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 A kind of code annotation generation method based on program analysis and Recognition with Recurrent Neural Network
US11243746B2 (en) * 2019-07-01 2022-02-08 X Development Llc Learning and using programming styles
CN110990273B (en) * 2019-11-29 2024-04-23 中国银行股份有限公司 Clone code detection method and device
CN113157917B (en) * 2021-03-15 2023-03-24 西北大学 OpenCL-based optimized classification model establishing and optimized classification method and system
CN113312268A (en) * 2021-07-29 2021-08-27 北京航空航天大学 Intelligent contract code similarity detection method
CN114547619B (en) * 2022-01-11 2024-04-19 扬州大学 Vulnerability restoration system and restoration method based on tree

Also Published As

Publication number Publication date
CN115129364A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
Kim et al. Semantic sentence matching with densely-connected recurrent and co-attentive information
Ma et al. Joint embedding VQA model based on dynamic word vector
Niu et al. Multi-modal multi-scale deep learning for large-scale image annotation
Moon et al. Multimodal named entity disambiguation for noisy social media posts
Yu et al. Beyond Word Attention: Using Segment Attention in Neural Relation Extraction.
CN111506714A (en) Knowledge graph embedding based question answering
Peng et al. MAVA: Multi-level adaptive visual-textual alignment by cross-media bi-attention mechanism
CN114548101B (en) Event detection method and system based on backtracking sequence generation method
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN115221325A (en) Text classification method based on label semantic learning and attention adjustment mechanism
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN115982338A (en) Query path ordering-based domain knowledge graph question-answering method and system
Sui et al. Causality-aware enhanced model for multi-hop question answering over knowledge graphs
Yang et al. Semantic-preserving adversarial text attacks
Zhu et al. Configurable graph reasoning for visual relationship detection
CN117018632A (en) Game platform intelligent management method, system and storage medium
CN115129364B (en) Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network
CN116595982A (en) Nested named entity identification method based on dynamic graph convolution
Jung et al. Improving visual relationship detection using linguistic and spatial cues
CN115759043A (en) Document-level sensitive information detection model training and prediction method
CN113254575B (en) Machine reading understanding method and system based on multi-step evidence reasoning
CN115359486A (en) Method and system for determining custom information in document image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant