CN115129364B

CN115129364B - Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Info

Publication number: CN115129364B
Application number: CN202210782999.4A
Authority: CN
Inventors: 张磊; 郭迪骁; 刘亮; 陶锐
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-07-05
Filing date: 2022-07-05
Publication date: 2023-04-18
Anticipated expiration: 2042-07-05
Also published as: CN115129364A

Abstract

The invention discloses a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, relating to cyberspace security technology. The method comprises: preprocessing a source code; constructing an abstract syntax tree of the source code, performing feature selection on the token values of the abstract syntax tree nodes, and adding edges of different types to construct a code feature graph; training a graph matching neural network with the code feature graph to generate a graph embedding feature vector of the source code; and using a twin (Siamese) neural network to evaluate two graph embedding feature vectors output by the graph matching neural network and identify whether the two source codes belong to the same author. The method combines abstract syntax tree and graph neural network techniques: it generates the abstract syntax tree of a source code, constructs a source code feature graph by adding edge information and screening token values, generates the graph embedding feature vector of the feature graph with a graph neural network that includes an attention mechanism, and determines the author identity of the source code through the twin neural network.

Description

Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Technical Field

The invention relates to the technical field of network space security, in particular to a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network.

Background

Source code authorship attribution identifies the author of a given piece of code. As the volume of malware increases and mutation techniques evolve, malware authors are creating large numbers of malware variants. To address this problem, a method for identifying the author of malicious code is necessary. Source code authorship attribution mainly comprises author feature extraction and identity recognition. The prior art generally applies plain-text natural language processing, extracting an author's programming characteristics with an N-gram language model; this approach is time-consuming, the extracted code features are not robust, and the recognition accuracy is low. To extract the identity features of source code accurately and efficiently, some methods introduce a hand-crafted feature set consisting of layout, lexical and syntactic features. These features can resist code formatting and obfuscation, but the lexical and syntactic features are extracted separately even though the two are closely related in source code. Separate extraction can yield wrong or useless features and reduces recognition accuracy; for example, common keywords such as the function print appear in all samples, so extracting them does not help recognition accuracy. The number of features extracted in this way is also large, so the approach does not scale well to large author data sets. Other methods process the source code with term frequency-inverse document frequency (TF-IDF), Word2Vec (word to vector) and similar techniques, which ignore the syntactic features of control flow and data flow. Moreover, traditional deep neural networks cannot handle the non-Euclidean structure of source code well: the structure is of arbitrary size, its topology is complex, it lacks the spatial locality of an image, and it has no fixed node order. Operations of traditional deep learning such as convolution therefore cannot be applied to source code directly; the source code can be trained on only after it is converted into a Euclidean structure such as a word frequency vector, and some features are lost in the conversion, which makes the extracted identity features of the source code inaccurate.

Disclosure of Invention

The invention aims to provide a fingerprint identity recognition method and system based on an abstract syntax tree and a graph neural network, to solve the problems of prior-art methods for recognizing source code identity features: wrong or useless features caused by extracting closely related features separately, a large number of extracted features, and inaccurate extracted identity features.

The invention solves the problems through the following technical scheme:

A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network comprises the following steps:

Step S10, preprocessing a source code;

Step S20, constructing an abstract syntax tree of the source code, and constructing a code feature graph by performing feature selection on the token values of the abstract syntax tree nodes and adding different types of edges;

Step S30, training a graph matching neural network with the code feature graph to generate a graph embedding feature vector of the source code;

Step S40, using a twin neural network to evaluate two graph embedding feature vectors output by the graph matching neural network, and identifying whether the two source codes belong to the same author.

Step S10 comprises deleting comments in the source code and internal functions called by the source code, and normalizing line feed characters, tab characters and spaces in the source code.

Step S20 includes:

Step S21, modifying an AST (abstract syntax tree) generation module by adding variable name information to obtain a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate the abstract syntax tree of the source code;

Step S22, screening the features of the abstract syntax tree with the term frequency-inverse document frequency (TF-IDF) algorithm;

Step S23, adding edges representing control flow and data flow to preserve the user programming characteristics of the source code;

Step S24, adding reverse edges to all edges to generate a source code feature graph, converting the edges and nodes into embedding vectors through a hash algorithm, and thereby initializing the embedding vectors of the nodes and edges of the source code feature graph.

Parent, Child, Sibling, Token, Nextuse, If, While and For edges, which represent control flow and data flow, are added to preserve the user programming characteristics of the source code. The control flow edges capture the user's preference among these control structures, since users with different programming habits use different structure types with different frequencies. The data flow edges preserve key token information and the flow of variables, mainly the number of function parameters, the parameter types and how variables are passed. Reverse edges are then added to all edges to generate the source code feature graph.

Step S30 specifically includes: the graph matching neural network performs cross-graph learning and updating on input labeled pairs of code feature graphs in the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise; finally, a low-dimensional graph embedding feature vector of the source code is generated through graph pooling.
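The labeled pair format (G1, G2, label) can be sketched as follows. This is a minimal illustration, assuming each sample is a (graph, author) tuple; the function name `make_pairs` and the strings standing in for code feature graphs are hypothetical and do not come from the patent.

```python
from itertools import combinations

def make_pairs(samples):
    """Build (G1, G2, label) triples from (graph, author) samples.

    label is 1 when both graphs come from the same author, otherwise -1.
    """
    pairs = []
    for (g1, a1), (g2, a2) in combinations(samples, 2):
        pairs.append((g1, g2, 1 if a1 == a2 else -1))
    return pairs

# Hypothetical placeholders: strings stand in for code feature graphs.
samples = [("graph_a1", "alice"), ("graph_a2", "alice"), ("graph_b1", "bob")]
pairs = make_pairs(samples)
```

In practice the number of different-author pairs grows much faster than the number of same-author pairs, so some form of balancing would likely be needed; the patent does not describe how pairs are sampled.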

The fingerprint identity recognition system based on the abstract syntax tree and the graph neural network comprises a source code feature graph generation module, the graph matching neural network and a twin neural network, wherein:

the source code feature graph generation module is used for preprocessing a source code, generating the abstract syntax tree of the code with the proposed data-enhanced AST, and constructing the code feature graph of the source code by adding bidirectional feature edges and screening feature tokens;

the graph matching neural network, which includes an attention mechanism, is used for learning and updating the model parameters from the input labeled code feature graphs and, after learning is finished, performing a graph pooling embedding operation on the input graph structure and outputting a one-dimensional code feature vector;

the twin neural network is used for generating two embedding vectors from the two input code feature vectors, calculating the similarity between them through the cosine distance, and judging whether the two source codes belong to the same author.

The graph matching neural network is composed of an encoder, a propagation layer and an aggregation layer. The encoder maps the edge and node features to initial node and edge vectors by using a single multilayer perceptron (MLP); the propagation layer maps a set of nodes to new node representations through multiple rounds of learning and the attention mechanism; and the aggregation layer takes the set of graph nodes as input and calculates the feature graph embedding.

The twin neural network consists of two identical basic networks and a matcher, wherein the two identical basic networks are used for abstracting high-level feature vectors from the input feature vectors, and the matcher is used for calculating the similarity score of the two high-level feature vectors; the basic network consists of four full connection layers, and the matcher consists of a subtraction layer, a full connection layer and a softmax layer.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The method combines abstract syntax tree and graph neural network techniques. It generates the abstract syntax tree of the source code with the data-enhanced AST and constructs a source code feature graph by adding edge information and screening tokens, which greatly reduces the number of features that are useless for author identification and adds data flow and control flow features. It generates the graph embedding feature vector of the feature graph with a graph neural network that has an attention mechanism, which makes training more targeted by focusing on the programming features most effective for author identification, and it determines the author identity of the source code through a twin neural network.

(2) The method extracts not only the syntactic and semantic features of the source code but also its data flow and control flow features; given the same authors and code sample data, its recognition accuracy is superior to that of other methods.

(3) The feature extraction method for constructing the code feature graph can be expanded to other programming languages, and has good expandability.

(4) The invention realizes the de-anonymization of the source code, and can be applied to the application fields of malicious code author identity tracing detection, copyright dispute, plagiarism and the like.

Drawings

FIG. 1 is a block diagram of the system of the present invention;

FIG. 2 is a block diagram of a source code feature map generation module;

FIG. 3 is a diagram of an abstract syntax tree derived from source code;

FIG. 4 is a block diagram of a graph matching neural network;

FIG. 5 is a block diagram of a twin neural network;

FIG. 6 is a flow chart of a code author identification process.

Detailed Description

The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.

Example 1:

referring to fig. 1, a fingerprint identification method based on an abstract syntax tree and a graph neural network includes:

Step S10: the source code is preprocessed to eliminate the influence of different integrated development environments and converted into a tree structure based on an abstract syntax tree, and a code feature graph is constructed by performing feature selection on the token values of the tree nodes and adding different types of edges;

In this embodiment, all data sets of Google Code Jam from 2008 to 2020 are collected, including important information such as the contest problems, the participants, the code submitted by the participants and the code types. Google Code Jam (GCJ) is an international programming competition sponsored by Google. It is used because GCJ participants include programmers of different education levels from almost all countries and regions, so the data set simulates real scenarios without anonymization well. Step S10 specifically includes:

Step S11: in this embodiment, two preprocessing modes are applied to the source code. One extracts the layout features of the source code; the other normalizes the source code by deleting comments and the internal functions it calls and by normalizing line feed characters, tab characters, spaces and the like. Because different programmers in real programming environments use different integrated development environments, and because code layout and comments also carry information about a programmer's identity, a method for extracting code layout features was designed. The extracted layout features are shown in Table 1:

TABLE 1 layout characteristics

Experimental results show that, in programmer recognition tasks at three scales of 50, 100 and 1000 programmers, combining the layout features with the AST-based graph pooling features gives essentially the same recognition accuracy as using the AST-based graph pooling features alone, while increasing the training time and training cost. This embodiment therefore chooses to delete, in advance, the comments in the source code and the internal functions it calls, and to normalize line feed characters, tab characters, spaces and the like, which eliminates the influence of different integrated development environments, reduces the number of features and improves computational efficiency.
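The normalization chosen in this step can be sketched as below. This is a minimal sketch under stated assumptions: it handles only C-style comments (as in C/C++ and Java), it does not remove called internal functions, and the function names `strip_comments`, `normalize_whitespace` and `preprocess` are hypothetical. The patent does not give the exact normalization rules, so collapsing whitespace runs and dropping blank lines is one plausible reading.

```python
import re

# Matches C-style comments, plus string and character literals so that
# comment markers inside literals are left untouched.
_TOKEN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def strip_comments(source):
    """Replace each comment with a single space; keep literals as they are."""
    def replace(match):
        text = match.group(0)
        return " " if text.startswith("/") else text
    return _TOKEN.sub(replace, source)

def normalize_whitespace(source):
    """Unify line endings, collapse runs of spaces/tabs, drop blank lines."""
    lines = []
    for line in source.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)

def preprocess(source):
    return normalize_whitespace(strip_comments(source))

example = 'int main() {\r\n\t// say hi\n  puts("a // b");   /* done */\n\n}\n'
cleaned = preprocess(example)
```

Stripping leading indentation is harmless for brace-delimited languages but would change the meaning of Python source, so a Python front end would need a different whitespace rule. The whitespace collapse also applies inside string literals, which a stricter implementation would avoid.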

step S12: modifying an AST generation module, adding variable name information to obtain a data enhanced AST, analyzing a source code by using the data enhanced AST, and generating an abstract syntax tree of the source code;

Because the features of the programmer fingerprint must be extracted, features such as the variable names, function names and number of function parameters in the source code need to be captured. This embodiment therefore adopts a new method for generating the AST, referred to as the data-enhanced AST (data-augmented AST). A conventional AST generation module only generates a corresponding class for a variable, a constant and the like; for example, in the ast module of Python, only the variable class name is returned for a variable a, and the variable name a serves only as a value of that class. As shown in FIG. 3, the left side is a simple input source code and the right side is the converted abstract syntax tree structure. The source code on the left defines an add function, which calculates and returns the value of parameter a plus parameter b, and then calls the add function with the arguments 2 and 3. The root node of the AST on the right is a Module. The left subtree of the root node is the function definition part, which comprises function, entries and return nodes, with parameters a and b as child nodes; the right subtree of the root node is the function call part, which comprises call, funcname and the input data nodes Num(2) and Num(3).
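The idea of attaching names and values to AST nodes can be illustrated with Python's standard `ast` module. This is only a sketch of the concept, not the patent's modified AST generation module: the `Type:value` token format and the functions `node_token` and `enhanced_ast` are assumptions made for illustration, and the node names follow Python's `ast` classes rather than the labels in FIG. 3.

```python
import ast

def node_token(node):
    """Return 'Type' or 'Type:value' so that names and constants survive."""
    kind = type(node).__name__
    if isinstance(node, ast.Name):
        return f"{kind}:{node.id}"
    if isinstance(node, ast.arg):
        return f"{kind}:{node.arg}"
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        return f"{kind}:{node.name}"
    if isinstance(node, ast.Constant):
        return f"{kind}:{node.value!r}"
    return kind

def enhanced_ast(source):
    """Parse source and return (node tokens, parent->child index pairs)."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node, parent):
        idx = len(nodes)
        nodes.append(node_token(node))
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, None)
    return nodes, edges

# The example of FIG. 3: define add(a, b) and call it with 2 and 3.
source = "def add(a, b):\n    return a + b\n\nadd(2, 3)\n"
nodes, edges = enhanced_ast(source)
```

Each node carries both its syntactic class and, where one exists, the user-chosen identifier or literal, which is the information the data-enhanced AST is described as preserving.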

Step S13: the AST (abstract syntax tree) structure generated by the invention contains all user-defined variable names, so for a large sample data set the number of tokens is excessive: approximately 20000 token values are generated from a data set of 4000 code samples from 100 authors in total. To prevent overfitting and reduce training cost, the TF-IDF algorithm is used to screen the features, and through multiple experiments the number of token values retained was finally set to 8000. TF-IDF is a statistical method for assessing how important a word is to one document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
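A token screening step of this kind might look like the sketch below. It assumes each code sample has already been reduced to a list of token values, scores each token by its highest TF-IDF value in any sample, and keeps the top k. The patent does not say how per-document scores are aggregated into one ranking, so the use of the maximum, and the function name `select_tokens`, are assumptions.

```python
import math
from collections import Counter

def select_tokens(docs, k):
    """Keep the k tokens with the highest TF-IDF score in any document."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    best = {}
    for doc in docs:
        if not doc:
            continue
        counts = Counter(doc)
        for token, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n / doc_freq[token])
            best[token] = max(best.get(token, 0.0), tf * idf)

    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]

docs = [
    ["print", "x", "x"],
    ["print", "y"],
    ["print", "z", "z", "z"],
]
kept = select_tokens(docs, 3)
```

A token such as print that occurs in every sample gets an inverse document frequency of zero and is ranked last, which matches the motivation given in the Background for discarding keywords common to all authors.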

Step S14: Parent, Child, Sibling, Token, Nextuse, If, While and For edges, which represent control flow and data flow, are added to preserve the user programming characteristics of the source code. For the control flow edges, this embodiment selects only the three control structures If, While and For, which are common to C/C++, Python and Java, so that programmer fingerprint recognition can work across programming languages. The control flow edges also capture the user's preference among these structures when writing code; users with different programming habits use different structure types with different frequencies, which is important for characterizing a programmer's fingerprint. The data flow edges preserve key token information and the flow of variables, mainly features such as the number of function parameters, the parameter types and how variables are passed. The Parent edge connects a non-root node to its parent node; the Child edge connects a non-leaf node to its child nodes; the Sibling edge connects a node to its sibling node, in left-to-right order; the Token edge connects a terminal node to another terminal node; the Nextuse edge connects a variable node to the node of its next occurrence.

In this embodiment, reverse edges are added to all edges, and their effect was tested: feature graphs with and without reverse edges were constructed from the data set of 100 programmers used in step S11 and used to train the graph matching neural network (GMN). The experiments show that, with reverse edges, the recognition accuracy and the loss value level off at about epoch 20, whereas without reverse edges they level off at about epoch 30. The reverse edges therefore speed up the convergence of the graph matching model: on the same data set, the model with reverse edges needs fewer epochs to reach its highest accuracy, which improves training efficiency.

The source code feature graph is generated through the above steps. As shown in FIG. 2, for an If structure in the abstract syntax tree, a CondTrue feature edge is added between the condition Condition and the Then branch executed when the condition is satisfied, together with a reverse ForNext feature edge, and a CondFalse feature edge is added between the condition Condition and the Else branch executed when the condition is not satisfied; for a While loop structure, a WhileExec feature edge and a reverse WhileNext feature edge are added between the loop condition Condition and the loop body Body; for a For loop structure, a ForExec feature edge and a reverse ForNext feature edge are added between the loop control ForControl and the loop body Body; for a sequential execution structure, a NextStmt feature edge is added between sequentially executed Statement nodes.
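The edge construction can be sketched on top of Python's `ast` module as follows. This is an illustrative approximation rather than the patent's implementation: it covers the Child, Parent, Sibling and Nextuse edges and the CondTrue, CondFalse, WhileExec and ForExec control edges, omits Token and NextStmt edges, treats the `iter` expression of a Python for loop as the loop control, and names every reverse edge by prefixing `Reverse`. All function names are hypothetical.

```python
import ast

def build_feature_edges(source):
    """Return (AST nodes in pre-order, typed edges as (src, dst, type))."""
    tree = ast.parse(source)
    nodes, edges = [], []
    index = {}     # id(node) -> position, used to look up statements
    last_use = {}  # variable name -> position of its latest occurrence

    def visit(node):
        idx = len(nodes)
        nodes.append(node)
        index[id(node)] = idx
        if isinstance(node, ast.Name):
            if node.id in last_use:
                edges.append((last_use[node.id], idx, "Nextuse"))
            last_use[node.id] = idx
        previous = None
        for child in ast.iter_child_nodes(node):
            cidx = visit(child)
            edges.append((idx, cidx, "Child"))
            edges.append((cidx, idx, "Parent"))
            if previous is not None:
                edges.append((previous, cidx, "Sibling"))
            previous = cidx
        # Control flow edges from the condition to the first statement run.
        if isinstance(node, ast.If):
            test = index[id(node.test)]
            edges.append((test, index[id(node.body[0])], "CondTrue"))
            if node.orelse:
                edges.append((test, index[id(node.orelse[0])], "CondFalse"))
        elif isinstance(node, ast.While):
            edges.append((index[id(node.test)], index[id(node.body[0])], "WhileExec"))
        elif isinstance(node, ast.For):
            edges.append((index[id(node.iter)], index[id(node.body[0])], "ForExec"))
        return idx

    visit(tree)
    return nodes, edges

def add_reverse_edges(edges):
    """Add a reverse edge of a distinct type for every edge."""
    return edges + [(dst, src, "Reverse" + kind) for src, dst, kind in edges]

source = "x = 1\nif x:\n    y = x\nelse:\n    y = 2\n"
nodes, edges = build_feature_edges(source)
all_edges = add_reverse_edges(edges)
kinds = [kind for _, _, kind in edges]
```

Adding a reverse edge for every edge doubles both the number of edges and the number of edge types, which is the property the embodiment credits with faster convergence.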

Step S15: the edges and nodes are converted into embedding vectors through hashing, thereby initializing the embedding vectors of the nodes and edges of the source code feature graph.
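One simple way to turn node tokens or edge types into initial vectors by hashing is shown below. The patent does not specify the hash algorithm or the vector layout, so the use of SHA-256, the one-hot encoding, the dimension of 16 and the name `hash_embedding` are all assumptions for illustration.

```python
import hashlib

def hash_embedding(token, dim=16):
    """Map a token string to a deterministic one-hot vector of length dim."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % dim
    vector = [0.0] * dim
    vector[bucket] = 1.0
    return vector

node_vector = hash_embedding("Name:a")
edge_vector = hash_embedding("Nextuse")
```

Unlike Python's built-in `hash`, which is randomized per process for strings, a cryptographic digest gives the same vector on every run, so features stay stable between training and inference. Distinct tokens can collide in the same bucket; a larger `dim` reduces that risk.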

Step S20: the source code feature graph corpus generated in step S10 is used to train a graph matching neural network with an attention mechanism. The graph matching network performs cross-graph learning and updating on input labeled sample pairs in the format (G1, G2, label), where label is 1 if the two code feature graphs belong to the same author and -1 otherwise. The attention mechanism enables the graph neural network to find the most important node and edge features in the programmer feature graphs during cross-graph training. The attention value is calculated as follows:

where the symbols of the formula, which are not reproduced in this text, denote the following: the current hidden state of a node at time step t; a GRU unit; a vector similarity function, which in this embodiment is cosine similarity; the difference between a node of the first graph and its nearest neighbor in the other graph; the attention weight; the node set of the first graph and the node set of the second graph; the current hidden state of node j at time step t; the current hidden state of node i at time step t; and the hidden state, at time step t, of the nearest neighbor in the second graph of node i in the first graph. Here i and j refer to nodes, and the pair (i, j) refers to an edge.
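A cross-graph attention step consistent with these definitions can be sketched in plain Python. The sketch assumes the attention weights are a softmax over cosine similarities and that the matching vector is the attention-weighted difference between a node and the nodes of the other graph, as in commonly used graph matching network formulations; the exact formula of the patent is not reproduced here, and the names `cosine` and `cross_graph_match` are hypothetical. The GRU update of the hidden state is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cross_graph_match(h1, h2):
    """For each node state in h1, attend over the node states in h2.

    Returns a list of (attention weights, matching vector) per node, where
    the matching vector is sum_j weight_j * (h_i - h_j).
    """
    results = []
    for hi in h1:
        scores = [cosine(hi, hj) for hj in h2]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        mu = [
            sum(w * (hi[d] - hj[d]) for w, hj in zip(weights, h2))
            for d in range(len(hi))
        ]
        results.append((weights, mu))
    return results

# One node in the first graph, two nodes in the second graph.
weights, mu = cross_graph_match([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])[0]
```

When a node has an identical counterpart in the other graph and nothing else to attend to, the matching vector is zero, so the cross-graph term only contributes where the two graphs differ.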

Finally, low-dimensional graph embedding feature vectors of the source codes are generated through global graph pooling. As shown in FIG. 4, a pair of feature graphs is input, and a multilayer perceptron (MLP) maps the original node features and edge features to initial feature vectors. The propagation layer then maps a set of node feature vectors to a new set of node feature vectors. When updating a node feature vector, it uses not only the information of adjacent nodes within each graph but also a cross-graph matching vector, which describes the degree of matching between a node in the current graph and the nodes in the other graph. An attention mechanism is added: the attention weight represents the degree of matching between a node in graph i and the corresponding node in graph j, and also serves as the update weight of the node vector. After a certain number of propagation rounds, the aggregation layer takes the node set of the feature graph as input and calculates the graph-level embedding feature vector through graph pooling. In this embodiment, the graph matching neural network (GMN) and the gated graph neural network (GGNN) were compared, and the graph matching neural network was finally selected; the comparison results are shown in Table 2:

TABLE 2 accuracy of neural networks of different graphs

Step S30: the two graph embedding feature vectors output by the graph matching network are input into the twin neural network designed in this embodiment; the twin neural network learns the two input feature vectors again and finally determines, by calculating the cosine distance, whether the two source codes belong to the same author, as shown in FIG. 5;

the graph matching neural network in the embodiment is composed of an encoder, a propagation layer and an aggregation layer. As shown in fig. 4, the graph matching neural network is to compute and aggregate graph nodes to output graph-embedded feature vectors.

In this embodiment, the twin neural network is composed of three parts, two identical basic networks and a matcher, and as shown in fig. 5, the two identical basic networks are responsible for abstracting a high-level feature vector from input feature vectors, and the matcher is responsible for calculating a similarity score of the two high-level feature vectors.

Example 2:

with reference to fig. 1, the fingerprint identification system based on abstract syntax tree and graph neural network includes a source code feature diagram generation module, a graph matching neural network generator and a twin neural network discriminator, wherein:

The source code feature graph generation module is used for preprocessing a source code, generating an abstract syntax tree, and constructing a code feature graph based on the abstract syntax tree. First, comments in the source code are removed, and tab characters, line feed characters, spaces and the like are normalized, which prevents different code development environments from affecting author identification, reduces the number of features and improves computational efficiency. Second, the preprocessed source code is processed with the proposed data-enhanced AST, which extracts information such as variable names, and an abstract syntax tree is generated. Based on the abstract syntax tree, the token values are screened with TF-IDF to reduce the feature dimensionality, and a code feature graph is then constructed by adding 8 different types of edges. The Parent edge connects a non-root node to its parent node; the Child edge connects a non-leaf node to its child nodes; the Sibling edge connects a node to its sibling node, in left-to-right order; the Token edge connects a terminal node to another terminal node; the Nextuse edge connects a variable node to the node of its next occurrence. The data flow edges focus on extracting the programmer's functions, variable names, key tokens and data flow directions. To make up for the fact that the AST contains only syntactic features, 3 edge types for control structures found in most programming languages are added to extract the control flow information of the user's code, namely If edges, While edges and For edges; the edge construction is shown in FIG. 2. In programmer fingerprint identification, the control flow edges capture the user's preference among these structures, since users with different programming habits use different structure types with different frequencies. Finally, a reverse edge is introduced for every edge of every type, which doubles the types and number of edges, increases the information entropy, and helps information propagate faster in the graph neural network.

The graph matching neural network generator takes the feature graphs generated by the source code feature graph generation module as input, learns the characteristics of programmer fingerprints, and outputs graph embedding feature vectors;

The twin neural network discriminator takes two feature vectors generated by the graph matching neural network generator as input (for example, the feature vector of an anonymous code and the feature vector of a known programmer) and outputs a similarity score of the two feature vectors through the base network (a deep neural network of 4 fully connected layers) and the discriminator. The score indicates whether the anonymous source code is likely to have been written by the known programmer. Since this architecture does not depend on the number of classes in the data set, it can easily be extended to new programmers.

The graph matching neural network generator and the twin neural network discriminator need to be trained in advance; their network structures are shown in FIG. 4 and FIG. 5 respectively. The training platform is the Python machine learning library PyTorch, with the following parameters: the training batch size is set to 64 and the number of training epochs to 200. For the graph matching neural network generator, the number of graph matching network layers is set to 4, the dimension of the graph pooling embedding vector is 400, the learning rate is 0.001, and the adaptive moment estimation (Adam) optimizer is used. For the twin neural network discriminator, the base network is a deep neural network (DNN) with 4 hidden layers of 400, 300, 200 and 100 neurons respectively, the dropout rate is set to 0.2, and the ReLU function is used as the activation function.
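For reference, the hyperparameters listed above can be collected in a single configuration object. The values come from the text; the class name `TrainingConfig` and the field names are hypothetical and are not tied to any particular PyTorch API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    # Shared training settings
    batch_size: int = 64
    epochs: int = 200
    # Graph matching neural network generator
    gmn_layers: int = 4
    graph_embedding_dim: int = 400
    learning_rate: float = 0.001
    optimizer: str = "Adam"
    # Twin neural network discriminator (base network)
    hidden_sizes: tuple = (400, 300, 200, 100)
    dropout_rate: float = 0.2
    activation: str = "ReLU"

config = TrainingConfig()
```

Keeping the values in one frozen object makes it harder for the generator and the discriminator to drift apart, for example if the 400-dimensional graph embedding no longer matches the input width of the first hidden layer.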

When the twin neural network discriminator evaluates a pair of input feature vectors, as shown in FIG. 6, a pair of source code feature vectors generated by the graph matching neural network generator is taken as input, and a new pair of feature vectors is generated through the base subnetworks of the twin neural network. The discriminator then calculates the cosine distance of the two vectors to obtain a similarity score and checks whether the score is greater than a preset threshold: if so, the two source codes belong to the same author; if not, they belong to different authors.
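The final decision step can be sketched as follows, assuming the similarity score is the cosine similarity of the two feature vectors. The threshold of 0.5 and the function names are hypothetical; the patent only states that a preset threshold is used, and the base subnetworks that produce the vectors are not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def same_author(u, v, threshold=0.5):
    """Return True when the similarity score exceeds the preset threshold."""
    return cosine_similarity(u, v) > threshold

decision_same = same_author([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
decision_diff = same_author([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the decision compares one pair of vectors at a time, a new programmer can be handled by supplying a reference vector, without retraining a classifier over a fixed set of author classes.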

The system processes the source code with the feature graph construction method proposed here to generate the feature graph structure of the source code. The method is extensible across programming languages and extracts features comprehensively, covering not only traditional syntactic features but also data flow and control flow features. A graph neural network and a twin neural network are then used to learn the feature graph and to evaluate the generated graph feature vectors in order to identify the author. Compared with neural networks such as CNN and RNN, a graph neural network is better suited to the non-Euclidean tree and graph structures of the source code abstract syntax tree and feature graph and can learn and extract features more effectively. In addition, combining it with a deep neural network greatly reduces the feature dimensionality and improves the speed and accuracy of source code author identification.

Although the invention has been described herein with reference to the illustrated embodiments, which are merely preferred embodiments of the invention, the invention is not limited thereto; many other modifications and embodiments will be apparent to those skilled in the art and will fall within the spirit and scope of the principles of this disclosure.

Claims

1. A fingerprint identity recognition method based on an abstract syntax tree and a graph neural network is characterized by comprising the following steps:

Step S10, preprocessing source code;

Step S20, constructing an abstract syntax tree of the source code, wherein the abstract syntax tree structure contains all user-defined variable names, and constructing a code feature graph by performing feature selection on the token values of the abstract syntax tree nodes and adding edges of different types, specifically comprising the following steps:

Step S21, modifying an abstract syntax tree (AST) generation module to add variable name information, thereby obtaining a data-enhanced AST, and parsing the source code with the data-enhanced AST to generate the abstract syntax tree of the source code;

Step S23, adding edges representing control flow and data flow to preserve the user's programming characteristics in the source code;

Step S24, adding a reverse edge for every edge to generate the source code feature graph, and converting the edges and nodes into embedding vectors through a hash algorithm, thereby initializing the embedding vectors of the nodes and edges of the source code feature graph;

Step S30, training a graph matching neural network with the code feature graph and generating a graph embedding feature vector of the source code, specifically comprising the following steps:

the graph matching neural network performs cross-graph learning and updating on input labeled code feature graph pairs, wherein each labeled pair has the format (G1, G2, label), the label being 1 if the two code feature graphs belong to the same author and -1 otherwise, and finally generates a low-dimensional graph embedding feature vector of the source code through graph pooling;
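For illustration only, the hash-based initialization of step S24 and the labeled pair format (G1, G2, label) of step S30 might be sketched as follows. The choice of SHA-256, the vector dimension of 8, the scaling to [-1, 1) and both function names are assumptions; the claim does not specify a hash algorithm or a dimension.

```python
import hashlib


def hash_embedding(token, dim=8):
    """Map a token string to a deterministic vector with entries in [-1, 1).

    SHA-256 and dim=8 are illustrative choices only.
    """
    if not 0 < dim <= 32:
        raise ValueError("dim must be between 1 and 32 for a SHA-256 digest")
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # One digest byte (0..255) per dimension, rescaled to [-1, 1).
    return [digest[i] / 128.0 - 1.0 for i in range(dim)]


def make_labeled_pair(graph_a, author_a, graph_b, author_b):
    """Build a (G1, G2, label) tuple: label 1 for the same author, else -1."""
    label = 1 if author_a == author_b else -1
    return (graph_a, graph_b, label)
```

Because the mapping is deterministic, identical node or edge tokens receive identical initial vectors in every graph.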

2. The fingerprint identity recognition method based on an abstract syntax tree and a graph neural network of claim 1, wherein step S10 comprises deleting comments in the source code and internal functions called by the source code, and normalizing line feeds, tabs and spaces in the source code.
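As a non-limiting sketch of such preprocessing, the following assumes C-style comment syntax and a tab width of 4, neither of which is stated in the claim. The regular expressions are deliberately naive (they do not protect comment markers inside string literals), the removal of internal functions is not shown, and dropping blank lines is an additional assumption.

```python
import re


def preprocess(source, tab_width=4):
    """Strip C-style comments and normalize whitespace (illustrative only)."""
    # Normalize line feeds to "\n".
    text = source.replace("\r\n", "\n").replace("\r", "\n")
    # Remove /* ... */ block comments, then // line comments.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", "", text)
    # Normalize tabs to spaces and strip trailing spaces.
    text = text.expandtabs(tab_width)
    lines = [line.rstrip() for line in text.split("\n")]
    # Assumption: blank lines carry no information and are dropped.
    return "\n".join(line for line in lines if line)
```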

3. A fingerprint identity recognition system based on an abstract syntax tree and a graph neural network, for implementing the fingerprint identity recognition method based on an abstract syntax tree and a graph neural network of claim 1, comprising a source code feature graph generation module, a graph matching neural network and a twin neural network, wherein:

the source code feature graph generation module is used for preprocessing the source code, generating a code abstract syntax tree by using the data-enhanced AST, and constructing a code feature graph of the source code by adding bidirectional feature edges and screening feature token values;

the graph matching neural network is used for updating and learning the parameters of the model according to the input labeled code feature graphs, adding an attention mechanism that calculates the matching degree of corresponding nodes in a pair of graphs as the weight for node updating, so that the graph neural network automatically learns the node feature information, performing a graph pooling embedding operation on the input graph structure after learning is finished, and outputting a one-dimensional code feature vector;

the twin neural network is used for generating two function embedding vectors according to the two input code feature vectors, calculating the similarity between the two code feature vectors through cosine distance, and judging whether the two source codes belong to the same author;

the graph matching neural network consists of an encoder, a propagation layer and an aggregation layer, wherein the encoder maps edge and node features to initial node and edge vectors by using a single multi-layer perceptron (MLP), the propagation layer maps a group of nodes to new node representations through multiple rounds of learning and the attention mechanism, and the aggregation layer takes the set of graph nodes as input to calculate the feature graph embedding.
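By way of illustration, one cross-graph attention step and a pooling step might look as follows. The sketch assumes dot-product similarity as the matching degree, softmax normalization of the weights, and sum pooling for aggregation; these follow a common graph matching network formulation and are not specified in the claim. Learned parameters and within-graph message passing are omitted, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_graph_messages(nodes_a, nodes_b):
    """For each node vector h_i of graph A, attend over the nodes of graph B.

    Returns h_i - sum_j a_ij * h_j per node, where a_ij is a softmax over
    dot-product similarities (the assumed matching degree).
    """
    messages = []
    for h_i in nodes_a:
        weights = softmax([dot(h_i, h_j) for h_j in nodes_b])
        attended = [
            sum(w * h_j[k] for w, h_j in zip(weights, nodes_b))
            for k in range(len(h_i))
        ]
        messages.append([x - y for x, y in zip(h_i, attended)])
    return messages


def pool_graph(nodes):
    """Aggregate node vectors into one graph vector (sum pooling assumed)."""
    return [sum(column) for column in zip(*nodes)]
```

When a node of one graph has an exact counterpart in the other and attends only to it, the message is the zero vector, so well-matched nodes contribute little difference signal.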

4. The fingerprint identity recognition system based on an abstract syntax tree and a graph neural network of claim 3, wherein the twin neural network is composed of two identical base networks and a matcher, the two identical base networks being responsible for abstracting high-level feature vectors from the input feature vectors, and the matcher being responsible for calculating the similarity score of the two high-level feature vectors; each base network consists of four fully connected layers, and the matcher consists of a subtraction layer, a fully connected layer and a softmax normalization layer.
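The layer arrangement recited in this claim can be sketched, purely for illustration, as an untrained forward pass. The layer widths, the ReLU activation, the random initialization, the two-class output order and the class name are all assumptions not taken from the claim.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwinNetworkSketch:
    """Hypothetical twin network: shared base network plus a matcher."""

    def __init__(self, in_dim, hidden=(16, 16, 8, 8), seed=0):
        rng = random.Random(seed)
        dims = (in_dim,) + tuple(hidden)
        # Four fully connected layers shared by both branches.
        self.base = [init_layer(rng, a, b) for a, b in zip(dims, dims[1:])]
        # Matcher: one fully connected layer mapping the difference to 2 scores.
        self.match = init_layer(rng, dims[-1], 2)

    def embed(self, x):
        for weights, bias in self.base:
            x = relu(dense(x, weights, bias))  # ReLU is an assumption
        return x

    def predict(self, a, b):
        # Subtraction layer, then fully connected layer, then softmax.
        diff = [u - v for u, v in zip(self.embed(a), self.embed(b))]
        return softmax(dense(diff, *self.match))
```

Because both inputs pass through the same weights, identical inputs always produce a zero difference vector regardless of how the base network is initialized.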