CN116663004A - Binary function similarity detection method and system based on Graph Transformers - Google Patents
Binary function similarity detection method and system based on Graph Transformers
- Publication number
- CN116663004A (application CN202310931335.4A)
- Authority
- CN
- China
- Prior art keywords
- graph
- node
- binary
- control flow
- functions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention provides a binary function similarity detection method and system based on Graph Transformers. The method comprises the following steps: performing static analysis on a binary function to obtain its control flow graph (CFG) and the assembly instruction segment of each node; normalizing the opcodes of the assembly instructions; encoding each node in the CFG with an unsupervised contrastive learning method to generate a feature vector for each node; extracting the structural information of the CFG with a Graph Transformer model to generate a graph embedding; and concatenating the graph embedding feature vectors of the two binary functions and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar. By using Graph Transformers for binary function similarity detection, the method and system can handle complex function structures, adapt to large-scale binary code libraries, and achieve high accuracy, efficiency, and scalability.
Description
Technical Field
The invention relates to the technical field of information security, and in particular to a binary function similarity detection method and system based on Graph Transformers.
Background
Binary analysis is an important component of many security applications, such as malware detection, binary vulnerability discovery, and software copyright protection. To extract meaningful information from binary code, researchers typically translate the functions of binary code into graph structures, such as control flow graphs (CFGs), and then use graph matching or graph isomorphism algorithms to compare whether two functions are similar.
Such as: chinese patent document CN202011367342.9 discloses a binary similarity detection method based on a graph automatic encoder, comprising preprocessing two binary functions to be compared; obtaining control dependency relationship between basic blocks in two binary functions by using a disassembly tool and an interface provided by the disassembly tool; recording control flow diagrams in two binary functions, extracting the digital characteristics of nodes of the control flow diagrams, focusing on the calling relation between functions, and calling the characteristics of other functions of the functions in a function entry basic block; encoding the obtained control flow graph by using a graph automatic encoder GAE to obtain an embedded vector describing the graph structure; after fusing the embedded vectors of the characterization graph structures of the two binary functions and the digital characteristic information of the nodes, calculating a training model through a loss function; the trained model can determine whether the given two binary functions are similar. According to the scheme, the semantic structure features and the node semantic features in the functions are reserved to the greatest extent, so that the accuracy of similarity comparison is improved.
Chinese patent document CN202110690580.1 discloses a method for similarity analysis of smart contract binary functions, comprising: decompiling the bytecode to generate EVM instructions and their parameters; reconstructing a control flow graph (CFG) from the decompiled EVM instructions; dividing the CFG of a contract into a plurality of binary functions and determining the timing relationships of the edges in the CFG; extracting feature values and the graph structure; and designing a model based on a temporally aggregated graph structure, so that the similarity of two binary functions can be obtained by comparing the aggregated graph structures. This scheme mainly targets similarity analysis of smart contract binary functions and, by studying contract bytecode directly, can process most contracts that lack source code.
Another example: Chinese patent document CN202110607066.7 discloses a binary function similarity detection method that fuses influencing factors, which first preprocesses the two binary functions to obtain their control flow graphs (CFG1, CFG2); then extracts the features of each basic block in the CFGs, represents the basic blocks as feature vectors, and generates the corresponding attributed control flow graphs (ACFG1, ACFG2); the attributed control flow graphs ACFG1 and ACFG2 of the two functions are then input into two identical graph embedding networks and converted into corresponding high-dimensional vectors. The parameters of the graph embedding networks are trained by minimizing an objective function; the cosine distance between the two high-dimensional vectors is calculated and the similarity of the two binary functions is output. This scheme mainly addresses the information loss caused by graph-embedding-based binary function similarity detection methods that neglect the different influences of successor nodes and neighbor nodes on a vertex.
However, none of the above existing binary function similarity detection methods can effectively handle the complexity of functions and large-scale binary code libraries; they also over-rely on their datasets and lack scalability.
Disclosure of Invention
The invention aims to solve the following technical problem: in view of the shortcomings of the prior art, a binary function similarity detection method and system based on Graph Transformers are provided. The structural information of the CFG is extracted with a Graph Transformer model to generate a graph embedding and realize similarity detection of binary functions, so that complex function structures can be handled, large-scale binary code libraries can be accommodated, and high detection accuracy, efficiency, and scalability are achieved.
In order to solve the technical problems, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a binary function similarity detection method based on Graph Transformers, comprising the following steps:
s1, acquiring two binary functions to be detected, and respectively carrying out static analysis on the acquired binary functions to acquire a control flow graph of the binary functions and assembly instruction segments of each node;
s2, normalizing the operation code of the assembly instruction in the assembly instruction section;
s3, coding each node in the control flow graph through an unsupervised contrast learning method to generate a feature vector of each node;
s4, extracting the structural information of the control flow graph with the Graph Transformer model, and aggregating the feature information of the nodes with the structural information of the control flow graph to generate the graph embedding;
and S5, concatenating the graph embedding feature vectors of the two binary functions and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar.
Further, the normalization processing of the operation code of the assembly instruction in step S2 specifically includes:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the opcode;
replacing all numerical constants with "num";
replacing all character strings with "str";
replacing the names of all non-system and non-library functions with "fun".
Further, step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the feature vector of each node, specifically as follows:
S301, acquiring the assembly instruction segments $X=\{x_1, x_2, \dots, x_m\}$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ is the number of assembly instruction segments;
S302, creating a positive example for each $x_i$ by applying two different, independently sampled dropout masks. Concretely, when the node features are encoded, the same input is fed into the encoder twice with different dropout masks and attention probabilities on the fully connected layers, yielding two embeddings of $x_i$ under two different dropout masks:
$$h_i^{z} = f(x_i, z), \quad h_i^{z'} = f(x_i, z');$$
where $z$ and $z'$ are two different random dropout masks, $h_i^{z}$ and $h_i^{z'}$ are the embeddings of $x_i$ under dropout masks $z$ and $z'$ respectively, and $f$ is the node encoder;
for a mini-batch with $N$ pairs, the training objective is:
$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}};$$
where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the examples in the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function.
Further, step S4 specifically refers to: inputting the graph data into the Transformer model, adding encoding information, aggregating the node feature vectors and capturing the structural information of the graph, and finally outputting the graph embedding feature vector.
Further, the method for adding the encoding information in step S4 includes:
one scheme adds a new position encoding to the input of the Transformer model to help encode distance-awareness information;
alternatively, a scheme adds a new spatial encoding as a bias term in the self-attention module of the Transformer model to accurately capture the spatial dependencies in the graph;
alternatively, both of the above methods are adopted, i.e. adding a new position encoding to the Transformer input and a new spatial encoding as a bias term in the Transformer self-attention module.
Furthermore, the method of adding a new position encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are pre-computed; they are defined by the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$
where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ correspond to the eigenvalues and eigenvectors of the Laplacian matrix respectively, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) Position encodings are added to the input data. Let $\lambda_i$ denote the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
first, for each node $i$ in the control flow graph, the node feature vector $x_i$ is embedded via a linear projection into the $d$-dimensional hidden feature $h_i^{0}$:
$$h_i^{0} = A^{0} x_i + a^{0};$$
where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into $d$-dimensional hidden features;
then the pre-computed node position encodings $\lambda_i$ are embedded via a linear projection into $d$-dimensional feature vectors $\lambda_i^{0}$ and added to the node features $h_i^{0}$, generating the new node features $\hat{h}_i^{0}$:
$$\lambda_i^{0} = C^{0} \lambda_i + c^{0}, \quad \hat{h}_i^{0} = h_i^{0} + \lambda_i^{0};$$
where $\lambda_i^{0}$ denotes the position-encoding feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding into $d$-dimensional feature vectors.
Still further, the method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to their spatial relation, and defining a function based on the connectivity between nodes in the graph to measure the spatial relation between two nodes in the graph; specifically:
for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relation between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined through the connectivity between the nodes in graph $G$;
if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest path distance between nodes $v_i$ and $v_j$;
if the two nodes are not connected, the output value of $\phi$ is set to a special value;
a learnable scalar is assigned to each feasible output value and used as a bias term in the self-attention module. Specifically, the element $(i, j)$ of the Query-Key product matrix $A$ is expressed as:
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$
where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
Further, step S5 concatenates the graph embedding feature vectors of the control flow graphs of the two binary functions and inputs them into the fully connected layer to obtain the probability that the two binary functions are similar, specifically as follows:
$$p = \mathrm{FC}(g_1 \,\|\, g_2);$$
where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\|$ denotes concatenation, and $p$ is the probability that the two functions are similar.
In a second aspect, the present invention also provides a binary function similarity detection system based on Graph Transformers, comprising a microprocessor and a memory which are interconnected, the microprocessor being programmed or configured to perform the steps of the binary function similarity detection method based on Graph Transformers described above.
Further, the binary function similarity detection system based on Graph Transformers specifically comprises:
the static analysis module is used for acquiring two binary functions to be detected; respectively carrying out static analysis on the obtained binary function to obtain a control flow diagram of the binary function and an assembly instruction segment of each node;
the preprocessing module is used for carrying out normalization processing on the operation code of the assembly instruction in the assembly instruction section obtained after static analysis;
the semantic perception module, used for encoding each node in the control flow graph through the unsupervised contrastive learning method, generating the feature vector of each node and extracting the semantic information of the assembly instructions;
the structure perception module, used for extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module, used for concatenating the graph embedding feature vectors of the control flow graphs of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar.
The following terms used in the invention are explained:
Transformers: a neural network architecture mainly used for processing sequence data in deep learning. Its core idea is to use a self-attention mechanism (Self-Attention Mechanism) to capture dependencies in a sequence, without relying on traditional recurrent or convolutional structures. Its characteristics are: 1) Self-attention mechanism: the core component of Transformers, which allows the model to assign a different attention weight to each element in the sequence. 2) Position encoding: the order information of the elements in a sequence is given to the model by position encodings. 3) Multi-head attention: each Transformer layer contains multiple attention "heads", so the model can focus on multiple positions in the sequence simultaneously. 4) Feed-forward neural network: after the attention sublayer in each Transformer layer, there is a feed-forward neural network that further processes the self-attention output.
Graph Transformers: a Transformer-based deep learning model architecture designed to process graph-structured data. It combines the self-attention mechanism of the traditional Transformer architecture with processing techniques specific to graph data, and can thus capture complex structures and relations in a graph. Its characteristics are: 1) Self-attention and graph structure: unlike conventional Transformers, the self-attention mechanism of Graph Transformers takes the relations between nodes in the graph into account, enabling the model to focus on other nodes that are related to or have an influence on the current node. 2) Position encoding: provides the structural context of a node in the graph or its specific topological properties.
The invention has the beneficial effects that:
compared with the prior art, the binary function similarity detection method has the advantages that the binary function similarity detection is performed by using the graph converters, so that complex function structures can be processed, the method can be suitable for a large-scale binary code library, and the detection accuracy and efficiency are high.
The method can be widely applied to multiple fields such as binary code analysis, malicious software detection, binary vulnerability discovery, software copyright protection and the like, and has wide application prospect.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a basic flow diagram of a binary function similarity detection method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to examples and figures, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides a binary function similarity detection method based on Graph Transformers, which comprises the following steps:
s1, acquiring two binary functions to be detected, and respectively carrying out static analysis on the acquired binary functions to acquire a control flow graph of the binary functions and assembly instruction segments of each node.
S2, normalizing the operation code of the assembly instruction in the assembly instruction section; the method comprises the following steps:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the opcode;
replacing all numerical constants with "num";
replacing all character strings with "str";
replacing the names of all non-system and non-library functions with "fun".
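As an illustration, the normalization rules above can be sketched with a few regular expressions. The helper names, the register patterns, and the size mapping below are hypothetical and cover only common x86 cases; they are not part of the patented method itself:

```python
import re

def normalize_operand(op: str) -> str:
    """Normalize a single assembly operand per the rules above (illustrative)."""
    op = op.strip()
    # General-purpose register -> "reg{size in bytes}" (x86 subset only)
    if re.fullmatch(r"[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r\d+[bwd]?", op):
        size = 8 if op.startswith("r") else 4 if op.startswith("e") else 2
        return f"reg{size}"
    # Memory reference (bracketed) or long address-like constant -> "mem"
    if "[" in op or re.fullmatch(r"0x[0-9a-fA-F]{5,}", op):
        return "mem"
    # Numeric constant -> "num"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):
        return "num"
    # String literal -> "str"
    if op.startswith('"') or op.startswith("offset "):
        return "str"
    # Remaining symbols treated as non-system, non-library function names -> "fun"
    return "fun"

def normalize_instruction(ins: str) -> str:
    """Keep the mnemonic, normalize each comma-separated operand."""
    mnemonic, _, rest = ins.partition(" ")
    if not rest:
        return mnemonic
    return mnemonic + " " + ",".join(normalize_operand(o) for o in rest.split(","))
```

For example, `mov eax,0x1` becomes `mov reg4,num`, and `call _printf` becomes `call fun`.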
S3, encoding each node in the control flow graph through an unsupervised contrastive learning method, extracting the semantic information of the instructions, and generating the feature vector of each node; specifically:
S301, acquiring the assembly instruction segments $X=\{x_1, x_2, \dots, x_m\}$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ is the number of assembly instruction segments;
S302, creating a positive example for each $x_i$ by applying two different, independently sampled dropout masks. Concretely, when the node features are encoded, the same input is fed into the encoder twice with different dropout masks and attention probabilities on the fully connected layers, yielding two embeddings of $x_i$ under two different dropout masks:
$$h_i^{z} = f(x_i, z), \quad h_i^{z'} = f(x_i, z');$$
where $z$ and $z'$ are two different random dropout masks, $h_i^{z}$ and $h_i^{z'}$ are the embeddings of $x_i$ under dropout masks $z$ and $z'$ respectively, and $f$ is the node encoder;
for a mini-batch with $N$ pairs, the training objective is:
$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}};$$
where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the examples in the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function.
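The training objective above is an in-batch contrastive (InfoNCE) loss over two dropout views of the same inputs. The following numpy sketch shows how the loss is computed from the two embedding matrices; the dropout-based encoder itself is omitted, and the temperature value is an illustrative assumption:

```python
import numpy as np

def contrastive_loss(h, h_prime, tau=0.05):
    """In-batch contrastive loss over two dropout views.

    h, h_prime: (N, d) arrays -- embeddings of the same N instruction
    segments under two different dropout masks.
    """
    # Cosine similarity sim(h_i, h'_j) for all pairs in the mini-batch
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    hpn = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    sim = hn @ hpn.T / tau                          # (N, N)
    # l_i = -log( exp(sim_ii) / sum_j exp(sim_ij) ): softmax cross-entropy
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When the two views of each segment coincide and differ across segments, the loss approaches zero, which matches the objective's intent of pulling the two dropout views of the same segment together.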
And S4, extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding. Specifically: the graph data are input into the Transformer model, encoding information is added, the node feature vectors are aggregated and the structural information of the graph is captured, and finally the graph embedding feature vector is output.
The encoding information may be added in any of the following ways:
adding a new position encoding to the input of the Transformer model to help encode distance-awareness information;
adding a new spatial encoding as a bias term in the self-attention module of the Transformer model to accurately capture the spatial dependencies in the graph;
or adopting both of the above methods, i.e. adding a new position encoding to the Transformer input and a new spatial encoding as a bias term in the Transformer self-attention module.
The method of adding a new position encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are pre-computed; they are defined by the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$
where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ correspond to the eigenvalues and eigenvectors of the Laplacian matrix respectively, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) Position encodings are added to the input data. Let $\lambda_i$ denote the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
first, for each node $i$ in the control flow graph, the node feature vector $x_i$ is embedded via a linear projection into the $d$-dimensional hidden feature $h_i^{0}$:
$$h_i^{0} = A^{0} x_i + a^{0};$$
where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into $d$-dimensional hidden features;
then the pre-computed node position encodings $\lambda_i$ are embedded via a linear projection into $d$-dimensional feature vectors $\lambda_i^{0}$ and added to the node features $h_i^{0}$, generating the new node features $\hat{h}_i^{0}$:
$$\lambda_i^{0} = C^{0} \lambda_i + c^{0}, \quad \hat{h}_i^{0} = h_i^{0} + \lambda_i^{0};$$
where $\lambda_i^{0}$ denotes the position-encoding feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding into $d$-dimensional feature vectors.
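The pre-computation of the Laplacian position encodings can be sketched in a few lines of numpy; treating the CFG as an undirected graph is an assumption here, since the symmetric eigendecomposition requires a symmetric matrix:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Pre-compute k-dimensional Laplacian position encodings for one graph.

    A: (n, n) symmetric 0/1 adjacency matrix; returns the k smallest
    non-trivial eigenvectors of Delta = I - D^{-1/2} A D^{-1/2}.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # Symmetric normalized Laplacian
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]             # drop the trivial eigenvector
```

Each row of the returned matrix is the position encoding $\lambda_i$ of one node, ready to be linearly projected and added to the node features as described above.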
The method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to their spatial relation, and defining a function based on the connectivity between nodes in the graph to measure the spatial relation between two nodes in the graph; specifically:
for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relation between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined through the connectivity between the nodes in graph $G$;
if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest path distance (SPD) between nodes $v_i$ and $v_j$;
if the two nodes are not connected, the output value of $\phi$ is set to a special value;
a learnable scalar is assigned to each feasible output value and used as a bias term in the self-attention module. Specifically, the element $(i, j)$ of the Query-Key product matrix $A$ is expressed as:
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$
where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
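The function $\phi$ can be implemented as a plain breadth-first search over the graph's adjacency lists. The value −1 used for unreachable pairs is an illustrative choice for the "special value" mentioned above:

```python
from collections import deque

def spatial_relation(adj, i, j, unreachable=-1):
    """phi(v_i, v_j): shortest-path distance between nodes i and j via BFS,
    or a special value (here -1, an illustrative choice) when j is not
    reachable from i. adj maps each node to its list of neighbors."""
    if i == j:
        return 0
    seen, frontier, dist = {i}, deque([i]), {i: 0}
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                dist[v] = dist[u] + 1
                if v == j:
                    return dist[v]
                frontier.append(v)
    return unreachable
```

In a full implementation, each possible return value of this function would index a learnable scalar $b_{\phi(v_i, v_j)}$ added to the corresponding attention logit.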
And S5, concatenating the graph embedding feature vectors of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar; specifically:
$$p = \mathrm{FC}(g_1 \,\|\, g_2);$$
where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\|$ denotes concatenation, and $p$ is the probability that the two functions are similar.
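Step S5 can be sketched as follows. The sigmoid output is an assumption made so that the layer produces a probability (the text only specifies a fully connected layer), and the weight shapes are illustrative:

```python
import numpy as np

def similarity_probability(g1, g2, w, b):
    """p = FC(g1 || g2): concatenate the two graph embeddings and apply a
    single fully connected (linear) layer with a sigmoid output."""
    z = np.concatenate([g1, g2])   # feature-vector concatenation g1 || g2
    logit = float(w @ z + b)       # fully connected layer; w: (2d,), b: scalar
    return 1.0 / (1.0 + np.exp(-logit))
```

With zero weights the output is exactly 0.5, i.e. maximal uncertainty; training would adjust `w` and `b` so that similar function pairs score close to 1.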
Example 2
The present embodiment provides a binary function similarity detection system based on Graph Transformers, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the binary function similarity detection method based on Graph Transformers. The system specifically comprises:
the static analysis module is used for acquiring two binary functions to be detected; respectively carrying out static analysis on the obtained binary function to obtain a control flow diagram of the binary function and an assembly instruction segment of each node;
the preprocessing module is used for carrying out normalization processing on the operation code of the assembly instruction in the assembly instruction section obtained after static analysis;
the semantic perception module, used for encoding each node in the control flow graph through the unsupervised contrastive learning method, generating the feature vector of each node and extracting the semantic information of the assembly instructions;
the structure perception module, used for extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module, used for concatenating the graph embedding feature vectors of the control flow graphs of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar.
Based on the same inventive concept, this embodiment is the system embodiment corresponding to the above method embodiment and may be implemented in cooperation with it. The related technical details mentioned in the above embodiment remain valid here, and repeated descriptions are omitted.
The method provided by the embodiment of the invention not only can process complex function structures, but also can adapt to a large-scale binary code library, and has higher detection accuracy and efficiency. The embodiment of the invention can be widely applied to a plurality of fields such as binary code analysis, malicious software detection, binary vulnerability discovery, software copyright protection and the like, and has wide application prospect.
It is apparent that the above-described embodiments are merely preferred examples of the present invention and do not limit its embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively list all embodiments here; any obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the invention.
Claims (10)
1. A binary function similarity detection method based on graph Transformers, characterized by comprising the following steps:
S1, acquiring two binary functions to be detected, and performing static analysis on each acquired binary function to obtain its control flow graph and the assembly instruction segment of each node;
S2, normalizing the opcodes of the assembly instructions in the assembly instruction segments;
S3, encoding each node in the control flow graph by an unsupervised contrastive learning method to generate a feature vector for each node;
S4, extracting the structural information of the control flow graph using a graph Transformers model, and aggregating the node feature information with the structural information of the control flow graph to generate a graph embedding;
and S5, concatenating the feature vectors of the two binary functions and feeding them into a fully connected layer to obtain the probability that the two binary functions are similar.
2. The binary function similarity detection method based on graph Transformers according to claim 1, wherein the normalization of the opcodes of the assembly instructions in step S2 specifically comprises:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the instruction;
replacing all numeric constants with "num";
replacing all character strings with "str";
replacing the function names of all non-system, non-library functions with "fun".
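The normalization rules above can be sketched as a small Python routine; the regexes, the register-size table, and the x86 operand syntax are illustrative assumptions, not the patent's exact implementation (the "fun" rule is omitted, since it requires symbol information from the disassembler):

```python
import re

# Tiny register-size table for illustration only; a real normalizer would
# cover the full register set of the target architecture.
REG_SIZES = {"rax": 8, "rbx": 8, "eax": 4, "ebx": 4, "ax": 2, "al": 1}

def normalize_operand(op: str) -> str:
    op = op.strip()
    if re.fullmatch(r"\[.*\]", op):                   # memory address -> "mem"
        return "mem"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):   # numeric constant -> "num"
        return "num"
    if op.startswith('"') and op.endswith('"'):       # string literal -> "str"
        return "str"
    if op in REG_SIZES:                               # register -> reg{size}
        return f"reg{REG_SIZES[op]}"
    return op

def normalize_instruction(ins: str) -> str:
    parts = ins.split(None, 1)
    if len(parts) == 1:                               # opcode with no operands
        return ins
    opcode, operands = parts
    return opcode + " " + ",".join(
        normalize_operand(o) for o in operands.split(","))

print(normalize_instruction("mov rax,[rbp-0x8]"))  # -> mov reg8,mem
print(normalize_instruction("add eax,0x10"))       # -> add reg4,num
```

A normalizer like this makes instruction tokens compiler- and address-independent before they reach the node encoder.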
3. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the node feature vectors, specifically as follows:
s301, acquiring assembly instruction segments of nodes in control flow diagrams, wherein ,/>In representation +.>The number of assembler instruction segments, m, represents the number of assembler instruction segments;
s302, then by pairingApplying different, independently sampled discard masks to create a positive example for it; the specific operation is that when the node characteristics are encoded by the encoder, the same input is input into the encoder twice by using different discard masks and attention probabilities on the full connection layer, so as to obtain two pieces of information with different discard masks> and />Is->Is embedded in the mold; namely:
,/> ;
wherein ,、/>is a random mask of two different discards, < ->、/>Drop mask +.>Is->Is embedded in (i)>Is a node encoder;
has the following characteristics ofThe training targets of the small batch of pairs are:
;
wherein ,indicate->Loss value of individual assembler instruction segment, +.>Representing the index in a small lot,/->Mask for discarding->A kind of electronic deviceIs embedded in (i)>Is temperature super parameter, < >>As a cosine similarity function.
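The mini-batch objective above can be sketched in plain Python; the encoder and the dropout masking themselves are omitted, and only the loss over already-computed embedding pairs is shown (the temperature value 0.05 is an assumed default, not taken from the patent):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(h, h_prime, tau=0.05):
    """Mean loss over a mini-batch of N (h_i, h_i') embedding pairs, where
    h_i and h_i' come from encoding the same instruction segment under two
    independently sampled dropout masks."""
    n = len(h)
    total = 0.0
    for i in range(n):
        numerator = math.exp(cosine(h[i], h_prime[i]) / tau)
        denominator = sum(math.exp(cosine(h[i], h_prime[j]) / tau)
                          for j in range(n))
        total += -math.log(numerator / denominator)
    return total / n
```

With identical positive pairs and orthogonal negatives the loss is near zero; mismatched pairs drive it up, which is what pushes embeddings of the same segment together.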
4. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S4 specifically refers to: inputting the graph data into the Transformers model, adding encoding information, aggregating the node feature vectors while capturing the structural information of the graph, and finally outputting the graph-embedding feature vector.
5. The binary function similarity detection method based on graph Transformers according to claim 4, wherein the method for adding encoding information in step S4 comprises: adding a new position encoding to the input of the Transformers model to help encode distance-aware information, and/or adding a new spatial encoding as a bias term in the self-attention module of the Transformers model to accurately capture the spatial dependencies in the graph.
6. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method for adding a new position encoding to the input of the Transformers model is to encode node position information for an arbitrary graph using the Laplacian eigenvectors, specifically comprising:

(1) pre-computing the Laplacian eigenvectors of all graphs in the dataset, the Laplacian eigenvectors being defined by the factorization of the graph Laplacian matrix:

$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$

where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ are respectively the eigenvalues and eigenvectors of the Laplacian matrix, and $\Delta$ denotes the graph Laplacian matrix;
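Step (1) can be sketched with NumPy; the dense eigendecomposition below is a minimal illustration (a real pipeline would likely symmetrize the directed CFG adjacency first and use a sparse solver for large graphs):

```python
import numpy as np

def laplacian_eigvecs(adj: np.ndarray, k: int) -> np.ndarray:
    """Return the k smallest non-trivial eigenvectors of the symmetric
    normalized Laplacian  L = I - D^(-1/2) A D^(-1/2)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5          # guard against isolated nodes
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)          # columns in ascending eigenvalue order
    return eigvecs[:, 1:k + 1]                # drop the trivial eigenvector
```

`numpy.linalg.eigh` returns orthonormal eigenvectors sorted by ascending eigenvalue, so slicing off the first column discards the trivial (smallest-eigenvalue) mode.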
(2) adding the position encoding to the input data, with $\lambda_i$ denoting the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$; the specific method is as follows:

first, for each node $i$ in the control flow graph, its node feature vector $\alpha_i$ is embedded into a $d$-dimensional hidden feature $h_i^{0}$:

$$h_i^{0} = A^{0}\alpha_i + a^{0};$$

where $A^{0}$ and $a^{0}$ are the linear projection layer parameters for embedding the hidden feature;

then the pre-computed node position encoding is embedded by linear projection into a $d$-dimensional feature vector $\lambda_i^{0}$ and added to the node feature $h_i^{0}$ to generate the new node feature $\hat{h}_i^{0}$:

$$\hat{h}_i^{0} = h_i^{0} + \lambda_i^{0}, \qquad \lambda_i^{0} = C^{0}\lambda_i + c^{0};$$

where $\lambda_i^{0}$ represents the position-encoded feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding the $k$-dimensional feature vector.
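Step (2) reduces to two learned linear projections plus an addition. A minimal NumPy sketch follows; the dimensions (d_n = 16, k = 4, d = 32) and the random initialization are placeholder assumptions, since in practice these parameters are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, k, d = 16, 4, 32   # node-feature dim, PE dim, hidden dim (placeholders)

A0 = rng.normal(scale=0.1, size=(d, d_n))  # projects raw node features
a0 = np.zeros(d)
C0 = rng.normal(scale=0.1, size=(d, k))    # projects Laplacian position encodings
c0 = np.zeros(d)

def embed_node(alpha_i, lam_i):
    h0 = A0 @ alpha_i + a0    # h_i^0: d-dimensional hidden node feature
    lam0 = C0 @ lam_i + c0    # lambda_i^0: projected position encoding
    return h0 + lam0          # hat h_i^0: position-aware node feature
```

The addition (rather than concatenation) keeps the hidden width at $d$ while still letting attention distinguish structurally distant nodes.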
7. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method of adding a new spatial encoding as a bias term in the self-attention module of the Transformers model comprises assigning a learnable embedding to each node pair according to its spatial relationship, and defining a function according to the connectivity between nodes in the graph to measure the spatial relationship between two nodes in the graph; specifically:

for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relationship between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined by the connectivity between the nodes in graph $G$;

if the two nodes are connected, $\phi(v_i, v_j)$ is the length of the shortest path between nodes $v_i$ and $v_j$;

if the two nodes are not connected, the output value of $\phi(v_i, v_j)$ is set to a special value;

a learnable scalar is assigned to each output value and used as a bias term in the self-attention module; specifically, $A_{ij}$ is expressed as the $(i,j)$-element of the Query-Key product matrix $A$:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$

where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose of $h_j W_K$, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ represent the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is a learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
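The spatial encoding of claim 7 can be sketched as follows: BFS shortest-path distances index a table of learnable scalars that biases the Query-Key scores. Using -1 for unconnected pairs and reserving the table's last slot for that special value are implementation assumptions of this sketch:

```python
import numpy as np
from collections import deque

def shortest_paths(adj):
    """All-pairs shortest-path lengths by BFS; -1 marks unconnected pairs."""
    n = len(adj)
    dist = -np.ones((n, n), dtype=int)
    for s in range(n):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def attention_scores(H, Wq, Wk, dist, bias_table):
    """A_ij = (h_i Wq)(h_j Wk)^T / sqrt(d) + b_phi(i,j); the table's last
    slot serves as the learnable scalar for unconnected pairs."""
    d = Wq.shape[1]
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(d)
    return scores + bias_table[dist]   # dist == -1 indexes the last slot
```

Because `bias_table` is indexed only by distance, the same scalar is shared by every node pair at that distance and, as the claim states, can be shared across all layers.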
8. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S5 takes the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, concatenates them and inputs them into the fully connected layer to obtain the probability that the two binary functions are similar, specifically as follows:

$$\hat{y} = \mathrm{FC}(g_1 \,\|\, g_2);$$

where $\mathrm{FC}$ represents the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the functions, $\|$ denotes concatenation, and $\hat{y}$ is the probability that the two functions are similar.
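Step S5 amounts to one fully connected layer over the concatenated graph embeddings. In the sketch below the weights are random placeholders standing in for a trained layer, and the sigmoid output is an assumption about how the layer yields a probability:

```python
import math
import random

random.seed(0)
d = 8                                               # graph-embedding dim (assumed)
w = [random.gauss(0, 0.1) for _ in range(2 * d)]    # untrained FC weights (placeholder)
b = 0.0

def similarity_probability(g1, g2):
    """Concatenate the two graph embeddings and apply a single fully
    connected layer with a sigmoid to obtain a similarity probability."""
    x = g1 + g2                                     # list concatenation == vector concat
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))               # squash to (0, 1)
```

In training, this layer's weights would be optimized end-to-end together with the node encoder and the graph Transformers model.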
9. A binary function similarity detection system based on graph Transformers, comprising a microprocessor and a memory which are interconnected, characterized in that the microprocessor is programmed or configured to perform the steps of the binary function similarity detection method based on graph Transformers according to any one of claims 1-7.
10. The binary function similarity detection system based on graph Transformers according to claim 9, specifically comprising:
the static analysis module is used for acquiring the two binary functions to be detected, and performing static analysis on each acquired binary function to obtain its control flow graph and the assembly instruction segment of each node;
the preprocessing module is used for normalizing the opcodes of the assembly instructions in the assembly instruction segments obtained from the static analysis;
the semantic perception module is used for encoding each node in the control flow graph by an unsupervised contrastive learning method and generating a feature vector for each node;
the structure perception module is used for extracting the structural information of the control flow graph using the graph Transformers model, aggregating the node feature information with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module is used for concatenating the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, and feeding them into the fully connected layer to obtain the probability that the two binary functions are similar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310931335.4A CN116663004B (en) | 2023-07-27 | 2023-07-27 | Binary function similarity detection method and system based on graph Transformers
Publications (2)
Publication Number | Publication Date |
---|---|
CN116663004A true CN116663004A (en) | 2023-08-29 |
CN116663004B CN116663004B (en) | 2023-09-29 |
Family
ID=87720918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310931335.4A Active CN116663004B (en) | 2023-07-27 | 2023-07-27 | Binary function similarity detection method and system based on graph transformations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116663004B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688150A (en) * | 2019-09-03 | 2020-01-14 | 华中科技大学 | Binary file code search detection method and system based on tensor operation |
CN112163226A (en) * | 2020-11-30 | 2021-01-01 | 中国人民解放军国防科技大学 | Binary similarity detection method based on graph automatic encoder |
CN112733137A (en) * | 2020-12-24 | 2021-04-30 | 哈尔滨工业大学 | Binary code similarity analysis method for vulnerability detection |
US20210216577A1 (en) * | 2020-01-13 | 2021-07-15 | Adobe Inc. | Reader-retriever approach for question answering |
CN115113877A (en) * | 2022-07-06 | 2022-09-27 | 上海交通大学 | Cross-architecture binary code similarity detection method and system |
CN115858002A (en) * | 2023-02-06 | 2023-03-28 | 湖南大学 | Binary code similarity detection method and system based on graph comparison learning and storage medium |
CN116032654A (en) * | 2023-02-13 | 2023-04-28 | 山东省计算中心(国家超级计算济南中心) | Firmware vulnerability detection and data security management method and system |
WO2023072421A1 (en) * | 2021-10-29 | 2023-05-04 | NEC Laboratories Europe GmbH | System and method for inductive learning on graphs with knowledge from language models |
Non-Patent Citations (3)
Title |
---|
CUI Baojiang; MA Ding; HAO Yongle; WANG Jianxin: "Binary file comparison technology based on basic-block signatures and jump relations", Journal of Tsinghua University (Science and Technology), no. 10 *
WANG Gongbo; JIANG Liehui; SI Binbin; DONG Weiyu: "Similarity detection of firmware stack-overflow vulnerabilities based on stack structure recovery", Journal of Information Engineering University, no. 02, pages 1-10 *
CHEN Yu; LIU Zhongjin; ZHAO Weiwei; MA Yuan; SHI Zhiqiang; SUN Limin: "A large-scale cross-platform method for retrieving homologous binary files", Journal of Computer Research and Development, no. 07, pages 1-10 *
Also Published As
Publication number | Publication date |
---|---|
CN116663004B (en) | 2023-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111459491B (en) | Code recommendation method based on tree neural network | |
CN113010209A (en) | Binary code similarity comparison technology for resisting compiling difference | |
Gao et al. | Semantic learning based cross-platform binary vulnerability search for IoT devices | |
Lee et al. | Visual question answering over scene graph | |
CN115951883B (en) | Service component management system of distributed micro-service architecture and method thereof | |
CN116336400B (en) | Baseline detection method for oil and gas gathering and transportation pipeline | |
CN115344863A (en) | Malicious software rapid detection method based on graph neural network | |
CN115033890A (en) | Comparison learning-based source code vulnerability detection method and system | |
CN116405326A (en) | Information security management method and system based on block chain | |
CN113904844B (en) | Intelligent contract vulnerability detection method based on cross-mode teacher-student network | |
Kim et al. | A comparative study on performance of deep learning models for vision-based concrete crack detection according to model types | |
CN114723003A (en) | Event sequence prediction method based on time sequence convolution and relational modeling | |
CN116861431B (en) | Malicious software classification method and system based on multichannel image and neural network | |
CN116663004B (en) | Binary function similarity detection method and system based on graph Transformers | |
CN115240120B (en) | Behavior identification method based on countermeasure network and electronic equipment | |
CN117009968A (en) | Homology analysis method and device for malicious codes, terminal equipment and storage medium | |
CN114997360B (en) | Evolution parameter optimization method, system and storage medium of neural architecture search algorithm | |
CN116958809A (en) | Remote sensing small sample target detection method for feature library migration | |
CN111562943B (en) | Code clone detection method and device based on event embedded tree and GAT network | |
CN115758362A (en) | Multi-feature-based automatic malicious software detection method | |
CN116226864A (en) | Network security-oriented code vulnerability detection method and system | |
CN116628695A (en) | Vulnerability discovery method and device based on multitask learning | |
Li et al. | STADE-CDNet: Spatial–Temporal Attention With Difference Enhancement-Based Network for Remote Sensing Image Change Detection | |
CN113986251A (en) | GUI prototype graph code conversion method based on convolution and cyclic neural network | |
CN114091021A (en) | Malicious code detection method for electric power enterprise safety protection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||