CN116663004A - Binary function similarity detection method and system based on Graph Transformers - Google Patents
Binary function similarity detection method and system based on Graph Transformers
- Publication number
- CN116663004A (application CN202310931335.4A)
- Authority
- CN
- China
- Prior art keywords
- graph
- node
- binary
- control flow
- functions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention provides a binary function similarity detection method and system based on Graph Transformers. The method comprises the following steps: performing static analysis on a binary function to obtain its control flow graph (CFG) and the assembly instruction segment of each node; normalizing the opcodes of the assembly instructions; encoding each node in the CFG with an unsupervised contrastive learning method to generate a feature vector for each node; extracting the structural information of the CFG with a Graph Transformer model to generate a graph embedding; and concatenating the graph embedding feature vectors of the two binary functions and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar. By using Graph Transformers for binary function similarity detection, the method and system can handle complex function structures, adapt to large-scale binary code libraries, and achieve high accuracy, efficiency, and scalability.
Description
Technical Field
The invention relates to the technical field of information security, and in particular to a binary function similarity detection method and system based on Graph Transformers.
Background
Binary analysis is an important component of many security applications, such as malware detection, binary vulnerability discovery, and software copyright protection. To extract meaningful information from binary code, researchers typically translate the functions of binary code into graph structures, such as control flow graphs (CFGs), and then use graph matching or graph isomorphism algorithms to compare whether two functions are similar.
Such as: chinese patent document CN202011367342.9 discloses a binary similarity detection method based on a graph automatic encoder, comprising preprocessing two binary functions to be compared; obtaining control dependency relationship between basic blocks in two binary functions by using a disassembly tool and an interface provided by the disassembly tool; recording control flow diagrams in two binary functions, extracting the digital characteristics of nodes of the control flow diagrams, focusing on the calling relation between functions, and calling the characteristics of other functions of the functions in a function entry basic block; encoding the obtained control flow graph by using a graph automatic encoder GAE to obtain an embedded vector describing the graph structure; after fusing the embedded vectors of the characterization graph structures of the two binary functions and the digital characteristic information of the nodes, calculating a training model through a loss function; the trained model can determine whether the given two binary functions are similar. According to the scheme, the semantic structure features and the node semantic features in the functions are reserved to the greatest extent, so that the accuracy of similarity comparison is improved.
Chinese patent document CN202110690580.1 discloses a method for similarity analysis of smart contract binary functions, comprising: decompiling the bytecode to generate EVM instructions and their parameters; reconstructing a control flow graph (CFG) from the decompiled EVM instructions; dividing the CFG of a contract into a plurality of binary functions and determining the timing relationships of the edges in the CFG; extracting feature values and the graph structure; and designing a model based on a temporally aggregated graph structure, so that the similarity of two binary functions can be obtained by comparing the aggregated graph structures. This scheme mainly targets similarity analysis of smart contract binary functions and, by studying contract bytecode directly, can process most contracts that lack source code.
Another example: Chinese patent document CN202110607066.7 discloses a binary function similarity detection method that fuses influencing factors, which first preprocesses the two binary functions to obtain their control flow graphs (CFG1, CFG2); then extracts the features of each basic block in the CFGs, represents the basic blocks as feature vectors, and generates the corresponding attributed control flow graphs (ACFG1, ACFG2); the attributed control flow graphs ACFG1 and ACFG2 of the two functions are then input into two identical graph embedding networks and converted into corresponding high-dimensional vectors. The parameters of the graph embedding networks are trained by minimizing an objective function; the cosine distance between the two high-dimensional vectors is calculated and the similarity of the two binary functions is output. This scheme mainly addresses the information loss caused by graph-embedding-based binary function similarity detection methods that neglect the different influences of successor nodes and neighbor nodes on a vertex.
However, none of the above existing binary function similarity detection methods can effectively handle the complexity of functions and large-scale binary code libraries; they also over-rely on their datasets and lack scalability.
Disclosure of Invention
The invention aims to solve the following technical problem: in view of the shortcomings of the prior art, a binary function similarity detection method and system based on Graph Transformers are provided. The structural information of the CFG is extracted with a Graph Transformer model to generate a graph embedding and realize similarity detection of binary functions, so that complex function structures can be handled, large-scale binary code libraries can be accommodated, and high detection accuracy, efficiency, and scalability are achieved.
In order to solve the technical problems, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a binary function similarity detection method based on Graph Transformers, comprising the following steps:
s1, acquiring two binary functions to be detected, and respectively carrying out static analysis on the acquired binary functions to acquire a control flow graph of the binary functions and assembly instruction segments of each node;
s2, normalizing the operation code of the assembly instruction in the assembly instruction section;
s3, coding each node in the control flow graph through an unsupervised contrast learning method to generate a feature vector of each node;
s4, extracting the structural information of the control flow graph with the Graph Transformer model, and aggregating the feature information of the nodes with the structural information of the control flow graph to generate the graph embedding;
and S5, concatenating the graph embedding feature vectors of the two binary functions and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar.
Further, the normalization processing of the operation code of the assembly instruction in step S2 specifically includes:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the opcode;
replacing all numerical constants with "num";
replacing all character strings with "str";
replacing the names of all non-system and non-library functions with "fun".
Further, step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the feature vector of each node, specifically as follows:
S301, acquiring the assembly instruction segments $X=\{x_1, x_2, \dots, x_m\}$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ is the number of assembly instruction segments;
S302, creating a positive example for each $x_i$ by applying two different, independently sampled dropout masks. Concretely, when the node features are encoded, the same input is fed into the encoder twice with different dropout masks and attention probabilities on the fully connected layers, yielding two embeddings of $x_i$ under two different dropout masks:
$$h_i^{z} = f(x_i, z), \quad h_i^{z'} = f(x_i, z');$$
where $z$ and $z'$ are two different random dropout masks, $h_i^{z}$ and $h_i^{z'}$ are the embeddings of $x_i$ under dropout masks $z$ and $z'$ respectively, and $f$ is the node encoder;
for a mini-batch with $N$ pairs, the training objective is:
$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}};$$
where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the examples in the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function.
Further, step S4 specifically refers to: inputting the graph data into the Transformer model, adding encoding information, aggregating the node feature vectors and capturing the structural information of the graph, and finally outputting the graph embedding feature vector.
Further, the method for adding the encoding information in step S4 includes:
one scheme adds a new position encoding to the input of the Transformer model to help encode distance-awareness information;
alternatively, a scheme adds a new spatial encoding as a bias term in the self-attention module of the Transformer model to accurately capture the spatial dependencies in the graph;
alternatively, both of the above methods are adopted, i.e. adding a new position encoding to the Transformer input and a new spatial encoding as a bias term in the Transformer self-attention module.
Furthermore, the method of adding a new position encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are pre-computed; they are defined by the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$
where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ correspond to the eigenvalues and eigenvectors of the Laplacian matrix respectively, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) Position encodings are added to the input data. Let $\lambda_i$ denote the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
first, for each node $i$ in the control flow graph, the node feature vector $x_i$ is embedded via a linear projection into the $d$-dimensional hidden feature $h_i^{0}$:
$$h_i^{0} = A^{0} x_i + a^{0};$$
where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into $d$-dimensional hidden features;
then the pre-computed node position encodings $\lambda_i$ are embedded via a linear projection into $d$-dimensional feature vectors $\lambda_i^{0}$ and added to the node features $h_i^{0}$, generating the new node features $\hat{h}_i^{0}$:
$$\lambda_i^{0} = C^{0} \lambda_i + c^{0}, \quad \hat{h}_i^{0} = h_i^{0} + \lambda_i^{0};$$
where $\lambda_i^{0}$ denotes the position-encoding feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding into $d$-dimensional feature vectors.
Still further, the method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to their spatial relation, and defining a function based on the connectivity between nodes in the graph to measure the spatial relation between two nodes in the graph; specifically:
for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relation between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined through the connectivity between the nodes in graph $G$;
if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest path distance between nodes $v_i$ and $v_j$;
if the two nodes are not connected, the output value of $\phi$ is set to a special value;
a learnable scalar is assigned to each feasible output value and used as a bias term in the self-attention module. Specifically, the element $(i, j)$ of the Query-Key product matrix $A$ is expressed as:
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$
where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
Further, step S5 concatenates the graph embedding feature vectors of the control flow graphs of the two binary functions and inputs them into the fully connected layer to obtain the probability that the two binary functions are similar, specifically as follows:
$$p = \mathrm{FC}(g_1 \,\|\, g_2);$$
where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\|$ denotes concatenation, and $p$ is the probability that the two functions are similar.
In a second aspect, the present invention also provides a binary function similarity detection system based on Graph Transformers, comprising a microprocessor and a memory which are interconnected, the microprocessor being programmed or configured to perform the steps of the binary function similarity detection method based on Graph Transformers described above.
Further, the binary function similarity detection system based on Graph Transformers specifically comprises:
the static analysis module is used for acquiring two binary functions to be detected; respectively carrying out static analysis on the obtained binary function to obtain a control flow diagram of the binary function and an assembly instruction segment of each node;
the preprocessing module is used for carrying out normalization processing on the operation code of the assembly instruction in the assembly instruction section obtained after static analysis;
the semantic perception module, used for encoding each node in the control flow graph through the unsupervised contrastive learning method, generating the feature vector of each node and extracting the semantic information of the assembly instructions;
the structure perception module, used for extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module, used for concatenating the graph embedding feature vectors of the control flow graphs of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar.
The following terms used in the invention are explained:
Transformers: a neural network architecture mainly used for processing sequence data in deep learning. Its core idea is to use a self-attention mechanism (Self-Attention Mechanism) to capture dependencies in a sequence, without relying on traditional recurrent or convolutional structures. Its characteristics are: 1) Self-attention mechanism: the core component of Transformers, which allows the model to assign a different attention weight to each element in the sequence. 2) Position encoding: the order information of the elements in a sequence is given to the model by position encodings. 3) Multi-head attention: each Transformer layer contains multiple attention "heads", so the model can focus on multiple positions in the sequence simultaneously. 4) Feed-forward neural network: after the attention sublayer in each Transformer layer, there is a feed-forward neural network that further processes the self-attention output.
Graph Transformers: a Transformer-based deep learning model architecture designed to process graph-structured data. It combines the self-attention mechanism of the traditional Transformer architecture with processing techniques specific to graph data, and can thus capture complex structures and relations in a graph. Its characteristics are: 1) Self-attention and graph structure: unlike conventional Transformers, the self-attention mechanism of Graph Transformers takes the relations between nodes in the graph into account, enabling the model to focus on other nodes that are related to or have an influence on the current node. 2) Position encoding: provides the structural context of a node in the graph or its specific topological properties.
The invention has the beneficial effects that:
compared with the prior art, the binary function similarity detection method has the advantages that the binary function similarity detection is performed by using the graph converters, so that complex function structures can be processed, the method can be suitable for a large-scale binary code library, and the detection accuracy and efficiency are high.
The method can be widely applied to multiple fields such as binary code analysis, malicious software detection, binary vulnerability discovery, software copyright protection and the like, and has wide application prospect.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a basic flow diagram of a binary function similarity detection method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to examples and figures, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides a binary function similarity detection method based on Graph Transformers, which comprises the following steps:
s1, acquiring two binary functions to be detected, and respectively carrying out static analysis on the acquired binary functions to acquire a control flow graph of the binary functions and assembly instruction segments of each node.
S2, normalizing the operation code of the assembly instruction in the assembly instruction section; the method comprises the following steps:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the opcode;
replacing all numerical constants with "num";
replacing all character strings with "str";
replacing the names of all non-system and non-library functions with "fun".
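As an illustration, the normalization rules above can be sketched with a few regular expressions. The helper names, the register patterns, and the size mapping below are hypothetical and cover only common x86 cases; they are not part of the patented method itself:

```python
import re

def normalize_operand(op: str) -> str:
    """Normalize a single assembly operand per the rules above (illustrative)."""
    op = op.strip()
    # General-purpose register -> "reg{size in bytes}" (x86 subset only)
    if re.fullmatch(r"[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r\d+[bwd]?", op):
        size = 8 if op.startswith("r") else 4 if op.startswith("e") else 2
        return f"reg{size}"
    # Memory reference (bracketed) or long address-like constant -> "mem"
    if "[" in op or re.fullmatch(r"0x[0-9a-fA-F]{5,}", op):
        return "mem"
    # Numeric constant -> "num"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):
        return "num"
    # String literal -> "str"
    if op.startswith('"') or op.startswith("offset "):
        return "str"
    # Remaining symbols treated as non-system, non-library function names -> "fun"
    return "fun"

def normalize_instruction(ins: str) -> str:
    """Keep the mnemonic, normalize each comma-separated operand."""
    mnemonic, _, rest = ins.partition(" ")
    if not rest:
        return mnemonic
    return mnemonic + " " + ",".join(normalize_operand(o) for o in rest.split(","))
```

For example, `mov eax,0x1` becomes `mov reg4,num`, and `call _printf` becomes `call fun`.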
S3, encoding each node in the control flow graph through an unsupervised contrastive learning method, extracting the semantic information of the instructions, and generating the feature vector of each node; specifically:
S301, acquiring the assembly instruction segments $X=\{x_1, x_2, \dots, x_m\}$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ is the number of assembly instruction segments;
S302, creating a positive example for each $x_i$ by applying two different, independently sampled dropout masks. Concretely, when the node features are encoded, the same input is fed into the encoder twice with different dropout masks and attention probabilities on the fully connected layers, yielding two embeddings of $x_i$ under two different dropout masks:
$$h_i^{z} = f(x_i, z), \quad h_i^{z'} = f(x_i, z');$$
where $z$ and $z'$ are two different random dropout masks, $h_i^{z}$ and $h_i^{z'}$ are the embeddings of $x_i$ under dropout masks $z$ and $z'$ respectively, and $f$ is the node encoder;
for a mini-batch with $N$ pairs, the training objective is:
$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}};$$
where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the examples in the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function.
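The training objective above is an in-batch contrastive (InfoNCE) loss over two dropout views of the same inputs. The following numpy sketch shows how the loss is computed from the two embedding matrices; the dropout-based encoder itself is omitted, and the temperature value is an illustrative assumption:

```python
import numpy as np

def contrastive_loss(h, h_prime, tau=0.05):
    """In-batch contrastive loss over two dropout views.

    h, h_prime: (N, d) arrays -- embeddings of the same N instruction
    segments under two different dropout masks.
    """
    # Cosine similarity sim(h_i, h'_j) for all pairs in the mini-batch
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    hpn = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)
    sim = hn @ hpn.T / tau                          # (N, N)
    # l_i = -log( exp(sim_ii) / sum_j exp(sim_ij) ): softmax cross-entropy
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When the two views of each segment coincide and differ across segments, the loss approaches zero, which matches the objective's intent of pulling the two dropout views of the same segment together.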
And S4, extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding. Specifically: the graph data are input into the Transformer model, encoding information is added, the node feature vectors are aggregated and the structural information of the graph is captured, and finally the graph embedding feature vector is output.
The encoding information may be added in any of the following ways:
adding a new position encoding to the input of the Transformer model to help encode distance-awareness information;
adding a new spatial encoding as a bias term in the self-attention module of the Transformer model to accurately capture the spatial dependencies in the graph;
or adopting both of the above methods, i.e. adding a new position encoding to the Transformer input and a new spatial encoding as a bias term in the Transformer self-attention module.
The method of adding a new position encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are pre-computed; they are defined by the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$
where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ correspond to the eigenvalues and eigenvectors of the Laplacian matrix respectively, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) Position encodings are added to the input data. Let $\lambda_i$ denote the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
first, for each node $i$ in the control flow graph, the node feature vector $x_i$ is embedded via a linear projection into the $d$-dimensional hidden feature $h_i^{0}$:
$$h_i^{0} = A^{0} x_i + a^{0};$$
where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into $d$-dimensional hidden features;
then the pre-computed node position encodings $\lambda_i$ are embedded via a linear projection into $d$-dimensional feature vectors $\lambda_i^{0}$ and added to the node features $h_i^{0}$, generating the new node features $\hat{h}_i^{0}$:
$$\lambda_i^{0} = C^{0} \lambda_i + c^{0}, \quad \hat{h}_i^{0} = h_i^{0} + \lambda_i^{0};$$
where $\lambda_i^{0}$ denotes the position-encoding feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding into $d$-dimensional feature vectors.
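The pre-computation of the Laplacian position encodings can be sketched in a few lines of numpy; treating the CFG as an undirected graph is an assumption here, since the symmetric eigendecomposition requires a symmetric matrix:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Pre-compute k-dimensional Laplacian position encodings for one graph.

    A: (n, n) symmetric 0/1 adjacency matrix; returns the k smallest
    non-trivial eigenvectors of Delta = I - D^{-1/2} A D^{-1/2}.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # Symmetric normalized Laplacian
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]             # drop the trivial eigenvector
```

Each row of the returned matrix is the position encoding $\lambda_i$ of one node, ready to be linearly projected and added to the node features as described above.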
The method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to their spatial relation, and defining a function based on the connectivity between nodes in the graph to measure the spatial relation between two nodes in the graph; specifically:
for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relation between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined through the connectivity between the nodes in graph $G$;
if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest path distance (SPD) between nodes $v_i$ and $v_j$;
if the two nodes are not connected, the output value of $\phi$ is set to a special value;
a learnable scalar is assigned to each feasible output value and used as a bias term in the self-attention module. Specifically, the element $(i, j)$ of the Query-Key product matrix $A$ is expressed as:
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$
where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
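The function $\phi$ can be implemented as a plain breadth-first search over the graph's adjacency lists. The value −1 used for unreachable pairs is an illustrative choice for the "special value" mentioned above:

```python
from collections import deque

def spatial_relation(adj, i, j, unreachable=-1):
    """phi(v_i, v_j): shortest-path distance between nodes i and j via BFS,
    or a special value (here -1, an illustrative choice) when j is not
    reachable from i. adj maps each node to its list of neighbors."""
    if i == j:
        return 0
    seen, frontier, dist = {i}, deque([i]), {i: 0}
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                dist[v] = dist[u] + 1
                if v == j:
                    return dist[v]
                frontier.append(v)
    return unreachable
```

In a full implementation, each possible return value of this function would index a learnable scalar $b_{\phi(v_i, v_j)}$ added to the corresponding attention logit.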
And S5, concatenating the graph embedding feature vectors of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar; specifically:
$$p = \mathrm{FC}(g_1 \,\|\, g_2);$$
where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\|$ denotes concatenation, and $p$ is the probability that the two functions are similar.
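Step S5 can be sketched as follows. The sigmoid output is an assumption made so that the layer produces a probability (the text only specifies a fully connected layer), and the weight shapes are illustrative:

```python
import numpy as np

def similarity_probability(g1, g2, w, b):
    """p = FC(g1 || g2): concatenate the two graph embeddings and apply a
    single fully connected (linear) layer with a sigmoid output."""
    z = np.concatenate([g1, g2])   # feature-vector concatenation g1 || g2
    logit = float(w @ z + b)       # fully connected layer; w: (2d,), b: scalar
    return 1.0 / (1.0 + np.exp(-logit))
```

With zero weights the output is exactly 0.5, i.e. maximal uncertainty; training would adjust `w` and `b` so that similar function pairs score close to 1.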
Example 2
The present embodiment provides a binary function similarity detection system based on Graph Transformers, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the binary function similarity detection method based on Graph Transformers. The system specifically comprises:
the static analysis module is used for acquiring two binary functions to be detected; respectively carrying out static analysis on the obtained binary function to obtain a control flow diagram of the binary function and an assembly instruction segment of each node;
the preprocessing module is used for carrying out normalization processing on the operation code of the assembly instruction in the assembly instruction section obtained after static analysis;
the semantic perception module, used for encoding each node in the control flow graph through the unsupervised contrastive learning method, generating the feature vector of each node and extracting the semantic information of the assembly instructions;
the structure perception module, used for extracting the structural information of the control flow graph with the Graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module, used for concatenating the graph embedding feature vectors of the control flow graphs of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar.
Based on the same inventive concept, this embodiment is the system embodiment corresponding to the above method embodiment and may be implemented in cooperation with it. The related technical details mentioned in the above embodiment remain valid here, and repeated descriptions are omitted.
The method provided by the embodiment of the invention not only can process complex function structures, but also can adapt to a large-scale binary code library, and has higher detection accuracy and efficiency. The embodiment of the invention can be widely applied to a plurality of fields such as binary code analysis, malicious software detection, binary vulnerability discovery, software copyright protection and the like, and has wide application prospect.
It is apparent that the above-described embodiments are merely preferred examples of the present invention and do not limit its embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively list all embodiments here; any obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the invention.
Claims (10)
1. A binary function similarity detection method based on graph Transformers, characterized by comprising the following steps:
S1, acquiring two binary functions to be detected, and performing static analysis on each acquired binary function to obtain its control flow graph and the assembly instruction segment of each node;
S2, normalizing the opcodes of the assembly instructions in the assembly instruction segments;
S3, encoding each node in the control flow graph by an unsupervised contrastive learning method to generate a feature vector for each node;
S4, extracting the structural information of the control flow graph using a graph Transformers model, and aggregating the node feature information with the structural information of the control flow graph to generate a graph embedding;
and S5, concatenating the feature vectors of the two binary functions and feeding them into a fully connected layer to obtain the probability that the two binary functions are similar.
2. The binary function similarity detection method based on graph Transformers according to claim 1, wherein the normalization of the opcodes of the assembly instructions in step S2 specifically comprises:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the instruction;
replacing all numeric constants with "num";
replacing all character strings with "str";
replacing the function names of all non-system, non-library functions with "fun".
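The normalization rules above can be sketched as a small Python routine; the regexes, the register-size table, and the x86 operand syntax are illustrative assumptions, not the patent's exact implementation (the "fun" rule is omitted, since it requires symbol information from the disassembler):

```python
import re

# Tiny register-size table for illustration only; a real normalizer would
# cover the full register set of the target architecture.
REG_SIZES = {"rax": 8, "rbx": 8, "eax": 4, "ebx": 4, "ax": 2, "al": 1}

def normalize_operand(op: str) -> str:
    op = op.strip()
    if re.fullmatch(r"\[.*\]", op):                   # memory address -> "mem"
        return "mem"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):   # numeric constant -> "num"
        return "num"
    if op.startswith('"') and op.endswith('"'):       # string literal -> "str"
        return "str"
    if op in REG_SIZES:                               # register -> reg{size}
        return f"reg{REG_SIZES[op]}"
    return op

def normalize_instruction(ins: str) -> str:
    parts = ins.split(None, 1)
    if len(parts) == 1:                               # opcode with no operands
        return ins
    opcode, operands = parts
    return opcode + " " + ",".join(
        normalize_operand(o) for o in operands.split(","))

print(normalize_instruction("mov rax,[rbp-0x8]"))  # -> mov reg8,mem
print(normalize_instruction("add eax,0x10"))       # -> add reg4,num
```

A normalizer like this makes instruction tokens compiler- and address-independent before they reach the node encoder.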
3. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the node feature vectors, specifically as follows:
s301, acquiring assembly instruction segments of nodes in control flow diagrams, wherein ,/>In representation +.>The number of assembler instruction segments, m, represents the number of assembler instruction segments;
s302, then by pairingApplying different, independently sampled discard masks to create a positive example for it; the specific operation is that when the node characteristics are encoded by the encoder, the same input is input into the encoder twice by using different discard masks and attention probabilities on the full connection layer, so as to obtain two pieces of information with different discard masks> and />Is->Is embedded in the mold; namely:
,/> ;
wherein ,、/>is a random mask of two different discards, < ->、/>Drop mask +.>Is->Is embedded in (i)>Is a node encoder;
has the following characteristics ofThe training targets of the small batch of pairs are:
;
wherein ,indicate->Loss value of individual assembler instruction segment, +.>Representing the index in a small lot,/->Mask for discarding->A kind of electronic deviceIs embedded in (i)>Is temperature super parameter, < >>As a cosine similarity function.
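The mini-batch objective above can be sketched in plain Python; the encoder and the dropout masking themselves are omitted, and only the loss over already-computed embedding pairs is shown (the temperature value 0.05 is an assumed default, not taken from the patent):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(h, h_prime, tau=0.05):
    """Mean loss over a mini-batch of N (h_i, h_i') embedding pairs, where
    h_i and h_i' come from encoding the same instruction segment under two
    independently sampled dropout masks."""
    n = len(h)
    total = 0.0
    for i in range(n):
        numerator = math.exp(cosine(h[i], h_prime[i]) / tau)
        denominator = sum(math.exp(cosine(h[i], h_prime[j]) / tau)
                          for j in range(n))
        total += -math.log(numerator / denominator)
    return total / n
```

With identical positive pairs and orthogonal negatives the loss is near zero; mismatched pairs drive it up, which is what pushes embeddings of the same segment together.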
4. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S4 specifically refers to: inputting the graph data into the Transformers model, adding encoding information, aggregating the node feature vectors while capturing the structural information of the graph, and finally outputting the graph-embedding feature vector.
5. The binary function similarity detection method based on graph Transformers according to claim 4, wherein the method for adding encoding information in step S4 comprises: adding a new position encoding to the input of the Transformers model to help encode distance-aware information, and/or adding a new spatial encoding as a bias term in the self-attention module of the Transformers model to accurately capture the spatial dependencies in the graph.
6. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method for adding a new position encoding to the input of the Transformers model is to encode node position information for an arbitrary graph using the Laplacian eigenvectors, specifically comprising:

(1) pre-computing the Laplacian eigenvectors of all graphs in the dataset, the Laplacian eigenvectors being defined by the factorization of the graph Laplacian matrix:

$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U;$$

where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ are respectively the eigenvalues and eigenvectors of the Laplacian matrix, and $\Delta$ denotes the graph Laplacian matrix;
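Step (1) can be sketched with NumPy; the dense eigendecomposition below is a minimal illustration (a real pipeline would likely symmetrize the directed CFG adjacency first and use a sparse solver for large graphs):

```python
import numpy as np

def laplacian_eigvecs(adj: np.ndarray, k: int) -> np.ndarray:
    """Return the k smallest non-trivial eigenvectors of the symmetric
    normalized Laplacian  L = I - D^(-1/2) A D^(-1/2)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5          # guard against isolated nodes
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)          # columns in ascending eigenvalue order
    return eigvecs[:, 1:k + 1]                # drop the trivial eigenvector
```

`numpy.linalg.eigh` returns orthonormal eigenvectors sorted by ascending eigenvalue, so slicing off the first column discards the trivial (smallest-eigenvalue) mode.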
(2) adding the position encoding to the input data, with $\lambda_i$ denoting the Laplacian position encoding of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$; the specific method is as follows:

first, for each node $i$ in the control flow graph, its node feature vector $\alpha_i$ is embedded into a $d$-dimensional hidden feature $h_i^{0}$:

$$h_i^{0} = A^{0}\alpha_i + a^{0};$$

where $A^{0}$ and $a^{0}$ are the linear projection layer parameters for embedding the hidden feature;

then the pre-computed node position encoding is embedded by linear projection into a $d$-dimensional feature vector $\lambda_i^{0}$ and added to the node feature $h_i^{0}$ to generate the new node feature $\hat{h}_i^{0}$:

$$\hat{h}_i^{0} = h_i^{0} + \lambda_i^{0}, \qquad \lambda_i^{0} = C^{0}\lambda_i + c^{0};$$

where $\lambda_i^{0}$ represents the position-encoded feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters for embedding the $k$-dimensional feature vector.
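Step (2) reduces to two learned linear projections plus an addition. A minimal NumPy sketch follows; the dimensions (d_n = 16, k = 4, d = 32) and the random initialization are placeholder assumptions, since in practice these parameters are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, k, d = 16, 4, 32   # node-feature dim, PE dim, hidden dim (placeholders)

A0 = rng.normal(scale=0.1, size=(d, d_n))  # projects raw node features
a0 = np.zeros(d)
C0 = rng.normal(scale=0.1, size=(d, k))    # projects Laplacian position encodings
c0 = np.zeros(d)

def embed_node(alpha_i, lam_i):
    h0 = A0 @ alpha_i + a0    # h_i^0: d-dimensional hidden node feature
    lam0 = C0 @ lam_i + c0    # lambda_i^0: projected position encoding
    return h0 + lam0          # hat h_i^0: position-aware node feature
```

The addition (rather than concatenation) keeps the hidden width at $d$ while still letting attention distinguish structurally distant nodes.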
7. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method of adding a new spatial encoding as a bias term in the self-attention module of the Transformers model comprises assigning a learnable embedding to each node pair according to its spatial relationship, and defining a function according to the connectivity between nodes in the graph to measure the spatial relationship between two nodes in the graph; specifically:

for an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relationship between nodes $v_i$ and $v_j$ in graph $G$; the function $\phi$ is defined by the connectivity between the nodes in graph $G$;

if the two nodes are connected, $\phi(v_i, v_j)$ is the length of the shortest path between nodes $v_i$ and $v_j$;

if the two nodes are not connected, the output value of $\phi(v_i, v_j)$ is set to a special value;

a learnable scalar is assigned to each output value and used as a bias term in the self-attention module; specifically, $A_{ij}$ is expressed as the $(i,j)$-element of the Query-Key product matrix $A$:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)};$$

where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose of $h_j W_K$, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ represent the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is a learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
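The spatial encoding of claim 7 can be sketched as follows: BFS shortest-path distances index a table of learnable scalars that biases the Query-Key scores. Using -1 for unconnected pairs and reserving the table's last slot for that special value are implementation assumptions of this sketch:

```python
import numpy as np
from collections import deque

def shortest_paths(adj):
    """All-pairs shortest-path lengths by BFS; -1 marks unconnected pairs."""
    n = len(adj)
    dist = -np.ones((n, n), dtype=int)
    for s in range(n):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def attention_scores(H, Wq, Wk, dist, bias_table):
    """A_ij = (h_i Wq)(h_j Wk)^T / sqrt(d) + b_phi(i,j); the table's last
    slot serves as the learnable scalar for unconnected pairs."""
    d = Wq.shape[1]
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(d)
    return scores + bias_table[dist]   # dist == -1 indexes the last slot
```

Because `bias_table` is indexed only by distance, the same scalar is shared by every node pair at that distance and, as the claim states, can be shared across all layers.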
8. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S5 takes the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, concatenates them and inputs them into the fully connected layer to obtain the probability that the two binary functions are similar, specifically as follows:

$$\hat{y} = \mathrm{FC}(g_1 \,\|\, g_2);$$

where $\mathrm{FC}$ represents the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the functions, $\|$ denotes concatenation, and $\hat{y}$ is the probability that the two functions are similar.
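Step S5 amounts to one fully connected layer over the concatenated graph embeddings. In the sketch below the weights are random placeholders standing in for a trained layer, and the sigmoid output is an assumption about how the layer yields a probability:

```python
import math
import random

random.seed(0)
d = 8                                               # graph-embedding dim (assumed)
w = [random.gauss(0, 0.1) for _ in range(2 * d)]    # untrained FC weights (placeholder)
b = 0.0

def similarity_probability(g1, g2):
    """Concatenate the two graph embeddings and apply a single fully
    connected layer with a sigmoid to obtain a similarity probability."""
    x = g1 + g2                                     # list concatenation == vector concat
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))               # squash to (0, 1)
```

In training, this layer's weights would be optimized end-to-end together with the node encoder and the graph Transformers model.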
9. A binary function similarity detection system based on graph Transformers, comprising a microprocessor and a memory which are interconnected, characterized in that the microprocessor is programmed or configured to perform the steps of the binary function similarity detection method based on graph Transformers according to any one of claims 1-7.
10. The binary function similarity detection system based on graph Transformers according to claim 9, specifically comprising:
the static analysis module is used for acquiring the two binary functions to be detected, and performing static analysis on each acquired binary function to obtain its control flow graph and the assembly instruction segment of each node;
the preprocessing module is used for normalizing the opcodes of the assembly instructions in the assembly instruction segments obtained from the static analysis;
the semantic perception module is used for encoding each node in the control flow graph by an unsupervised contrastive learning method and generating a feature vector for each node;
the structure perception module is used for extracting the structural information of the control flow graph using the graph Transformers model, aggregating the node feature information with the structural information of the control flow graph, and generating the graph embedding;
and the similarity probability calculation module is used for concatenating the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, and feeding them into the fully connected layer to obtain the probability that the two binary functions are similar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310931335.4A CN116663004B (en) | 2023-07-27 | 2023-07-27 | Binary function similarity detection method and system based on graph Transformers
Publications (2)
Publication Number | Publication Date |
---|---|
CN116663004A true CN116663004A (en) | 2023-08-29 |
CN116663004B CN116663004B (en) | 2023-09-29 |
Family
ID=87720918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310931335.4A Active CN116663004B (en) | 2023-07-27 | 2023-07-27 | Binary function similarity detection method and system based on graph transformations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116663004B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688150A (en) * | 2019-09-03 | 2020-01-14 | 华中科技大学 | Binary file code search detection method and system based on tensor operation |
CN112163226A (en) * | 2020-11-30 | 2021-01-01 | 中国人民解放军国防科技大学 | Binary similarity detection method based on graph automatic encoder |
CN112733137A (en) * | 2020-12-24 | 2021-04-30 | 哈尔滨工业大学 | Binary code similarity analysis method for vulnerability detection |
US20210216577A1 (en) * | 2020-01-13 | 2021-07-15 | Adobe Inc. | Reader-retriever approach for question answering |
CN115113877A (en) * | 2022-07-06 | 2022-09-27 | 上海交通大学 | Cross-architecture binary code similarity detection method and system |
CN115858002A (en) * | 2023-02-06 | 2023-03-28 | 湖南大学 | Binary code similarity detection method and system based on graph comparison learning and storage medium |
CN116032654A (en) * | 2023-02-13 | 2023-04-28 | 山东省计算中心(国家超级计算济南中心) | Firmware vulnerability detection and data security management method and system |
WO2023072421A1 (en) * | 2021-10-29 | 2023-05-04 | NEC Laboratories Europe GmbH | System and method for inductive learning on graphs with knowledge from language models |
Non-Patent Citations (3)
Title |
---|
CUI Baojiang; MA Ding; HAO Yongle; WANG Jianxin: "Binary file comparison technology based on basic-block signatures and jump relations", Journal of Tsinghua University (Science and Technology), no. 10 *
WANG Gongbo; JIANG Liehui; SI Binbin; DONG Weiyu: "Similarity detection of firmware stack-overflow vulnerabilities based on stack structure recovery", Journal of Information Engineering University, no. 02, pages 1-10 *
CHEN Yu; LIU Zhongjin; ZHAO Weiwei; MA Yuan; SHI Zhiqiang; SUN Limin: "A large-scale cross-platform method for retrieving homologous binary files", Journal of Computer Research and Development, no. 07, pages 1-10 *
Also Published As
Publication number | Publication date |
---|---|
CN116663004B (en) | 2023-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111459491B (en) | Code recommendation method based on tree neural network | |
CN113010209A (en) | Binary code similarity comparison technology for resisting compiling difference | |
Gao et al. | Semantic learning based cross-platform binary vulnerability search for IoT devices | |
Lee et al. | Visual question answering over scene graph | |
CN115951883B (en) | Service component management system of distributed micro-service architecture and method thereof | |
CN116336400B (en) | Baseline detection method for oil and gas gathering and transportation pipeline | |
CN115344863A (en) | Malicious software rapid detection method based on graph neural network | |
CN115033890A (en) | Comparison learning-based source code vulnerability detection method and system | |
CN116405326A (en) | Information security management method and system based on block chain | |
CN113904844B (en) | Intelligent contract vulnerability detection method based on cross-mode teacher-student network | |
Kim et al. | A comparative study on performance of deep learning models for vision-based concrete crack detection according to model types | |
CN114723003A (en) | Event sequence prediction method based on time sequence convolution and relational modeling | |
CN116861431B (en) | Malicious software classification method and system based on multichannel image and neural network | |
CN116663004B (en) | Binary function similarity detection method and system based on graph Transformers | |
CN115240120B (en) | Behavior identification method based on countermeasure network and electronic equipment | |
CN117009968A (en) | Homology analysis method and device for malicious codes, terminal equipment and storage medium | |
CN114997360B (en) | Evolution parameter optimization method, system and storage medium of neural architecture search algorithm | |
CN116958809A (en) | Remote sensing small sample target detection method for feature library migration | |
CN111562943B (en) | Code clone detection method and device based on event embedded tree and GAT network | |
CN115758362A (en) | Multi-feature-based automatic malicious software detection method | |
CN116226864A (en) | Network security-oriented code vulnerability detection method and system | |
CN116628695A (en) | Vulnerability discovery method and device based on multitask learning | |
Li et al. | STADE-CDNet: Spatial–Temporal Attention With Difference Enhancement-Based Network for Remote Sensing Image Change Detection | |
CN113986251A (en) | GUI prototype graph code conversion method based on convolution and cyclic neural network | |
CN114091021A (en) | Malicious code detection method for electric power enterprise safety protection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||