CN116663004A - Binary function similarity detection method and system based on graph Transformers - Google Patents

Binary function similarity detection method and system based on graph Transformers

Info

Publication number
CN116663004A
Authority
CN
China
Prior art keywords
graph
node
binary
control flow
functions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310931335.4A
Other languages
Chinese (zh)
Other versions
CN116663004B (en)
Inventor
张云 (Zhang Yun)
刘玉玲 (Liu Yuling)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202310931335.4A
Publication of CN116663004A
Application granted
Publication of CN116663004B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/75Structural analysis for program understanding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Abstract

The invention provides a binary function similarity detection method and system based on graph Transformers. The method comprises the following steps: performing static analysis on the binary functions to obtain the control flow graph (CFG) of each function and the assembly instruction segment of each node; normalizing the opcodes of the assembly instructions; encoding each node in the CFG with an unsupervised contrastive learning method to generate a feature vector for each node; extracting the structural information of the CFG with a graph Transformer model to generate a graph embedding; and concatenating the feature vectors (graph embeddings) of the two binary functions and feeding them into a fully connected layer to obtain the probability that the two binary functions are similar. By using graph Transformers for binary function similarity detection, the method and system can handle complex function structures, can accommodate large-scale binary code bases, and achieve high accuracy, efficiency, and extensibility.

Description

Binary function similarity detection method and system based on graph Transformers
Technical Field
The invention relates to the technical field of information security, in particular to a binary function similarity detection method and system based on graph Transformers.
Background
Binary analysis is an important component of many security applications, such as malware detection, binary vulnerability discovery, and software copyright protection. To extract meaningful information from binary code, researchers typically translate the functions of the binary code into graph structures such as control flow graphs (CFGs), and then use graph matching or graph isomorphism algorithms to determine whether two functions are similar.
Such as: chinese patent document CN202011367342.9 discloses a binary similarity detection method based on a graph automatic encoder, comprising preprocessing two binary functions to be compared; obtaining control dependency relationship between basic blocks in two binary functions by using a disassembly tool and an interface provided by the disassembly tool; recording control flow diagrams in two binary functions, extracting the digital characteristics of nodes of the control flow diagrams, focusing on the calling relation between functions, and calling the characteristics of other functions of the functions in a function entry basic block; encoding the obtained control flow graph by using a graph automatic encoder GAE to obtain an embedded vector describing the graph structure; after fusing the embedded vectors of the characterization graph structures of the two binary functions and the digital characteristic information of the nodes, calculating a training model through a loss function; the trained model can determine whether the given two binary functions are similar. According to the scheme, the semantic structure features and the node semantic features in the functions are reserved to the greatest extent, so that the accuracy of similarity comparison is improved.
Chinese patent document CN202110690580.1 discloses a method for similarity analysis of smart contract binary functions: it decompiles the bytecode to generate EVM instructions and their parameters; reconstructs a control flow graph (CFG) from the decompiled EVM instructions; divides a contract's CFG into multiple binary functions and determines the timing relationships of the edges in the CFG; extracts feature values and the graph structure; and designs a model based on a time-sequence aggregation graph structure, obtaining the similarity of two binary functions by comparing the aggregated graph structures. This scheme is mainly aimed at similarity analysis of smart contract binary functions and, by studying the contract bytecode directly, can handle the majority of contracts that lack source code.
As another example, Chinese patent document CN202110607066.7 discloses a binary function similarity detection method that fuses influencing factors: it first preprocesses the two binary functions to obtain their control flow graphs (CFG1, CFG2); then extracts features for each basic block in the CFGs, represents the basic blocks as feature vectors, and generates the corresponding attributed control flow graphs (ACFG1, ACFG2); the two ACFGs are then input into two identical graph embedding networks and converted into corresponding high-dimensional vectors. The parameters of the graph embedding networks are trained by minimizing an objective function, the cosine distance of the two high-dimensional vectors is computed, and the similarity of the two binary functions is output. This scheme mainly addresses the information loss caused by graph-embedding-based similarity detection methods that ignore the different influences of successor nodes and neighbor nodes on a vertex.
However, none of the existing binary function similarity detection methods above can effectively handle complex functions and large-scale binary code bases; they also over-rely on their datasets and lack scalability.
Disclosure of Invention
The technical problem the invention aims to solve is: in view of the defects of the prior art, to provide a binary function similarity detection method and system based on graph Transformers. The structural information of the CFG is extracted with a graph Transformer model to generate a graph embedding, and similarity detection of binary functions is realized accordingly, so that complex function structures can be handled, large-scale binary code bases can be accommodated, and high detection accuracy, efficiency, and extensibility are achieved.
In order to solve the technical problems, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a binary function similarity detection method based on graph Transformers, including the following steps:
S1, acquiring the two binary functions to be detected, and performing static analysis on each of them to obtain its control flow graph and the assembly instruction segment of each node;
S2, normalizing the opcodes of the assembly instructions in the assembly instruction segments;
S3, encoding each node in the control flow graph with an unsupervised contrastive learning method to generate a feature vector for each node;
S4, extracting the structural information of the control flow graph with a graph Transformer model, and aggregating the feature information of the nodes with the structural information of the control flow graph to generate a graph embedding;
S5, concatenating the feature vectors (graph embeddings) of the two binary functions and feeding them into a fully connected layer to obtain the probability that the two binary functions are similar.
Further, the normalization of the opcodes of the assembly instructions in step S2 specifically includes:
replacing all memory addresses with "mem";
renaming all general-purpose registers to reg{size}, where size is the number of bytes operated on by the opcode;
replacing all numeric constants with "num";
replacing all character strings with "str";
replacing the function names of all non-system functions and non-library functions with "fun".
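As a hedged illustration of the normalization scheme above, the following sketch classifies x86-64 operands with regular expressions. The helper name `normalize_operand`, the register patterns, and the fallback that treats remaining symbols as non-library function names are assumptions of this example, not the patent's implementation.

```python
import re

def normalize_operand(op: str, size_bytes: int) -> str:
    """Normalize one assembly operand per the scheme above (hypothetical helper)."""
    if re.fullmatch(r"\[.*\]", op):          # memory reference, e.g. "[rbp-0x8]"
        return "mem"
    if re.fullmatch(r"[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r\d+[bwd]?", op):
        return f"reg{size_bytes}"            # general-purpose register
    if re.fullmatch(r"-?(0x[0-9a-f]+|\d+)", op):
        return "num"                         # numeric constant
    if op.startswith('"'):
        return "str"                         # string literal
    return "fun"                             # assume remaining symbols name non-system functions

print(normalize_operand("[rbp-0x8]", 8))   # mem
print(normalize_operand("rax", 8))         # reg8
print(normalize_operand("0x10", 8))        # num
```

In practice the operand classes would come from a disassembler's operand types rather than regexes; the regexes only make the sketch self-contained.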
Further, step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the feature vector of each node, specifically as follows:
S301, acquiring the assembly instruction segments $x_1, x_2, \dots, x_m$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ denotes the number of assembly instruction segments;
S302, then creating a positive example for each $x_i$ by applying different, independently sampled dropout masks. Concretely, when the encoder encodes the node features, the same input is fed into the encoder twice with different dropout masks and attention probabilities in the fully connected layers, yielding two embeddings of $x_i$ under different dropout masks $z_i$ and $z_i'$, namely:

$$h_i^{z_i} = f_\theta(x_i, z_i), \qquad h_i^{z_i'} = f_\theta(x_i, z_i'),$$

where $z_i$, $z_i'$ are two different random dropout masks, $h_i^{z_i}$, $h_i^{z_i'}$ are respectively the embeddings of $x_i$ under the dropout masks $z_i$ and $z_i'$, and $f_\theta$ is the node encoder.

For a mini-batch of $N$ pairs, the training objective for $x_i$ is:

$$\ell_i = -\log \frac{e^{\operatorname{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\operatorname{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}},$$

where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is a temperature hyperparameter, and $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity function.
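The dropout-as-augmentation contrastive objective described above can be sketched in NumPy as follows. The toy single-layer encoder, the mask sampling, and all dimensions are illustrative assumptions; the embodiment's actual node encoder is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, drop_mask):
    """Toy encoder f_theta: one linear layer followed by a dropout mask z."""
    return (x @ W) * drop_mask

def simcse_loss(x, W, p=0.1, tau=0.05):
    """Contrastive loss with dropout as augmentation: the same batch is encoded
    twice with independently sampled dropout masks; matching rows are the
    positive pairs, all other rows in the batch are negatives."""
    n, d = x.shape[0], W.shape[1]
    z1 = rng.random((n, d)) > p
    z2 = rng.random((n, d)) > p          # second, independently sampled mask
    h1, h2 = encode(x, W, z1), encode(x, W, z2)
    h1 = h1 / np.linalg.norm(h1, axis=1, keepdims=True)
    h2 = h2 / np.linalg.norm(h2, axis=1, keepdims=True)
    sim = h1 @ h2.T / tau                # cosine similarities / temperature
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))   # average of the per-segment losses

x = rng.standard_normal((8, 32))         # 8 instruction-segment feature vectors
W = rng.standard_normal((32, 16))
print(simcse_loss(x, W) > 0)
```

The loss is minimized when the two dropout views of the same segment agree and differ from the other segments in the batch.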
Further, step S4 specifically refers to: inputting the graph data into the Transformer model, adding encoding information, aggregating the feature vectors of the nodes and capturing the structural information of the graph, and finally outputting the embedded feature vector of the graph.
Further, the methods for adding the encoding information in step S4 include:
one scheme adds a new positional encoding to the input of the Transformer model to help encode distance-aware information;
alternatively, a scheme adds a new spatial encoding as a bias term in the self-attention module of the Transformer model, so as to accurately capture the spatial dependencies in the graph;
alternatively, both methods are used together: adding a new positional encoding to the Transformer input and adding a new spatial encoding as a bias term in the self-attention module.
Furthermore, one method of adding a new positional encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are precomputed; they are defined by the factorization of the graph Laplacian matrix:

$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U,$$

where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ are respectively the eigenvalues and eigenvectors of the Laplacian matrix, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) The positional encoding is added to the input data, with $\lambda_i$ denoting the Laplacian positional encoding of node $i$, i.e., the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
First, the node feature vector $\alpha_i$ of each node $i$ in the control flow graph is embedded into the $d$-dimensional hidden features $h_i^{0}$ via a linear projection:

$$h_i^{0} = A^{0} \alpha_i + a^{0},$$

where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into the $d$-dimensional hidden features;

the precomputed node positional encodings $\lambda_i$ are then embedded into $d$-dimensional feature vectors via a linear projection and added to the node features $h_i^{0}$, generating the new node features $h_i$:

$$\lambda_i^{0} = B^{0} \lambda_i + b^{0}, \qquad h_i = h_i^{0} + \lambda_i^{0},$$

where $\lambda_i^{0}$ denotes the positional-encoding feature vector after linear projection, and $B^{0}$ and $b^{0}$ are the parameters of the linear projection into the $d$-dimensional feature vector.
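A minimal sketch of the Laplacian positional encoding described above, assuming the symmetric normalized Laplacian and NumPy's `eigh` (which returns eigenvalues in ascending order); the function name `laplacian_pe` and the choice of $k$ are illustrative.

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Per-node positional encoding: the eigenvectors of
    L = I - D^{-1/2} A D^{-1/2} for the k smallest non-trivial eigenvalues."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)     # eigenvalues in ascending order
    return vecs[:, 1:k + 1]              # drop the trivial first eigenvector

# 4-node path graph as a stand-in for a small CFG
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, 2)
print(pe.shape)    # (4, 2): one 2-dimensional position code per node
```

In a full model these $k$-dimensional codes would be linearly projected to $d$ dimensions and added to the node features, as in the equations above.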
Still further, the method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to its spatial relationship, and defining a function based on the connectivity between nodes in the graph to measure the spatial relationship between two nodes in the graph; specifically:
For an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set up to measure the spatial relationship between nodes $v_i$ and $v_j$ in $G$; the function $\phi$ is defined through the connectivity between the nodes of $G$:

if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest path between $v_i$ and $v_j$;

if the two nodes are not connected, the output value of $\phi$ is set to a special value.

A learnable scalar is assigned to each output value and used as a bias term in the self-attention module. Specifically, the element $A_{ij}$ of the Query-Key product matrix $A$ is expressed as:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)},$$

where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ denotes the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$, shared across all layers.
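The spatial-encoding bias can be sketched as follows: shortest-path distances are computed by BFS, each distance bucket indexes a scalar bias, and the bias is added to the scaled Query-Key scores. The sentinel value for disconnected pairs, the distance clipping, and the zero-initialized bias table are assumptions of this sketch; in a trained model the bias table would be a learned parameter.

```python
import numpy as np
from collections import deque

def spd_matrix(adj, unreachable=-1):
    """All-pairs shortest-path distances via BFS; unreachable pairs keep a
    sentinel value (the 'special value' in the text)."""
    n = len(adj)
    dist = np.full((n, n), unreachable, dtype=int)
    for s in range(n):
        dist[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.nonzero(adj[u])[0]:
                if dist[s, v] == unreachable:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, max_dist = 4, 8, 8
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
h = rng.standard_normal((n, d))
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b = np.zeros(max_dist + 2)        # one scalar per distance bucket; last = "unreachable"
phi = spd_matrix(adj)
idx = np.where(phi < 0, max_dist + 1, np.minimum(phi, max_dist))
scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(d) + b[idx]   # A_ij with spatial bias
attn = softmax(scores)
print(attn.shape)    # (4, 4)
```

Because the bias depends only on graph distance, a pair of nodes three hops apart receives the same learned offset wherever it occurs, which is how the attention becomes structure-aware.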
Further, in step S5 the feature vectors of the two binary functions, i.e., the graph embeddings of their control flow graphs, are concatenated and input into the fully connected layer to obtain the probability that the two binary functions are similar, specifically:

$$p = \mathrm{FC}(g_1 \,\Vert\, g_2),$$

where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\Vert$ denotes concatenation, and $p$ is the probability that the two functions are similar.
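Step S5 can be sketched as a single fully connected layer with a sigmoid over the concatenated graph embeddings. The layer shape, the sigmoid, and the helper name `similarity_prob` are assumptions consistent with producing a probability; the embodiment does not fix the head's depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def similarity_prob(g1, g2, W, b):
    """Concatenate the two graph embeddings and pass them through one
    fully connected layer with a sigmoid to get P(similar)."""
    return sigmoid(np.concatenate([g1, g2]) @ W + b)

rng = np.random.default_rng(0)
g1, g2 = rng.standard_normal(16), rng.standard_normal(16)  # graph embeddings
W, b = rng.standard_normal(32), 0.0                        # FC parameters
p = similarity_prob(g1, g2, W, b)
print(0.0 < p < 1.0)    # True: output is a valid probability
```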
In a second aspect, the present invention also provides a binary function similarity detection system based on graph Transformers, comprising a microprocessor and a memory that are interconnected, the microprocessor being programmed or configured to perform the steps of the binary function similarity detection method based on graph Transformers described above.
Further, the binary function similarity detection system based on graph Transformers specifically comprises:
a static analysis module for acquiring the two binary functions to be detected, and performing static analysis on each of them to obtain its control flow graph and the assembly instruction segment of each node;
a preprocessing module for normalizing the opcodes of the assembly instructions in the assembly instruction segments obtained by the static analysis;
a semantic perception module for encoding each node in the control flow graph with an unsupervised contrastive learning method, generating a feature vector for each node and extracting the semantic information of the assembly instructions;
a structure perception module for extracting the structural information of the control flow graph with a graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating a graph embedding;
and a similarity probability calculation module for concatenating the feature vectors of the two binary functions, i.e., the graph embeddings of their control flow graphs, and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar.
In the invention, the following terms are used:
transgramers: the neural network structure is mainly used for processing sequence data in the field of deep learning. The core idea is to use a Self-attention mechanism (Self-Attention Mechanism) to capture dependencies in the sequence, without relying on traditional recursive or convolutional structures. The characteristics are as follows: 1) Self-attention mechanism: the kernel component of the Transformers allows the model to assign a different attention weight to each element in the sequence. 2) Position coding: the order information of the elements in the model sequence is given by position coding. 3) Multi-head attention: each transducer layer contains multiple attention "heads" and the model can focus on multiple positions in the sequence simultaneously. 4) Feedforward neural network: after each transducer layer, there is a feed-forward neural network for further processing the self-attention output.
Graph transformations: graph Transformers is a deep learning model architecture based on convectors, and aims to process graph structure data. The method combines the self-attention mechanism of the traditional Transformers architecture and special processing skills for graph data, thereby being capable of capturing complex structures and relations in the captured graph. The characteristics are as follows: 1) Self-attention and graph structure: unlike conventional transformations, the self-attention mechanism of graph transformations is to consider the relationships of nodes in the graph, enabling the model to focus on other nodes that are related to or have an impact on the current node. 2) Position coding: the structural context of the node in the graph or its specific topological properties are provided.
The invention has the following beneficial effects:
Compared with the prior art, by using graph Transformers for binary function similarity detection, the method can handle complex function structures, can accommodate large-scale binary code bases, and achieves high detection accuracy and efficiency.
The method can be widely applied in many fields, such as binary code analysis, malware detection, binary vulnerability discovery, and software copyright protection, and has broad application prospects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the invention or in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 is a basic flow diagram of a binary function similarity detection method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to examples and figures, which are not intended to limit the scope of the invention.
Example 1
As shown in Fig. 1, the present embodiment provides a binary function similarity detection method based on graph Transformers, which includes the following steps:
S1, acquiring the two binary functions to be detected, and performing static analysis on each of them to obtain its control flow graph and the assembly instruction segment of each node.
S2, normalizing the opcodes of the assembly instructions in the assembly instruction segments; specifically:
replacing all memory addresses with "mem";
renaming all general-purpose registers to reg{size}, where size is the number of bytes operated on by the opcode;
replacing all numeric constants with "num";
replacing all character strings with "str";
replacing the function names of all non-system functions and non-library functions with "fun".
S3, encoding each node in the control flow graph with an unsupervised contrastive learning method, extracting the semantic information of the instructions, and generating a feature vector for each node; specifically:
S301, acquiring the assembly instruction segments $x_1, x_2, \dots, x_m$ of the nodes in the control flow graph, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ denotes the number of assembly instruction segments;
S302, then creating a positive example for each $x_i$ by applying different, independently sampled dropout masks. Concretely, when the encoder encodes the node features, the same input is fed into the encoder twice with different dropout masks and attention probabilities in the fully connected layers, yielding two embeddings of $x_i$ under different dropout masks $z_i$ and $z_i'$, namely:

$$h_i^{z_i} = f_\theta(x_i, z_i), \qquad h_i^{z_i'} = f_\theta(x_i, z_i'),$$

where $z_i$, $z_i'$ are two different random dropout masks, $h_i^{z_i}$, $h_i^{z_i'}$ are respectively the embeddings of $x_i$ under the dropout masks $z_i$ and $z_i'$, and $f_\theta$ is the node encoder.

For a mini-batch of $N$ pairs, the training objective for $x_i$ is:

$$\ell_i = -\log \frac{e^{\operatorname{sim}(h_i^{z_i},\, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\operatorname{sim}(h_i^{z_i},\, h_j^{z_j'})/\tau}},$$

where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ indexes the mini-batch, $h_j^{z_j'}$ is the embedding of $x_j$ under dropout mask $z_j'$, $\tau$ is a temperature hyperparameter, and $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity function.
S4, extracting the structural information of the control flow graph with a graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating a graph embedding. Specifically: the graph data are input into the Transformer model, encoding information is added, the feature vectors of the nodes are aggregated and the structural information of the graph is captured, and finally the embedded feature vector of the graph is output.
The methods for adding the encoding information include:
adding a new positional encoding to the input of the Transformer model to help encode distance-aware information;
adding a new spatial encoding as a bias term in the self-attention module of the Transformer model, so as to accurately capture the spatial dependencies in the graph;
alternatively, both methods are used together: adding a new positional encoding to the Transformer input and adding a new spatial encoding as a bias term in the self-attention module.
The method of adding a new positional encoding to the input of the Transformer model is to encode node position information for an arbitrary graph using Laplacian eigenvectors, specifically as follows:
(1) The Laplacian eigenvectors of all graphs in the dataset are precomputed; they are defined by the factorization of the graph Laplacian matrix:

$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U,$$

where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ are respectively the eigenvalues and eigenvectors of the Laplacian matrix, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) The positional encoding is added to the input data, with $\lambda_i$ denoting the Laplacian positional encoding of node $i$, i.e., the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$. The specific method is as follows:
First, the node feature vector $\alpha_i$ of each node $i$ in the control flow graph is embedded into the $d$-dimensional hidden features $h_i^{0}$ via a linear projection:

$$h_i^{0} = A^{0} \alpha_i + a^{0},$$

where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding into the $d$-dimensional hidden features;

the precomputed node positional encodings $\lambda_i$ are then embedded into $d$-dimensional feature vectors via a linear projection and added to the node features $h_i^{0}$, generating the new node features $h_i$:

$$\lambda_i^{0} = B^{0} \lambda_i + b^{0}, \qquad h_i = h_i^{0} + \lambda_i^{0},$$

where $\lambda_i^{0}$ denotes the positional-encoding feature vector after linear projection, and $B^{0}$ and $b^{0}$ are the parameters of the linear projection into the $d$-dimensional feature vector.
The method of adding a new spatial encoding as a bias term in the self-attention module of the Transformer model comprises assigning a learnable embedding to each node pair according to its spatial relationship, and defining a function based on the connectivity between nodes in the graph to measure the spatial relationship between two nodes in the graph; specifically:
For an arbitrary graph $G$, a function $\phi(v_i, v_j)$ is set up to measure the spatial relationship between nodes $v_i$ and $v_j$ in $G$; the function $\phi$ is defined through the connectivity between the nodes of $G$:

if the two nodes are connected, $\phi(v_i, v_j)$ is the shortest-path distance (SPD) between $v_i$ and $v_j$;

if the two nodes are not connected, the output value of $\phi$ is set to a special value.

A learnable scalar is assigned to each output value and used as a bias term in the self-attention module. Specifically, the element $A_{ij}$ of the Query-Key product matrix $A$ is expressed as:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)},$$

where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ denotes the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is the learnable scalar indexed by $\phi(v_i, v_j)$, shared across all layers.
S5, concatenating the feature vectors of the two binary functions and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar; specifically:

$$p = \mathrm{FC}(g_1 \,\Vert\, g_2),$$

where $\mathrm{FC}$ denotes the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the two functions, $\Vert$ denotes concatenation, and $p$ is the probability that the two functions are similar.
Example 2
This embodiment provides a binary function similarity detection system based on graph Transformers, which comprises a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to perform the steps of the binary function similarity detection method based on graph Transformers described above. The system specifically comprises:
a static analysis module for acquiring the two binary functions to be detected, and performing static analysis on each of them to obtain its control flow graph and the assembly instruction segment of each node;
a preprocessing module for normalizing the opcodes of the assembly instructions in the assembly instruction segments obtained by the static analysis;
a semantic perception module for encoding each node in the control flow graph with an unsupervised contrastive learning method, generating a feature vector for each node and extracting the semantic information of the assembly instructions;
a structure perception module for extracting the structural information of the control flow graph with a graph Transformer model, aggregating the feature information of the nodes with the structural information of the control flow graph, and generating a graph embedding;
and a similarity probability calculation module for concatenating the feature vectors of the two binary functions, i.e., the graph embeddings of their control flow graphs, and inputting them into a fully connected layer to obtain the probability that the two binary functions are similar.
Based on the same inventive concept, this embodiment is the system embodiment corresponding to the above method embodiment, and it can be implemented in cooperation with the above implementation. The related technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here.
The method provided by the embodiments of the invention can not only handle complex function structures but also accommodate large-scale binary code bases, with high detection accuracy and efficiency. The embodiments of the invention can be widely applied in many fields, such as binary code analysis, malware detection, binary vulnerability discovery, and software copyright protection, and have broad application prospects.
It is apparent that the above-described embodiments are merely preferred examples of the invention and do not limit its possible embodiments. Other variations and modifications will be apparent to a person of ordinary skill in the art in light of the foregoing description; it is neither necessary nor possible to enumerate all embodiments here. Any obvious variations or modifications derived from the above remain within the protection scope of the invention.

Claims (10)

1. A binary function similarity detection method based on graph Transformers, characterized by comprising the following steps:
S1, acquiring the two binary functions to be detected, and performing static analysis on each of them to obtain its control flow graph and the assembly instruction segment of each node;
S2, normalizing the opcodes of the assembly instructions in the assembly instruction segments;
S3, encoding each node in the control flow graph with an unsupervised contrastive learning method to generate a feature vector for each node;
S4, extracting the structural information of the control flow graph with a graph Transformer model, and aggregating the feature information of the nodes with the structural information of the control flow graph to generate a graph embedding;
S5, concatenating the feature vectors of the two binary functions and feeding them into a fully connected layer to obtain the probability that the two binary functions are similar.
2. The binary function similarity detection method based on graph Transformers according to claim 1, wherein the normalization of the operation codes of the assembly instructions in step S2 specifically comprises:
replacing all memory addresses with "mem";
renaming all general-purpose registers to "reg{size}", where size is the number of bytes operated on by the operation code;
replacing all numerical constants with "num";
replacing all character strings with "str";
replacing the function names of all non-system and non-library functions with "fun".
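As an illustrative sketch of the normalization in claim 2 (the register table, the literal patterns, and the treatment of hexadecimal literals as memory addresses are simplifying assumptions, not the patent's full rule set), the token replacement could be implemented as follows:

```python
import re

def normalize_instruction(ins: str) -> str:
    """Normalize one assembly instruction roughly as claim 2 describes.
    The register-to-size table below covers only a few x86 registers and
    is illustrative; hex literals are assumed to be memory addresses."""
    reg_sizes = {
        "rax": 8, "rbx": 8, "rcx": 8, "rdx": 8, "rsi": 8, "rdi": 8,
        "eax": 4, "ebx": 4, "ecx": 4, "edx": 4, "esi": 4, "edi": 4,
        "ax": 2, "bx": 2, "al": 1, "bl": 1,
    }
    # Split on whitespace/punctuation but keep the delimiters (capture group)
    tokens = re.split(r"([\s,\[\]+*]+)", ins)
    out = []
    for tok in tokens:
        if tok in reg_sizes:
            out.append(f"reg{reg_sizes[tok]}")           # reg{size}
        elif re.fullmatch(r"0x[0-9a-fA-F]+", tok):
            out.append("mem")                            # memory address
        elif re.fullmatch(r"-?\d+", tok):
            out.append("num")                            # numeric constant
        else:
            out.append(tok)                              # opcode / delimiter
    return "".join(out)
```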
3. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S3 encodes each node in the control flow graph, extracts the semantic information of the instructions, and generates the feature vectors of the nodes, specifically as follows:
S301, acquiring the assembly instruction segments of the nodes in the control flow graph, $X = \{x_1, x_2, \ldots, x_m\}$, where $x_i$ denotes the $i$-th assembly instruction segment and $m$ denotes the number of assembly instruction segments;
S302, creating a positive example for each $x_i$ by applying different, independently sampled dropout masks; the specific operation is that, when the node features are encoded by the encoder, the same input is fed into the encoder twice with different dropout masks and attention probabilities in the fully connected layer, obtaining two embeddings of $x_i$ under different dropout masks $z$ and $z'$, namely:
$$h_i = f(x_i, z), \quad h_i' = f(x_i, z')$$
where $z$ and $z'$ are two different random dropout masks, $h_i$ and $h_i'$ are the embeddings of $x_i$ under the dropout masks $z$ and $z'$, and $f$ is the node encoder;
The training objective for a mini-batch of $N$ pairs is:
$$\ell_i = -\log \frac{e^{\operatorname{sim}(h_i, h_i')/\tau}}{\sum_{j=1}^{N} e^{\operatorname{sim}(h_i, h_j')/\tau}}$$
where $\ell_i$ denotes the loss value of the $i$-th assembly instruction segment, $j$ denotes the index within the mini-batch, $h_j'$ is the embedding of $x_j$ under dropout mask $z'$, $\tau$ is the temperature hyper-parameter, and $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity function.
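The training objective of claim 3 is the SimCSE-style in-batch InfoNCE loss; a minimal numpy sketch, assuming `h` and `h_prime` hold the two dropout-mask embeddings of the same batch of segments (row $i$ of each is the same segment) and an assumed default temperature of 0.05:

```python
import numpy as np

def info_nce_loss(h, h_prime, tau=0.05):
    """In-batch contrastive loss: for each segment i, the embedding h[i]
    should match h_prime[i] (its other-dropout-mask view) against all
    other in-batch embeddings. tau is the temperature hyper-parameter."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)          # cosine sim via
    h_p = h_prime / np.linalg.norm(h_prime, axis=1, keepdims=True)  # unit rows
    sim = h @ h_p.T / tau                      # N x N similarity / temperature
    sim = sim - sim.max(axis=1, keepdims=True) # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))      # -log p(positive pair), averaged
```

With aligned views the loss is near zero; with misaligned views it is large, which is the behavior the training objective rewards.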
4. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S4 specifically refers to: inputting the graph data into the Transformers model, adding coding information, aggregating the feature vectors of the nodes and capturing the structural information of the graph, and finally outputting the graph-embedding feature vector.
5. The binary function similarity detection method based on graph Transformers according to claim 4, wherein the method for adding the coding information in step S4 comprises: adding a new position code to the input of the Transformers model to help encode distance-aware information, and/or adding a new spatial code as a bias term in the self-attention module of the Transformers model to accurately capture the spatial dependencies in the graph.
6. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method for adding a new position code to the input of the Transformers model is to encode node position information for any graph by using Laplacian eigenvectors, specifically comprising:
(1) Pre-computing the Laplacian eigenvectors of all graphs in the dataset, the Laplacian eigenvectors being defined by the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U$$
where $I$ is the identity matrix, $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, $\Lambda$ and $U$ are respectively the eigenvalues and eigenvectors of the Laplacian matrix, and $\Delta$ denotes the Laplacian matrix of the graph;
(2) Adding position codes to the input data, using $\lambda_i$ to denote the Laplacian position code of node $i$, i.e. the $k$ smallest non-trivial eigenvectors of the graph Laplacian matrix $\Delta$; the specific method is as follows:
First, for each node $i$ in the control flow graph, the node feature vector $\alpha_i$ is embedded into a $d$-dimensional hidden feature $\hat{h}_i^{0}$:
$$\hat{h}_i^{0} = A^{0} \alpha_i + a^{0}$$
where $A^{0}$ and $a^{0}$ are the parameters of the linear projection layer embedding the hidden feature;
Then the pre-computed node position code $\lambda_i$ is embedded via a linear projection into a $d$-dimensional feature vector and added to the node feature $\hat{h}_i^{0}$, generating the new node feature $h_i^{0}$:
$$h_i^{0} = \hat{h}_i^{0} + C^{0} \lambda_i + c^{0}$$
where $C^{0} \lambda_i + c^{0}$ represents the position-code feature vector after linear projection, and $C^{0}$ and $c^{0}$ are the linear projection parameters embedding the $d$-dimensional feature vector.
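Step (1) of claim 6 can be sketched in a few lines of numpy; the choice of $k$ (how many non-trivial eigenvectors to keep) is a hyper-parameter assumed here, and the sketch assumes a connected graph with no isolated nodes:

```python
import numpy as np

def laplacian_pe(A, k=2):
    """Laplacian positional encoding: factorize the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2} and keep the k eigenvectors
    with the smallest non-trivial eigenvalues as per-node position codes.
    Assumes A is the adjacency matrix of a connected, undirected graph."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    eigval, eigvec = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]              # skip the trivial (constant) eigenvector
```

Each row of the returned matrix is one node's position code, which step (2) then projects linearly and adds to the node feature.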
7. The binary function similarity detection method based on graph Transformers according to claim 5, wherein the method of adding a new spatial code as a bias term in the self-attention module of the Transformers model comprises assigning a learnable embedding to each node pair according to their spatial relationship, and defining a function according to the connectivity between nodes in the graph to measure the spatial relationship between two nodes in the graph; specifically:
For any graph $G$, a function $\phi(v_i, v_j)$ is set to measure the spatial relationship between nodes $v_i$ and $v_j$ in $G$; the function $\phi$ is defined by the connectivity between the nodes in $G$;
if the two nodes are connected, $\phi(v_i, v_j)$ is the distance of the shortest path between $v_i$ and $v_j$;
if the two nodes are not connected, the output value of $\phi$ is set to a special value $-1$;
A learnable scalar is assigned to each output value and used as a bias term in the self-attention module; specifically, the element $A_{ij}$ of the Query-Key product matrix $A$ is expressed as:
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)}$$
where $W_Q$ and $W_K$ are weight matrices, $(h_j W_K)^{T}$ is the matrix transpose, $\sqrt{d}$ is the normalization term, $h_i$ and $h_j$ denote the feature vectors of nodes $v_i$ and $v_j$, and $b_{\phi(v_i, v_j)}$ is a learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
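The spatial bias of claim 7 can be sketched as a shortest-path lookup into a table of learnable scalars; the layout of `bias_table` (last entry reserved for disconnected pairs, indexed by the special value $-1$) is an assumption of this sketch:

```python
import numpy as np

def spatial_bias(A, bias_table, disconnected_idx=-1):
    """Compute the per-pair attention bias b_{phi(v_i, v_j)}: phi is the
    shortest-path distance for connected pairs and a special index for
    disconnected ones. bias_table holds the learnable scalars; its last
    entry is assumed reserved for disconnected pairs."""
    n = A.shape[0]
    # Floyd-Warshall shortest paths on the unweighted graph
    dist = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    phi = np.where(np.isinf(dist), disconnected_idx, dist).astype(int)
    return bias_table[phi]      # bias term added to each attention logit
```

The returned matrix is added elementwise to the Query-Key logits before the softmax, so attention can favor or penalize node pairs by graph distance.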
8. The binary function similarity detection method based on graph Transformers according to claim 1, wherein step S5 concatenates the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, and inputs them into the fully connected layer to obtain the probability that the two binary functions are similar, specifically as follows:
$$p = \mathrm{FC}(g_1 \oplus g_2)$$
where $\mathrm{FC}$ represents the fully connected layer, $g_1$ and $g_2$ are the feature vectors of the functions, $\oplus$ denotes concatenation, and $p$ is the probability that the two functions are similar.
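A minimal sketch of claim 8's similarity head, assuming a single fully connected layer with a sigmoid output (the parameters `W` and `b` would come from training; here they are placeholders):

```python
import numpy as np

def similarity_probability(g1, g2, W, b):
    """Concatenate the two graph embeddings, apply a fully connected
    layer, and squash the logit to a probability with a sigmoid.
    W (shape: len(g1)+len(g2)) and b are trained parameters."""
    z = np.concatenate([g1, g2])          # g1 (+) g2
    logit = W @ z + b
    return 1.0 / (1.0 + np.exp(-logit))   # probability in (0, 1)
```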
9. A binary function similarity detection system based on graph Transformers, comprising a microprocessor and a memory which are interconnected, characterized in that the microprocessor is programmed or configured to perform the steps of the binary function similarity detection method based on graph Transformers according to any one of claims 1-7.
10. The binary function similarity detection system based on graph Transformers according to claim 9, specifically comprising:
a static analysis module, used for acquiring two binary functions to be detected, and performing static analysis on each acquired binary function to obtain its control flow graph and the assembly instruction segment of each node;
a preprocessing module, used for normalizing the operation codes of the assembly instructions in the assembly instruction segments obtained by the static analysis;
a semantic perception module, used for encoding each node in the control flow graph through an unsupervised contrastive learning method and generating the feature vector of each node;
a structure perception module, used for extracting the structural information of the control flow graph by using a graph Transformers model, and aggregating the feature information of the nodes with the structural information of the control flow graph to generate a graph embedding;
and a similarity probability calculation module, used for concatenating the feature vectors of the two binary functions, i.e. the graph embeddings of their control flow graphs, and inputting them into the fully connected layer to obtain the probability that the two binary functions are similar.
CN202310931335.4A 2023-07-27 2023-07-27 Binary function similarity detection method and system based on graph Transformers Active CN116663004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310931335.4A CN116663004B (en) 2023-07-27 2023-07-27 Binary function similarity detection method and system based on graph Transformers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310931335.4A CN116663004B (en) 2023-07-27 2023-07-27 Binary function similarity detection method and system based on graph Transformers

Publications (2)

Publication Number Publication Date
CN116663004A true CN116663004A (en) 2023-08-29
CN116663004B CN116663004B (en) 2023-09-29

Family

ID=87720918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310931335.4A Active CN116663004B (en) 2023-07-27 2023-07-27 Binary function similarity detection method and system based on graph Transformers

Country Status (1)

Country Link
CN (1) CN116663004B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688150A (en) * 2019-09-03 2020-01-14 华中科技大学 Binary file code search detection method and system based on tensor operation
CN112163226A (en) * 2020-11-30 2021-01-01 中国人民解放军国防科技大学 Binary similarity detection method based on graph automatic encoder
CN112733137A (en) * 2020-12-24 2021-04-30 哈尔滨工业大学 Binary code similarity analysis method for vulnerability detection
US20210216577A1 (en) * 2020-01-13 2021-07-15 Adobe Inc. Reader-retriever approach for question answering
CN115113877A (en) * 2022-07-06 2022-09-27 上海交通大学 Cross-architecture binary code similarity detection method and system
CN115858002A (en) * 2023-02-06 2023-03-28 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN116032654A (en) * 2023-02-13 2023-04-28 山东省计算中心(国家超级计算济南中心) Firmware vulnerability detection and data security management method and system
WO2023072421A1 (en) * 2021-10-29 2023-05-04 NEC Laboratories Europe GmbH System and method for inductive learning on graphs with knowledge from language models


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
崔宝江; 马丁; 郝永乐; 王建新: "Binary file comparison technique based on basic-block signatures and jump relations", Journal of Tsinghua University (Science and Technology), no. 10 *
王工博; 蒋烈辉; 司彬彬; 董卫宇: "Similarity detection of firmware stack overflow vulnerabilities based on stack structure recovery", Journal of Information Engineering University, no. 02, pages 1-10 *
陈昱; 刘中金; 赵威威; 马原; 石志强; 孙利民: "A large-scale cross-platform homologous binary file retrieval method", Journal of Computer Research and Development, no. 07, pages 1-10 *

Also Published As

Publication number Publication date
CN116663004B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN111459491B (en) Code recommendation method based on tree neural network
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
Gao et al. Semantic learning based cross-platform binary vulnerability search for IoT devices
Lee et al. Visual question answering over scene graph
CN115951883B (en) Service component management system of distributed micro-service architecture and method thereof
CN116336400B (en) Baseline detection method for oil and gas gathering and transportation pipeline
CN115344863A (en) Malicious software rapid detection method based on graph neural network
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN116405326A (en) Information security management method and system based on block chain
CN113904844B (en) Intelligent contract vulnerability detection method based on cross-mode teacher-student network
Kim et al. A comparative study on performance of deep learning models for vision-based concrete crack detection according to model types
CN114723003A (en) Event sequence prediction method based on time sequence convolution and relational modeling
CN116861431B (en) Malicious software classification method and system based on multichannel image and neural network
CN116663004B (en) Binary function similarity detection method and system based on graph Transformers
CN115240120B (en) Behavior identification method based on countermeasure network and electronic equipment
CN117009968A (en) Homology analysis method and device for malicious codes, terminal equipment and storage medium
CN114997360B (en) Evolution parameter optimization method, system and storage medium of neural architecture search algorithm
CN116958809A (en) Remote sensing small sample target detection method for feature library migration
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN115758362A (en) Multi-feature-based automatic malicious software detection method
CN116226864A (en) Network security-oriented code vulnerability detection method and system
CN116628695A (en) Vulnerability discovery method and device based on multitask learning
Li et al. STADE-CDNet: Spatial–Temporal Attention With Difference Enhancement-Based Network for Remote Sensing Image Change Detection
CN113986251A (en) GUI prototype graph code conversion method based on convolution and cyclic neural network
CN114091021A (en) Malicious code detection method for electric power enterprise safety protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant