CN115587358A - Binary code similarity detection method and device and storage medium - Google Patents

Binary code similarity detection method and device and storage medium

Info

Publication number
CN115587358A
Authority
CN
China
Prior art keywords
features
code
control flow
node
similarity
Prior art date
Legal status
Pending
Application number
CN202110761988.3A
Other languages
Chinese (zh)
Inventor
张玉亭
樊期光
彭华熹
刘祖臣
石松泉
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202110761988.3A
Publication of CN115587358A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software


Abstract

The invention discloses a binary code similarity detection method, device, and storage medium, wherein the method comprises the following steps: obtaining a control flow graph of a function of a binary firmware file; extracting semantic information to obtain code block embedding vectors of the control flow graph; obtaining deep semantic features of the control flow graph from the code block embedding vectors, and determining order-aware features of the code block embedding vectors; fusing the deep semantic features and the order-aware features to obtain a graph embedding vector; and computing function similarity from the graph embedding vectors. By adopting the invention, detection accuracy is improved: cross-architecture semantic feature extraction can be extended arbitrarily, and the accuracy of binary code similarity comparison is improved.

Description

Binary code similarity detection method and device and storage medium
Technical Field
The present invention relates to the field of information security, and in particular, to a method and an apparatus for detecting similarity of binary codes, and a storage medium.
Background
Binary code analysis is one of the most important research areas in the field of information security; one class of goals is to detect similar binary functions without access to the source code. Because Internet of Things devices are heterogeneous, the same program compiles to different functions on different platforms. To find relationships between vulnerabilities, the relationships between the codes must be found, in order to determine whether vulnerabilities, malicious code plagiarism, and the like exist.
Current binary file similarity comparison schemes fall into two classes according to the features they use: traditional methods and deep learning-based methods.
Traditional methods count features such as the number of arithmetic instructions, the number of function calls, string constants, and numeric constants from the assembly code, and then compute the similarity of two binary files with a graph matching algorithm. Representative traditional methods include SIGMA and discovRE.
Deep learning-based methods extract function features from the assembly instructions or the control flow graph through a deep neural network, and finally compute the similarity of binary files using cosine distance. Methods based on binary instructions include SAFE and asm2vec; methods based on control flow graphs include GEMINI and SANN.
These methods share a defect: the accuracy of current binary file similarity comparison still leaves room for improvement.
Disclosure of Invention
The invention provides a binary code similarity detection method, device, and storage medium, which improve the accuracy of binary file similarity comparison.
The invention provides the following technical scheme:
a binary code similarity detection method comprises the following steps:
acquiring a control flow chart of a function of a binary firmware file;
extracting semantic information to obtain a code block embedded vector of the control flow chart;
the depth semantic features of the control flow chart are obtained by embedding the vectors into the code blocks, and the sequential perception features of the embedded vectors of the code blocks are determined;
fusing the depth semantic features and the sequential perception features to obtain a graph embedding vector;
function similarity is calculated by the graph embedding vector.
In implementation, the control flow graph of the function of the binary firmware file is obtained through IDA.
In implementation, the code block embedding vectors of the control flow graph are obtained by extracting semantic information with BERT.
In an implementation, the method further comprises:
using link prediction to learn the link relations among nodes in the code, and using node classification and node clustering to learn node category information of different platforms.
In implementation, learning the link relations among nodes in the code through link prediction, and learning node category information of different platforms through node classification and node clustering, comprises:
adding graph-level tasks: link prediction (LP), node classification (BCf), and node clustering (BCt), wherein LP judges that the link relation between two blocks is one of: strong correlation, weak correlation, or complete irrelevance; BCf classifies the blocks and BCt clusters the blocks, to judge the platform, compiler, and optimization option to which a block belongs.
In implementation, the deep semantic features of the control flow graph are obtained from the code block embedding vectors through a message passing neural network; and/or
the order-aware features of the code block embedding vectors are determined through a pointer network.
In an implementation, determining the order-aware features of the code block embedding vectors comprises:
taking as input the node features of the code together with their connection relations, and outputting the connection order of the nodes;
using the intermediate-layer features as order features to learn the order relations between nodes.
In an implementation, determining the order-aware features of the code block embedding vectors comprises:
adopting a Pointer network, using an LSTM to encode one branch of the same control flow graph;
during decoding, pointing at each time step to some element of the input as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node feature output at the previous time step;
at each time step, masking out the nodes already predicted as output, the decoder selecting the node with the maximum probability as the current output;
accepting input of arbitrary length, so that the order information of all code blocks in the function is captured.
In implementation, when computing function similarity from the graph embedding vectors, the method further comprises:
training with a twin (Siamese) neural network to reduce the graph embedding vector loss.
In implementation, computing function similarity from the graph embedding vectors comprises:
computing function similarity using cosine distance, traversing the functions in the firmware, weighting each function similarity score, and taking the weighted result as the firmware similarity score.
A binary code similarity detection apparatus comprises:
a processor, configured to read a program in a memory and perform the following process:
obtaining a control flow graph of a function of a binary firmware file;
extracting semantic information to obtain code block embedding vectors of the control flow graph;
obtaining deep semantic features of the control flow graph from the code block embedding vectors, and determining order-aware features of the code block embedding vectors;
fusing the deep semantic features and the order-aware features to obtain a graph embedding vector;
computing function similarity from the graph embedding vectors;
and a transceiver, configured to receive and transmit data under the control of the processor.
In implementation, the control flow graph of the function of the binary firmware file is obtained through IDA.
In implementation, the code block embedding vectors of the control flow graph are obtained by extracting semantic information with BERT.
In an implementation, the process further comprises:
using link prediction to learn the link relations among nodes in the code, and using node classification and node clustering to learn node category information of different platforms.
In implementation, learning the link relations among nodes in the code through link prediction, and learning node category information of different platforms through node classification and node clustering, comprises:
adding graph-level tasks: link prediction (LP), node classification (BCf), and node clustering (BCt), wherein LP judges that the link relation between two blocks is one of: strong correlation, weak correlation, or complete irrelevance; BCf classifies the blocks and BCt clusters the blocks, to judge the platform, compiler, and optimization option to which a block belongs.
In implementation, the deep semantic features of the control flow graph are obtained from the code block embedding vectors through a message passing neural network; and/or
the order-aware features of the code block embedding vectors are determined through a pointer network.
In an implementation, determining the order-aware features of the code block embedding vectors comprises:
taking as input the node features of the code together with their connection relations, and outputting the connection order of the nodes;
using the intermediate-layer features as order features to learn the order relations between nodes.
In an implementation, determining the order-aware features of the code block embedding vectors comprises:
adopting a Pointer network, using an LSTM to encode one branch of the same control flow graph;
during decoding, pointing at each time step to some element of the input as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node feature output at the previous time step;
at each time step, masking out the nodes already predicted as output, the decoder selecting the node with the maximum probability as the current output;
accepting input of arbitrary length, so that the order information of all code blocks in the function is captured.
In implementation, when computing function similarity from the graph embedding vectors, the process further comprises:
training with a twin (Siamese) neural network to reduce the graph embedding vector loss.
In implementation, computing function similarity from the graph embedding vectors comprises:
computing function similarity using cosine distance, traversing the functions in the firmware, weighting each function similarity score, and taking the weighted result as the firmware similarity score.
A binary code similarity detection apparatus comprises:
a flow chart module, configured to obtain a control flow graph of a function of a binary firmware file;
an embedded vector module, configured to extract semantic information to obtain code block embedding vectors of the control flow graph;
a semantic and sequence module, configured to obtain deep semantic features of the control flow graph from the code block embedding vectors, and to determine order-aware features of the code block embedding vectors;
a fusion module, configured to fuse the deep semantic features and the order-aware features to obtain a graph embedding vector;
and a similarity module, configured to compute function similarity from the graph embedding vectors.
In an implementation, the flow chart module is further configured to obtain the control flow graph of the function of the binary firmware file through IDA.
In implementation, the embedded vector module is further configured to obtain the code block embedding vectors of the control flow graph by extracting semantic information with BERT.
In implementation, the semantic and sequence module is further configured to learn the link relations among nodes in the code through link prediction, and to learn node category information of different platforms through node classification and node clustering.
In an implementation, the semantic and sequence module is further configured to:
add graph-level tasks: link prediction (LP), node classification (BCf), and node clustering (BCt), wherein LP judges that the link relation between two blocks is one of: strong correlation, weak correlation, or complete irrelevance; BCf classifies the blocks and BCt clusters the blocks, to judge the platform, compiler, and optimization option to which a block belongs.
In implementation, the semantic and sequence module is further configured to obtain the deep semantic features of the control flow graph through a message passing neural network, and/or to determine the order-aware features of the code block embedding vectors through a pointer network.
In an implementation, when determining the order-aware features of the code block embedding vectors, the semantic and sequence module is further configured to:
take as input the node features of the code together with their connection relations, and output the connection order of the nodes;
use the intermediate-layer features as order features to learn the order relations between nodes.
In an implementation, when determining the order-aware features of the code block embedding vectors, the semantic and sequence module is further configured to:
adopt a Pointer network, using an LSTM to encode one branch of the same control flow graph;
during decoding, point at each time step to some element of the input as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node feature output at the previous time step;
at each time step, mask out the nodes already predicted as output, the decoder selecting the node with the maximum probability as the current output;
accept input of arbitrary length, so that the order information of all code blocks in the function is captured.
In an implementation, the similarity module is further configured to train with a twin (Siamese) neural network to reduce the graph embedding vector loss when computing function similarity from the graph embedding vectors.
In an implementation, when computing function similarity from the graph embedding vectors, the similarity module is further configured to:
compute function similarity using cosine distance, traverse the functions in the firmware, weight each function similarity score, and take the weighted result as the firmware similarity score.
A computer-readable storage medium storing a computer program for executing the above-described binary code similarity detection method.
The invention has the following beneficial effects:
In the technical scheme provided by the embodiments of the invention, deep semantic features of the control flow graph are obtained from the code block embedding vectors, order-aware features of the code block embedding vectors are determined, the two are fused to obtain a graph embedding vector, and similarity comparison is then performed on the graph embedding vectors, which improves detection accuracy.
Furthermore, through the graph embedding vector extraction framework and the order information between nodes introduced by the pointer network, cross-architecture semantic feature extraction can be extended arbitrarily, and the accuracy of binary code similarity comparison is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a first schematic flow chart illustrating an implementation of a binary code similarity detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an implementation flow of a binary code similarity detection method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a semantic information extraction model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a sequential perception model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a binary code similarity detection apparatus according to an embodiment of the present invention.
Detailed Description
The inventors noticed during the course of the invention that analysis of the deep learning methods based on control flow graphs reveals at least one of the following defects:
1. Although deep learning-based features can learn semantic information, they cannot accurately capture the order information between blocks in the control flow graph.
2. Existing neural networks discard input code blocks beyond a certain threshold, resulting in information loss.
3. Although deep learning-based methods can learn high-level semantic information from a large amount of cross-platform data, unsupervised training can rely only on the model itself and cannot fully learn cross-platform information.
4. The prior art has low extensibility and poor flexibility.
In view of the above, the technical scheme provided by the embodiments of the invention is a cross-platform binary code similarity detection scheme that fuses deep semantic features and order-aware features, aiming at improving binary code similarity comparison accuracy and cross-platform extensibility. It addresses at least one, or a combination, of the following points:
1. The node order information in the control flow graph is learned accurately through the pointer network.
2. The pointer network accepts input of arbitrary length and can learn the information of all code blocks of the whole function, solving the loss of code block information caused by fixed-length input.
3. By adding node link relation, node classification, and node clustering information, the BERT network is trained with supervision, so that cross-platform information is learned in the code block vectors, solving the cross-platform problem of binary code similarity detection.
4. Function feature vectors are extracted automatically; extending to different platforms only requires adding training data and training labels, which improves the flexibility of the scheme.
On this basis, the technical solution provided in the embodiments of the present invention is a cross-platform binary code similarity detection scheme based on the fusion of deep semantic features and order-aware features. Specific implementations of the invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a first implementation flow of the binary code similarity detection method. As shown in the figure, the method may comprise:
Step 101: obtaining a control flow graph of a function of a binary firmware file;
Step 102: extracting semantic information to obtain code block embedding vectors of the control flow graph;
Step 103: obtaining deep semantic features of the control flow graph from the code block embedding vectors, and determining order-aware features of the code block embedding vectors;
Step 104: fusing the deep semantic features and the order-aware features to obtain a graph embedding vector;
Step 105: computing function similarity from the graph embedding vectors.
Fig. 2 is a schematic diagram of an implementation flow of the binary code similarity detection method. As shown in the figure, the method may comprise:
obtaining a binary firmware file;
unpacking the binary firmware file, disassembling the executable files contained in it, and extracting the control flow graph of each function in all executable files;
converting the control flow graphs into a natural language processing task and pre-training a BERT network to obtain block encodings;
constructing a message passing neural network on the block encodings, and training an order-aware model to obtain order-aware features;
training a deep neural network that fuses the deep semantic features and the order-aware features to obtain the embedding vector of a control flow graph;
comparing functions from different binary firmware files, and judging from the function similarity results whether the same functions exist in the firmware and whether the firmware codes are similar.
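The flow above can be sketched end to end on a toy control flow graph. Everything below is an invented stand-in for illustration (the helper names and their arithmetic do not appear in the patent; a real system would use IDA, a pre-trained BERT, an MPNN, and a pointer network in their place):

```python
# Hypothetical end-to-end sketch of the Fig. 2 flow on a toy CFG given as
# {node: [successor, ...]}.  All helpers are stand-ins, not the real models.

def block_embeddings(cfg):
    # Stand-in for BERT block encoding: one scalar "vector" per code block.
    return {n: [float(len(cfg[n]))] for n in cfg}

def deep_semantic_features(cfg, emb):
    # Stand-in for the message passing neural network: each node sums its
    # own embedding with those of its successors.
    return {n: [emb[n][0] + sum(emb[m][0] for m in cfg[n])] for n in cfg}

def order_features(cfg):
    # Stand-in for the pointer network: the node's position in a fixed order.
    return {n: [float(i)] for i, n in enumerate(sorted(cfg))}

def graph_embedding(cfg):
    # Fuse deep semantic and order-aware features into one graph vector.
    emb = block_embeddings(cfg)
    sem = deep_semantic_features(cfg, emb)
    order = order_features(cfg)
    return [sum(sem[n][0] for n in cfg), sum(order[n][0] for n in cfg)]
```

Two graph embeddings produced this way would then be compared with cosine distance, as in the last step of the flow.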
Specifically, the method mainly comprises the following parts:
1. Obtaining the control flow graphs of all functions of the binary firmware file.
2. A semantic information extraction module, which extracts semantic information with BERT to obtain code block embedding vectors. Link prediction, node classification, and node clustering information are added: link prediction learns the link relations among nodes in the code, while node classification and node clustering learn node category information of different platforms.
3. Learning deep semantic features from the code block embedding vectors through a message passing neural network; the module implementing this function is called the graph semantic information extraction module.
4. Obtaining order-aware features from the code block embedding vectors through a pointer network. The input is the node features of the code with their connection relations, the output is the connection order of the nodes, and the intermediate-layer features are used as order features to learn the order relations between nodes; the module implementing this function is called the order-aware feature extraction module.
5. Fusing the deep semantic features and the order-aware features to obtain a graph embedding vector.
6. Training the graph embedding vector extraction model with a twin (Siamese) neural network to reduce the graph embedding vector loss.
7. Computing graph similarity using cosine distance.
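The scoring in parts 5 to 7 reduces to straightforward vector arithmetic. A minimal sketch of the cosine comparison and of the weighted firmware score follows; the uniform default weighting is an assumption, since the patent does not fix a particular weighting scheme:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two graph embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def firmware_similarity(pairs, weights=None):
    # pairs: list of (embedding_a, embedding_b) for matched functions of two
    # firmware images.  weights: optional per-function weights; uniform by
    # default (the actual weighting is not specified in the patent).
    scores = [cosine_similarity(u, v) for u, v in pairs]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```

Identical embeddings score 1.0; orthogonal embeddings score 0.0, so the weighted sum stays in the same range as the per-function scores.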
The implementation of each step is described below with examples.
1. Obtaining the control flow graphs of all functions of the binary firmware file.
In implementation, the control flow graph of a function of the binary firmware file is obtained through IDA.
Specifically, the function control flow graph may be obtained through IDA (Interactive Disassembler) or similar tools.
2. Semantic information extraction module.
In implementation, the code block embedding vectors of the control flow graph are obtained by extracting semantic information with BERT.
Fig. 3 is a schematic diagram of the semantic information extraction model. As shown in the figure, in implementation, BERT (Bidirectional Encoder Representations from Transformers) may be used to pre-train on the control flow graphs to obtain the semantic information of the code blocks.
BERT was originally used in the NLP (Natural Language Processing) field to pre-train words and sentences. The task in this embodiment is similar to an NLP task: a code block (block) of the control flow graph can be regarded as a sentence, and the tokens in the block can be regarded as words. BERT is adopted to extract word embedding vectors from the control flow graph, mainly through 5 tasks: MNM (Masked Node Modeling), ANP (Adjacent Node Prediction), LP (Link Prediction), BCf (Block Classification), and BCt (Block Clustering).
Among them, MNM and ANP are two tasks similar to those in the original BERT paper. MNM is a token-level task: it masks tokens in a block and predicts them, in the same way as a masked language model. ANP is a block-level task: although a control flow graph has no linguistic order as in the NLP field, it is a directed graph and therefore has a topological order over its nodes, so all adjacent nodes in the control flow graph can be extracted and treated as adjacent sentences. These adjacent block pairs serve as positive examples of the ANP task, and non-adjacent block pairs within the same graph are randomly selected as negative examples.
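The ANP pair construction described above can be sketched on a toy graph. The edge-list representation and the random negative sampling policy below are illustrative assumptions:

```python
import random

def anp_pairs(edges, num_nodes, seed=0):
    # Positive pairs: adjacent blocks, i.e. the directed edges of the CFG.
    # Negative pairs: randomly drawn non-adjacent block pairs from the
    # same graph, one per positive pair.
    rng = random.Random(seed)
    positives = list(edges)
    edge_set = set(positives)
    negatives = []
    while len(negatives) < len(positives):
        a, b = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if a != b and (a, b) not in edge_set:
            negatives.append((a, b))
    return positives, negatives
```

A balanced positive/negative split like this is the usual setup for BERT-style next-sentence-type objectives.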
In implementation, the method may further comprise:
using link prediction to learn the link relations among nodes in the code, and using node classification and node clustering to learn node category information of different platforms.
In specific implementation, this may comprise:
adding graph-level tasks: link prediction (LP), node classification (BCf), and node clustering (BCt), wherein LP judges that the link relation between two blocks is one of: strong correlation, weak correlation, or complete irrelevance; BCf classifies the blocks and BCt clusters the blocks, to judge the platform, compiler, and optimization option to which a block belongs.
Specifically, in order to obtain more graph-level information, three auxiliary graph-level tasks, LP, BCf, and BCt, are added in implementation.
LP is similar in form to ANP, except that the pairs are selected differently. The purpose of the LP task is for the model to judge the link relation between two blocks, strong correlation, weak correlation, or complete irrelevance, so that the model learns this information as far as possible and thereby helps the graph-level tasks. Therefore, in the LP task, block pairs with a direct link relation in a graph are strongly correlated, block pairs with an indirect link relation in a graph are weakly correlated, and block pairs from different graphs are completely uncorrelated.
BCf and BCt are graph-level block classification tasks. In the scenario of this embodiment, the block information obtained under different platforms, different compilers, and different optimization options is different, and the aim in implementation is for the model to make the block embedding contain this information. BCf classifies blocks and BCt clusters blocks, judging which platform, which compiler, and which optimization option a block belongs to.
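The three-way LP labelling can be sketched as follows. This is a minimal illustration: the adjacency-dict representation, and treating unreachable same-graph pairs as uncorrelated, are assumptions not fixed by the patent:

```python
from collections import deque

def lp_label(adj, u, v, same_graph=True):
    # LP task labels: directly linked blocks -> "strong"; blocks reachable
    # only indirectly within the same graph -> "weak"; blocks from different
    # graphs -> "uncorrelated".  adj maps each node to its successors.
    if not same_graph:
        return "uncorrelated"
    if v in adj.get(u, ()):          # direct edge
        return "strong"
    seen, queue = {u}, deque([u])    # breadth-first search for an indirect path
    while queue:
        n = queue.popleft()
        for m in adj.get(n, ()):
            if m == v:
                return "weak"
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return "uncorrelated"            # unreachable pair (assumed uncorrelated)
```

On a chain 0 -> 1 -> 2, the pair (0, 1) is strong, (0, 2) is weak, and any pair drawn from two different graphs is uncorrelated.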
3. Graph semantic information extraction module.
In implementation, the deep semantic features of the control flow graph are obtained from the code block embedding vectors through a message passing neural network.
Specifically, after BERT pre-training, the graph semantic information of the control flow graph is calculated by using a message passing neural network MPNN.
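As a rough, framework-free illustration of one message-passing round over the control flow graph: the identity message and mean aggregation used here are simplifying assumptions, since an actual MPNN uses learned message and update functions:

```python
# Minimal sketch of one message-passing step, assuming each block
# embedding is a plain list of floats; a real MPNN would apply learned
# message/update networks instead of identity messages and addition.

def mpnn_step(embeddings, edges):
    """One round of message passing: each node adds the mean of its
    neighbours' embeddings to its own embedding."""
    n = len(embeddings)
    neighbours = {i: [] for i in range(n)}
    for u, v in edges:                 # treat CFG edges as undirected here
        neighbours[u].append(v)
        neighbours[v].append(u)
    out = []
    for i, emb in enumerate(embeddings):
        msgs = [embeddings[j] for j in neighbours[i]]
        if msgs:
            mean = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            mean = [0.0] * len(emb)    # isolated node: no incoming message
        out.append([a + b for a, b in zip(emb, mean)])
    return out

blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mpnn_step(blocks, [(0, 1), (1, 2)]))
```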
4. Sequential perception feature extraction module.
In implementation, the sequential perception features of the code block embedded vectors are determined by a pointer network.
In a specific implementation, determining sequential perceptual features of a code block embedded vector includes:
the input is the node features of the code with their connection relations, and the output is the nodes in their connection order;
the order relations between the nodes are learned, and the intermediate hidden-layer features are taken as the order features.
In a specific implementation, determining sequential perceptual features of a code block embedded vector includes:
adopting a Pointer network, and using an LSTM to encode a branch in the same control flow graph;
when decoding, the decoder points to a certain element in the input at each time step as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node features output at the previous time step;
at each time step, the nodes which have already been predicted and output are screened out, and the decoder selects the node with the maximum probability as the current output;
since the input may be of arbitrary length, the order information of all code blocks in the function is captured.
FIG. 4 is a schematic diagram of the sequential perception model. As shown in the figure, this module is used to extract information about the order of the nodes. Since there is no output vocabulary in this task, each element in the output sequence comes from an element in the input. Therefore, in an implementation, a Pointer network may be used, with an LSTM (Long Short-Term Memory) network encoding a branch in the same graph; when decoding, the decoder points to an element in the input at each time step as the output of the current time step.
The information considered by the decoder at each time step includes three items: the block semantic features from the BERT network, the relation features, and the node features output at the previous time step. At each time step, the network screens out the nodes that have already been predicted and output, and the decoder selects the node with the highest probability as the current output. The network input may be of arbitrary length, so the order information of all code blocks in the function can be captured. Through supervised learning, the connection information of the blocks in the control flow graph can be learned more accurately, and finally the hidden-layer features can be taken as the sequential perception features.
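The greedy decoding loop with masking described above can be sketched as follows; the dot-product scoring stands in for the pointer attention and LSTM state so the example stays self-contained, which is a simplification of the actual model:

```python
# Sketch of pointer-style greedy decoding with masking: at each time step
# the decoder selects the highest-scoring node not yet output. The
# dot-product score and "state = last output" rule are simplifying
# assumptions standing in for the LSTM decoder state and attention.

def pointer_decode(node_feats, init_state):
    """Return an output order over all input nodes, one node per step."""
    order, selected = [], set()
    state = init_state
    for _ in range(len(node_feats)):
        best, best_score = None, float("-inf")
        for i, feat in enumerate(node_feats):
            if i in selected:           # screen out already-output nodes
                continue
            score = sum(a * b for a, b in zip(state, feat))
            if score > best_score:
                best, best_score = i, score
        order.append(best)
        selected.add(best)
        state = node_feats[best]        # feed last output back as the state
    return order

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(pointer_decode(feats, [1.0, 0.0]))   # → [0, 1, 2]
```

Because each element of the output comes from the input, the loop handles inputs of arbitrary length and never discards a code block.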
5. And fusing the depth semantic features and the sequential perception features to obtain a graph embedding vector.
6. Training the graph embedding vector extraction model with a twin (Siamese) neural network to reduce the graph embedding vector loss.
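A minimal sketch of a twin-network training objective of this kind, assuming a cosine-similarity target of +1 for similar function pairs and -1 for dissimilar ones; the exact loss used by the embodiment is not specified, so this is an assumption:

```python
# Hedged sketch of the twin (Siamese) objective: the same embedding model
# encodes both functions, and the loss pulls similar pairs toward cosine
# similarity +1 and dissimilar pairs toward -1 (assumed target values).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def siamese_loss(emb_a, emb_b, label):
    """label is +1 for a similar pair, -1 for a dissimilar pair."""
    return (cosine(emb_a, emb_b) - label) ** 2

print(siamese_loss([1.0, 0.0], [1.0, 0.0], 1))    # identical pair → 0.0
print(siamese_loss([1.0, 0.0], [0.0, 1.0], -1))
```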
7. Calculating the function similarity through the graph embedding vectors, comprising:
calculating the function similarity by using the cosine distance, traversing the plurality of functions in the firmware, and weighting each function similarity score to serve as the firmware similarity score.
Specifically, the cosine distance may be used to calculate the function similarity; the plurality of functions in the firmware are traversed, and each function similarity score is weighted to serve as the firmware similarity score, thereby determining whether a vulnerability, malicious code plagiarism, or the like exists.
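The firmware-level scoring can be sketched as a weighted average of the per-function similarity scores; the choice of weights (for example, by function size) is an assumption, since the text does not specify the weighting:

```python
# Sketch of firmware-level scoring: traverse the function-similarity
# scores and combine them as a weighted average. The weighting scheme
# (e.g. by function size) is an assumption, not stated in the text.

def firmware_similarity(scores, weights):
    """Weight each function-similarity score into one firmware score."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# e.g. three matched functions, with the largest weighted most heavily
print(firmware_similarity([0.9, 0.8, 0.4], [3.0, 1.0, 1.0]))
```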
Based on the same inventive concept, the embodiment of the present invention further provides a binary code similarity detection apparatus and a computer-readable storage medium, and because the principles of these apparatuses for solving the problems are similar to the binary code similarity detection method, the implementation of these apparatuses can refer to the implementation of the method, and the repeated details are not repeated.
When the technical scheme provided by the embodiment of the invention is implemented, the implementation can be carried out as follows.
Fig. 5 is a schematic structural diagram of a binary code similarity detection apparatus, as shown in the figure, the apparatus includes:
the processor 500, which is used to read the program in the memory 520, executes the following processes:
acquiring a control flow chart of a function of a binary firmware file;
extracting semantic information to obtain a code block embedded vector of the control flow chart;
the depth semantic features of the control flow chart are obtained by embedding the vectors into the code blocks, and the sequential perception features of the embedded vectors of the code blocks are determined;
fusing the depth semantic features and the sequential perception features to obtain a graph embedding vector;
calculating function similarity through the graph embedding vector;
a transceiver 510 for receiving and transmitting data under the control of the processor 500.
In implementation, the control flow graph of the function of the binary firmware file is obtained through IDA.
In implementation, extracting semantic information to obtain the code block embedded vectors of the control flow graph is performed by using BERT.
In an implementation, the method further comprises the following steps:
the link relations among the nodes in the code are learned through link prediction, and the node category information of different platforms is learned through node classification and node clustering.
In implementation, learning the link relations among the nodes in the code through link prediction and learning the node category information of different platforms through node classification and node clustering comprises:
adding graph-level tasks link prediction LP, node classification BCf and node clustering BCt, wherein LP is used for judging that the link relation between two blocks is one of the following relations: strong correlation, weak correlation and complete non-correlation; BCf classifies the blocks and BCt clusters the blocks, and they are used for judging the platform, the compiler and the optimization option to which a block belongs.
In implementation, the deep semantic features of the control flow graph are obtained from the code block embedding vectors through a message passing neural network; and/or,
the sequential perception features of the code block embedded vectors are determined by a pointer network.
In an implementation, determining sequential perceptual features of a code block embedded vector includes:
the input is the node features of the code with their connection relations, and the output is the nodes in their connection order;
the order relations between the nodes are learned, and the intermediate hidden-layer features are taken as the order features.
In an implementation, determining sequential perceptual features of a code block embedded vector includes:
adopting a Pointer network, and using an LSTM to encode a branch in the same control flow graph;
when decoding, the decoder points to a certain element in the input at each time step as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node features output at the previous time step;
at each time step, the nodes which have already been predicted and output are screened out, and the decoder selects the node with the maximum probability as the current output;
since the input may be of arbitrary length, the order information of all code blocks in the function is captured.
In implementation, when the function similarity is calculated through the graph embedding vector, the method further comprises:
reducing the graph embedding vector loss by using twin neural network training.
In implementation, calculating the function similarity through the graph embedding vectors comprises:
calculating the function similarity by using the cosine distance, traversing the plurality of functions in the firmware, and weighting each function similarity score to serve as the firmware similarity score.
Wherein in fig. 5, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 500, and various circuits, represented by memory 520, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 510 may be a number of elements including a transmitter and a receiver that provide a means for communicating with various other apparatus over a transmission medium. The processor 500 is responsible for managing the bus architecture and general processing, and the memory 520 may store data used by the processor 500 in performing operations.
The embodiment of the invention also provides a binary code similarity detection device, which comprises:
the flow chart module is used for acquiring a control flow chart of a function of the binary firmware file;
the embedded vector module is used for extracting semantic information to obtain a code block embedded vector of the control flow chart;
the semantic and sequence module is used for acquiring the depth semantic features of the control flow chart by embedding the code block into the vector and determining the sequence perception features of the code block embedded vector;
the fusion module is used for fusing the depth semantic features and the sequence perception features to obtain a graph embedding vector;
and the similarity module is used for calculating the function similarity through the graph embedding vector.
In an implementation, the flow chart module is further configured to obtain a control flow chart of a function of the binary firmware file through the IDA.
In implementation, the embedded vector module is further configured to obtain a code block embedded vector of the control flow graph by extracting semantic information using BERT.
In implementation, the semantic and sequence module is further used for learning the link relations among the nodes in the code through link prediction, and learning the node category information of different platforms through node classification and node clustering.
In an implementation, the semantic and sequence module being further configured to learn the link relations among the nodes in the code through link prediction and to learn the node category information of different platforms through node classification and node clustering includes:
adding graph-level tasks link prediction LP, node classification BCf and node clustering BCt, wherein LP is used for judging that the link relation between two blocks is one of the following relations: strong correlation, weak correlation and complete non-correlation; BCf classifies the blocks and BCt clusters the blocks, and they are used for judging the platform, the compiler and the optimization option to which a block belongs.
In implementation, the semantic and sequence module is further configured to obtain a deep semantic feature of the control flow graph through a message passing neural network; and/or determining, by a network of pointers, a sequential perceptual feature of the code block embedded vector.
In an implementation, the semantic and sequence module, when determining the sequential perceptual features of the code block embedded vector, is further configured to:
the input is the node features of the code with their connection relations, and the output is the nodes in their connection order;
the order relations between the nodes are learned, and the intermediate hidden-layer features are taken as the order features.
In an implementation, the semantic and sequence module, when determining the sequential perceptual features of the code block embedded vector, is further configured to:
adopting a Pointer network, and using an LSTM to encode a branch in the same control flow graph;
when decoding, the decoder points to a certain element in the input at each time step as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node features output at the previous time step;
at each time step, the nodes which have already been predicted and output are screened out, and the decoder selects the node with the maximum probability as the current output;
since the input may be of arbitrary length, the order information of all code blocks in the function is captured.
In an implementation, the similarity module is further configured to reduce the graph embedding vector loss by using twin neural network training when calculating the function similarity through the graph embedding vector.
In an implementation, the similarity module, when calculating the function similarity through the graph embedding vectors, is further configured to:
calculate the function similarity by using the cosine distance, traverse the plurality of functions in the firmware, and weight each function similarity score to serve as the firmware similarity score.
For convenience of description, each part of the above-described apparatus is separately described as being functionally divided into various modules or units. Of course, the functionality of the various modules or units may be implemented in the same one or more pieces of software or hardware in practicing the invention.
The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program for executing the binary code similarity detection method.
The specific implementation can be seen in the implementation of the binary code similarity detection method.
In summary, in the technical solution provided by the embodiment of the present invention, the sequential perception module is provided to capture the order information between the blocks in the control flow graph more accurately through supervised training of the pointer network.
Furthermore, semantic information extraction is provided: link prediction, node classification and node clustering are introduced, so that cross-platform information is fully captured in the feature vectors through supervised learning.
Furthermore, in the binary file comparison method based on the fusion of deep semantic information and order features, the whole framework process enables the graph embedding vector to contain richer information, which can better improve the detection accuracy.
Furthermore, the scheme can be applied to the fields of vulnerability detection, code plagiarism, software theft and the like.
Therefore, with the proposed learning of order features between nodes through the pointer network, the sequential perception features can fully capture the order relations between the nodes through supervised learning.
Through supervised training of the pointer network, code block vectors of arbitrary length can be input, no code block in any function is discarded, and the order information and high-level semantic information of all code blocks can be learned.
By introducing link prediction, node classification and node clustering information and supervised training of BERT, the code block embedded vectors are rich in cross-platform and cross-compiler information, and the cross-architecture information can be extended arbitrarily.
By the framework for extracting graph embedding vectors and the introduction of the pointer network for learning the order information among nodes, the extraction of cross-architecture semantic features can be extended arbitrarily, and the accuracy of binary code similarity comparison can be improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (13)

1. A binary code similarity detection method is characterized by comprising the following steps:
acquiring a control flow chart of a function of a binary firmware file;
extracting semantic information to obtain a code block embedded vector of the control flow chart;
the depth semantic features of the control flow chart are obtained by embedding the vectors into the code blocks, and the sequential perception features of the embedded vectors of the code blocks are determined;
fusing the depth semantic features and the sequential perception features to obtain a graph embedding vector;
and calculating function similarity through the graph embedding vector.
2. The method of claim 1, wherein the control flow graph of the function that obtains the binary firmware file is obtained by an interactive disassembler IDA.
3. The method of claim 1, wherein extracting semantic information to obtain the code block embedded vectors of the control flow graph is performed by using Bidirectional Encoder Representations from Transformers (BERT).
4. The method of claim 3, further comprising:
the link relations among the nodes in the code are learned through link prediction, and the node category information of different platforms is learned through node classification and node clustering.
5. The method of claim 4, wherein learning the link relations among the nodes in the code through link prediction and learning the node category information of different platforms through node classification and node clustering comprises:
adding graph-level tasks link prediction LP, node classification BCf and node clustering BCt, wherein LP is used for judging that the link relation between two blocks is one of the following relations: strong correlation, weak correlation and complete non-correlation; BCf classifies the blocks and BCt clusters the blocks, and they are used for judging the platform, the compiler and the optimization option to which a block belongs.
6. The method of claim 1, wherein the deep semantic features of the control flow graph are obtained from the code block embedding vectors through a message passing neural network; and/or,
the sequential perception features of the code block embedded vectors are determined by a pointer network.
7. The method of claim 6, wherein determining sequential perceptual features of a code block embedded vector comprises:
the input is the node features of the code with their connection relations, and the output is the nodes in their connection order;
the order relations between the nodes are learned, and the intermediate hidden-layer features are taken as the order features.
8. The method of claim 7, wherein determining sequential perceptual features of a code block embedded vector comprises:
adopting a Pointer network, and using a long short-term memory (LSTM) network to encode a branch in the same control flow graph;
when decoding, the decoder points to a certain element in the input at each time step as the output of the current time step, wherein the information processed by the decoder at each time step includes: the block semantic features from the BERT network, the relation features, and the node features output at the previous time step;
at each time step, the nodes which have already been predicted and output are screened out, and the decoder selects the node with the maximum probability as the current output;
since the input may be of arbitrary length, the order information of all code blocks in the function is captured.
9. The method of claim 1, wherein in computing functional similarity through graph embedding vectors, further comprising:
graph embedding vector loss is reduced using twin neural network training.
10. The method of claim 9, wherein calculating functional similarity through graph embedding vectors comprises:
calculating the function similarity by using the cosine distance, traversing the plurality of functions in the firmware, and weighting each function similarity score to serve as the firmware similarity score.
11. A binary code similarity detection apparatus, comprising:
a processor for reading the program in the memory, performing the following processes:
acquiring a control flow chart of a function of a binary firmware file;
extracting semantic information to obtain a code block embedded vector of the control flow chart;
the depth semantic features of the control flow chart are obtained by embedding the vectors into the code blocks, and the sequential perception features of the embedded vectors of the code blocks are determined;
fusing the depth semantic features and the sequential perception features to obtain a graph embedding vector;
calculating function similarity through the graph embedding vector;
a transceiver for receiving and transmitting data under the control of the processor.
12. A binary code similarity detection apparatus, comprising:
the flow chart module is used for acquiring a control flow chart of a function of the binary firmware file;
the embedded vector module is used for extracting semantic information to obtain a code block embedded vector of the control flow chart;
the semantic and sequence module is used for acquiring the depth semantic features of the control flow chart by embedding the code block into the vector and determining the sequence perception features of the code block embedded vector;
the fusion module is used for fusing the depth semantic features and the sequence perception features to obtain a graph embedding vector;
and the similarity module is used for calculating the similarity of the functions through the graph embedding vectors.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 10.
CN202110761988.3A 2021-07-06 2021-07-06 Binary code similarity detection method and device and storage medium Pending CN115587358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761988.3A CN115587358A (en) 2021-07-06 2021-07-06 Binary code similarity detection method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761988.3A CN115587358A (en) 2021-07-06 2021-07-06 Binary code similarity detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN115587358A true CN115587358A (en) 2023-01-10

Family

ID=84772148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761988.3A Pending CN115587358A (en) 2021-07-06 2021-07-06 Binary code similarity detection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN115587358A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115951931A (en) * 2023-03-14 2023-04-11 山东大学 Binary code similarity detection method based on BERT


Similar Documents

Publication Publication Date Title
CN111475649B (en) False news prediction method, system, device and medium based on deep learning
CN107844481B (en) Text recognition error detection method and device
CN112579477A (en) Defect detection method, device and storage medium
CN109376535B (en) Vulnerability analysis method and system based on intelligent symbolic execution
CN112668013B (en) Java source code-oriented vulnerability detection method for statement-level mode exploration
CN115587358A (en) Binary code similarity detection method and device and storage medium
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN110866172B (en) Data analysis method for block chain system
CN116776157A (en) Model learning method supporting modal increase and device thereof
CN116227603A (en) Event reasoning task processing method, device and medium
CN116720184A (en) Malicious code analysis method and system based on generation type AI
CN115048929A (en) Sensitive text monitoring method and device
CN115883878A (en) Video editing method and device, electronic equipment and storage medium
CN113836297A (en) Training method and device for text emotion analysis model
CN113537372B (en) Address recognition method, device, equipment and storage medium
CN113392221B (en) Method and related device for processing thin entity
CN113139187B (en) Method and device for generating and detecting pre-training language model
CN117076596B (en) Data storage method, device and server applying artificial intelligence
CN110795941B (en) Named entity identification method and system based on external knowledge and electronic equipment
CN118036008B (en) Malicious file disguising detection method
CN116578989B (en) Intelligent contract vulnerability detection system and method based on deep pre-training neural network
CN110674497B (en) Malicious program similarity calculation method and device
CN110795940B (en) Named entity identification method, named entity identification system and electronic equipment
CN115269367A (en) Vulnerability detection method based on Transformer model
CN117556047A (en) Medical record information classification method, medical record information classification device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination