CN112596736B

CN112596736B - Semantic-based cross-instruction architecture binary code similarity detection method

Info

Publication number: CN112596736B
Application number: CN202011552657.0A
Authority: CN
Inventors: 王莘; 姜训智; 董少波
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2021-10-08
Anticipated expiration: 2040-12-24
Also published as: CN112596736A

Abstract

The invention discloses a semantic-based method for detecting binary code similarity across instruction architectures and relates to the technical field of vulnerability detection. The method constructs assembly language data sets for different instruction architectures and extracts assembly code pairs that have the same semantics but different instruction architectures; normalizes the assembly code basic block pairs; pre-trains an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task; performs negative sampling; embeds basic blocks together with string semantics, aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector; and trains the embedding network, which outputs embedding vectors for x86 and ARM. The invention can improve the accuracy and efficiency of tasks that require comparing basic blocks of different instruction architectures, such as vulnerability detection and copyright disputes.

Description

Semantic-based cross-instruction architecture binary code similarity detection method

Technical Field

The invention relates to the technical field of vulnerability detection, in particular to a semantic-based method for detecting binary code similarity across instruction architectures.

Background

Existing research on binary code similarity across instruction architectures, such as Gemini and Genius, generally requires manually selected features of the binary code for basic block embedding. Such features not only require expert knowledge, but also carry little information and cannot fully express the semantics of the binary code.

To solve this problem, methods such as SAFE and Asm2vec apply static word representations to binary code. Operators and operands in a basic block are represented as fixed-dimension vectors, and each element of the basic block embedding is an automatically computed real number, which greatly increases the information capacity of the embedding. However, these approaches are not suitable for basic block embedding across instruction architectures, because elements of different instruction architectures never appear in the same context.

Disclosure of Invention

The invention detects binary code similarity between different instruction architectures (such as x86 and ARM) and can be used in fields such as vulnerability detection, copyright disputes and malware analysis. It provides a semantic-based cross-instruction architecture binary code similarity detection method with the following technical scheme:

A semantic-based cross-instruction architecture binary code similarity detection method comprises the following steps:

Step 1: constructing assembly language data sets of different instruction architectures, and extracting assembly code pairs that have the same semantics but different instruction architectures;

Step 2: normalizing the assembly code basic block pairs;

Step 3: pre-training an assembly code encoder by converting the assembly code semantic similarity task into a natural language neural machine translation task;

Step 4: performing negative sampling so that the model can distinguish basic blocks that have different semantics but similar embedding vectors;

Step 5: embedding basic blocks together with string semantics, aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector;

Step 6: training the embedding network, which outputs embedding vectors for x86 and ARM;

Step 7: performing binary code similarity comparison to judge whether the binary codes of two different instruction architectures are similar.

Preferably, the step 1 specifically comprises:

training basic block embedding requires assembly code basic block pairs with the same semantics; source code is compiled into an intermediate representation (IR) with basic block boundary marks using a modified LLVM compiler, and after optimization by the optimizer, assembly code for different instruction architectures is generated using different LLVM back ends, wherein each basic block has a unique identifier and its boundaries are marked; assembly code pairs with the same semantics and different instruction architectures are then extracted by regular-expression filtering.

Preferably, the step 2 specifically comprises:

registers in x86 are divided into 14 classes: pointer registers, floating-point registers, four classes of general-purpose registers, four classes of data registers and four classes of address registers, where the general-purpose, data and address registers are each divided into 4 classes according to the length of the data they store, namely 8 bits, 16 bits, 32 bits and 64 bits; registers in ARM are divided into two classes, general-purpose registers and pointer registers; the assembly code of the x86 basic block is normalized accordingly.

Preferably, the step 3 specifically comprises:

converting the assembly code semantic similarity task into a natural language neural machine translation task, and training the Transformer, a current mainstream neural machine translation model, on the generated assembly code with the same semantics and different instruction architectures, wherein the intermediate vector generated by the model contains rich assembly code semantic information;

the x86 basic block and the ARM basic block are denoted as $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_m)$, respectively, where $s_i$ and $t_j$ are the indices of characters in the x86 vocabulary $V_{\mathrm{x86}}$ and the ARM vocabulary $V_{\mathrm{ARM}}$, respectively; the goal of the Transformer model is to predict the next character $t_k$ of the ARM basic block, and its final output is the probability distribution $\mathbf{p}_k$ of the next character;

the loss function L is the cross-entropy between this distribution and the target character, represented by the following formula:

$$L = -\sum_{k=1}^{m} \mathbf{y}_k^{\top} \log \mathbf{p}_k$$

where $\mathbf{y}_k$ is the one-hot encoding of $t_k$;

an x86 encoder is thereby obtained, which embeds x86 assembly code into a vector containing semantic features.

Preferably, the step 4 specifically includes:

on the basis of the pre-trained model, the similarity of two x86 basic blocks is judged by the Euclidean distance between their embedding vectors $E_1$ and $E_2$, expressed by the following formula:

$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$

where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$;

training the embedding network requires triplets, each consisting of an anchor, a positive sample and a negative sample. An x86-ARM basic block pair A is selected at random to provide the anchor and the positive sample, and another x86-ARM basic block pair B, whose embedding vector is similar to that of pair A but whose semantics differ, is found by computing the distance. When the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; when the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B.

Preferably, the step 5 specifically comprises:

collecting the strings used in the basic blocks, embedding each string into a vector using an LSTM, computing the final string embedding of each basic block by sum pooling over all of its string vectors, and aligning and concatenating the string embedding vector with the embedding vector produced by the embedding network to form a new basic block embedding vector.

Preferably, the step 6 specifically includes:

passing the triplet samples obtained by negative sampling through the embedding network model to obtain the semantic embedding vectors of the positive sample, the negative sample and the anchor;

concatenating the string vectors of the corresponding basic blocks to them to obtain embedding vectors that carry both semantics and strings, wherein the embedding of the anchor should be closer to the embedding of the positive sample than to that of the negative sample, and a margin-based triplet loss function is adopted, expressed as follows:

$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$

where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample, respectively, $\gamma > 0$ is a margin parameter, and $D$ is the Euclidean distance. The loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$. This yields the final embedding model, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors.

Preferably, the step 7 specifically includes:

given two binary codes of different instruction architectures, obtaining their respective assembly code by disassembly, obtaining the semantic embedding vectors of the assembly code basic blocks, and computing the Euclidean distance between the embedding vectors to judge whether the binary codes of the two instruction architectures are similar.

The invention has the following beneficial effects:

The invention combines the semantic features and the string features of basic blocks, and can quickly compare basic blocks of different instruction architectures using the pre-trained embedding model. It can improve the accuracy and efficiency of tasks that require comparing basic blocks of different instruction architectures, such as vulnerability detection and copyright disputes.

Drawings

FIG. 1 is a schematic diagram of LLVM optimization;

FIG. 2 is a schematic illustration of a normalization operation;

FIG. 3 is a flow chart of a semantic-based method for detecting similarity of binary codes across instruction architectures.

Detailed Description

The present invention will be described in detail with reference to specific examples.

The first embodiment is as follows:

According to FIGS. 1-3, the present invention provides a semantic-based cross-instruction architecture binary code similarity detection method. Steps 1 to 7 of this embodiment, including the loss function of step 3, the Euclidean distance and triplet construction of step 4, and the margin-based triplet loss $L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$ of step 6, are as set out in the Disclosure of Invention.

The second embodiment is as follows:

The invention mainly provides a basic block embedding model. Binary code of different instruction architectures is disassembled to obtain the assembly code basic blocks of the corresponding architecture, the basic blocks are embedded into vectors using the proposed embedding model, and the similarity of the binary code of different instruction architectures is determined by comparing the cosine similarity of the embedding vectors.

The two embedding modules of the proposed basic block embedding model can be used for basic blocks of any other architecture; x86 and ARM are specified here only for ease of description.

Step one: constructing assembly language data sets of different instruction architectures.

Training basic block embedding requires a large number of assembly code basic block pairs with the same semantics, in preparation for the subsequent training of the semantic embedding of basic blocks. Assembly code compiled from the same source code is considered semantically equivalent. Therefore, the source code is compiled into an intermediate representation (IR) with basic block boundary marks using a modified LLVM compiler; after optimization by the optimizer, assembly code for different instruction architectures is generated using different LLVM back ends. Each basic block has a unique identifier and its boundaries are marked, so assembly code pairs with the same semantics and different instruction architectures can be extracted by simple regular-expression filtering.
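The pairing by identifier described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not state how the boundary marks appear in the generated assembly, so the `# BB_BEGIN <id>` / `# BB_END <id>` comment format, the function names and the sample listings are all assumptions.

```python
import re

# Hypothetical marker format: the source does not specify how boundaries are
# written, so "# BB_BEGIN <id>" / "# BB_END <id>" comment lines are assumed.
BLOCK_RE = re.compile(
    r"^# BB_BEGIN (?P<id>\S+)\n(?P<body>.*?)^# BB_END (?P=id)$",
    re.MULTILINE | re.DOTALL,
)


def extract_blocks(asm_text):
    """Map each basic block identifier to its list of instructions."""
    blocks = {}
    for match in BLOCK_RE.finditer(asm_text):
        lines = [ln.strip() for ln in match.group("body").splitlines()]
        blocks[match.group("id")] = [ln for ln in lines if ln]
    return blocks


def pair_blocks(x86_asm, arm_asm):
    """Pair x86 and ARM basic blocks that carry the same identifier."""
    x86_blocks = extract_blocks(x86_asm)
    arm_blocks = extract_blocks(arm_asm)
    shared = sorted(x86_blocks.keys() & arm_blocks.keys())
    return [(bb_id, x86_blocks[bb_id], arm_blocks[bb_id]) for bb_id in shared]


# Made-up listings standing in for the output of two LLVM back ends.
X86_SAMPLE = """# BB_BEGIN f.entry
movl $1, %eax
addl %edi, %eax
# BB_END f.entry
# BB_BEGIN f.exit
retq
# BB_END f.exit
"""

ARM_SAMPLE = """# BB_BEGIN f.entry
add r0, r0, #1
# BB_END f.entry
# BB_BEGIN f.exit
bx lr
# BB_END f.exit
"""

pairs = pair_blocks(X86_SAMPLE, ARM_SAMPLE)
for bb_id, x86_block, arm_block in pairs:
    print(bb_id, x86_block, arm_block)
```

Blocks that appear in only one of the two listings are dropped, since they cannot form a pair.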

Step two: the assembly code basic block pairs are normalized.

If raw assembly code is fed directly into the neural network to learn semantics, some tokens may be out-of-vocabulary (OOV). In addition, some registers can be replaced by other registers of the same class without changing the semantics. Therefore, the assembly code needs to be normalized.

Constants in the assembly code are divided into five types, namely immediate values, addresses, variable names, function names and basic block labels, which are replaced with the tokens 'IMM', 'ADDRESS', 'VAR', 'FUNC' and 'BB', respectively.

For registers, registers in x86 are classified into 14 types, including pointer registers, floating point registers, four types of general purpose registers, four types of data registers, and four types of address registers. The general register, the data register and the address register are divided into 4 types according to the length of data stored by the general register, the data register and the address register, and the types are respectively 8 bits, 16 bits, 32 bits and 64 bits. Registers in the ARM are divided into two categories: general purpose registers and pointer registers.

As shown in FIG. 2, the normalization operation transforms the assembly code of the x86 basic block on the left into the form on the right.
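The constant replacement can be illustrated with a short sketch. It assumes Intel-syntax x86 instructions, a small hand-picked set of jump mnemonics, and a caller-supplied set of known variable names; none of these details come from the source, and register classes are left untouched here (they could be handled with a lookup table in the same way).

```python
import re

# Assumed subset of jump mnemonics whose operand is a basic block label.
JUMPS = {"jmp", "je", "jne", "jg", "jl", "jge", "jle", "ja", "jb"}

NUMBER = r"(0x[0-9a-fA-F]+|\d+)"


def normalize_instruction(ins, known_vars=frozenset()):
    """Replace constants in one Intel-syntax instruction with class tokens."""
    mnemonic, _, rest = ins.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",")] if rest.strip() else []
    out = []
    for op in operands:
        if mnemonic == "call":
            out.append("FUNC")              # function name
        elif mnemonic in JUMPS:
            out.append("BB")                # basic block label
        elif op in known_vars:
            out.append("VAR")               # variable name
        elif re.fullmatch(r"\[" + NUMBER + r"\]", op):
            out.append("[ADDRESS]")         # absolute memory address
        elif re.fullmatch(r"-?" + NUMBER, op):
            out.append("IMM")               # immediate value
        else:
            # Numeric displacements inside memory operands, e.g. [rbp-8];
            # digits that are part of a register name such as r8 are kept.
            out.append(re.sub(r"(?<!\w)" + NUMBER + r"(?!\w)", "IMM", op))
    return mnemonic if not out else f"{mnemonic} {', '.join(out)}"


block = ["mov eax, 0x10", "mov ebx, [rbp-8]", "call printf", "jne .LBB0_2"]
normalized = [normalize_instruction(ins) for ins in block]
print(normalized)
```

Mapping every concrete constant to one of five tokens keeps the vocabulary small, which is what avoids the OOV problem mentioned above.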

Step three: pre-training of the encoder of the assembly code.

The assembly code semantic similarity task is converted into a natural language neural machine translation task. The large number of assembly code pairs with the same semantics and different instruction architectures generated in steps one and two are used to train the Transformer, a current mainstream neural machine translation model, and the intermediate vector generated by the model contains rich assembly code semantic information. The x86 basic block and the ARM basic block are denoted as $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_m)$, respectively, where $s_i$ and $t_j$ are the indices of characters in the x86 vocabulary $V_{\mathrm{x86}}$ and the ARM vocabulary $V_{\mathrm{ARM}}$, respectively. The goal of the Transformer model is to predict the next character $t_k$ of the ARM basic block, and its final output is the probability distribution $\mathbf{p}_k$ of the next character. The loss function is the cross-entropy between this distribution and the target character:

$$L = -\sum_{k=1}^{m} \mathbf{y}_k^{\top} \log \mathbf{p}_k$$

where $\mathbf{y}_k$ is the one-hot encoding of $t_k$. An x86 encoder is thus obtained, which can embed x86 assembly code into a vector containing semantic features.
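A loss of this kind, built from a predicted distribution over the next character and the one-hot encoding of the true character, is commonly computed as a cross-entropy. The sketch below assumes that form, a softmax over raw decoder scores, and a sum over positions; the function name and the toy inputs are illustrative only, and no Transformer is involved.

```python
import numpy as np


def cross_entropy_loss(logits, target_ids):
    """Sum over positions of -log p(target), with softmax over the vocabulary.

    logits:     array of shape (m, |V_ARM|), one row per target position
    target_ids: the m true character indices t_1..t_m
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # One-hot encoding of each target character.
    one_hot = np.eye(logits.shape[1])[target_ids]
    return float(-(one_hot * log_probs).sum())


# Uniform predictions over a 4-character vocabulary for 2 positions:
# each position contributes log(4) to the loss.
uniform_loss = cross_entropy_loss(np.zeros((2, 4)), [1, 3])

# Predictions that put almost all mass on the right character give a loss
# close to zero.
confident = np.full((2, 4), -20.0)
confident[0, 1] = 20.0
confident[1, 3] = 20.0
confident_loss = cross_entropy_loss(confident, [1, 3])
print(uniform_loss, confident_loss)
```

Because the target is one-hot, only the log-probability assigned to the true next character contributes at each position.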

Step four: negative sampling.

The samples added by negative sampling enable the model to distinguish basic blocks that have different semantics but similar embedding vectors, which improves the performance of the model. On the basis of the pre-trained model, the similarity of two x86 basic blocks can be judged by the Euclidean distance between their embedding vectors $E_1$ and $E_2$:

$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$

where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$. The smaller the Euclidean distance, the higher the similarity of the two basic blocks. Triplets (anchor, positive sample and negative sample) are used to train the embedding network. An x86-ARM basic block pair A is selected at random to provide the anchor and the positive sample. By computing the distance, another x86-ARM basic block pair B can be found whose embedding vector is similar to that of pair A but whose semantics differ. If the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; if the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B.
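The triplet construction can be sketched with NumPy. The sketch assumes that every other pair in the data set has different semantics from pair A (the source does not say how this is checked) and that the nearest pair by x86 embedding distance is taken as pair B; the helper names and toy data are hypothetical.

```python
import numpy as np


def euclidean(e1, e2):
    """Euclidean distance between two embedding vectors."""
    diff = np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def pick_hard_negative(x86_embs, a_idx):
    """Index of the pair whose x86 embedding is nearest to that of pair A."""
    dists = np.linalg.norm(x86_embs - x86_embs[a_idx], axis=1)
    dists[a_idx] = np.inf  # a pair cannot be its own negative
    return int(np.argmin(dists))


def make_triplets(pairs, x86_embs, a_idx):
    """Build both (anchor, positive, negative) triplets for pair A."""
    b_idx = pick_hard_negative(x86_embs, a_idx)
    x86_a, arm_a = pairs[a_idx]
    x86_b, arm_b = pairs[b_idx]
    return [
        (arm_a, x86_a, x86_b),  # ARM anchor: x86 positive, x86 negative
        (x86_a, arm_a, arm_b),  # x86 anchor: ARM positive, ARM negative
    ]


# Toy data: three (x86, ARM) pairs and the pre-trained x86 embeddings.
pairs = [("x86_0", "arm_0"), ("x86_1", "arm_1"), ("x86_2", "arm_2")]
x86_embs = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])

triplets = make_triplets(pairs, x86_embs, a_idx=0)
print(euclidean(x86_embs[0], x86_embs[1]), triplets)
```

Choosing the nearest pair rather than a random one yields negatives that the pre-trained encoder currently confuses with the anchor, which is the case the added samples are meant to address.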

Step five: basic block embedding in conjunction with string semantics.

Since some variables in the assembly code represent strings, and the strings in basic blocks of different instruction architectures are mostly identical, strings are of great help in matching two basic blocks. The strings used in the basic blocks are first collected; each string is then embedded into a vector using an LSTM, and the final string embedding of each basic block is computed by sum pooling over all of its string vectors. Finally, the string embedding vector is aligned and concatenated with the embedding vector produced by the embedding network to form a new basic block embedding vector.
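The pooling and concatenation can be sketched as below. To keep the example self-contained, the LSTM string encoder is swapped for a toy stand-in (bucketed character counts); that stand-in, the dimension of 4, and the use of a zero vector for blocks without strings are assumptions made purely for illustration.

```python
import numpy as np

STRING_DIM = 4  # assumed size of each string vector


def toy_string_vector(text, dim=STRING_DIM):
    """Stand-in for the LSTM encoder (assumption): bucketed character counts."""
    vec = np.zeros(dim)
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def string_embedding(strings, dim=STRING_DIM):
    """Sum pooling over the vectors of all strings used in one basic block."""
    if not strings:
        # Assumed convention: a block without strings gets a zero vector,
        # so every block embedding has the same length.
        return np.zeros(dim)
    return np.sum([toy_string_vector(s, dim) for s in strings], axis=0)


def combined_embedding(semantic_vec, strings, dim=STRING_DIM):
    """Concatenate the semantic embedding with the pooled string embedding."""
    semantic_vec = np.asarray(semantic_vec, dtype=float)
    return np.concatenate([semantic_vec, string_embedding(strings, dim)])


# A 2-dimensional semantic vector plus two strings used by the block.
combined = combined_embedding([0.5, -0.5], ["ab", "b"])
no_strings = combined_embedding([0.5, -0.5], [])
print(combined, no_strings)
```

Sum pooling makes the result independent of the order in which the strings occur, so the same set of strings yields the same string embedding on both architectures.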

Step six: training the embedding network.

First, the triplet samples obtained by negative sampling are passed through the embedding network model to obtain the semantic embedding vectors of the positive sample, the negative sample and the anchor. The string vectors of the corresponding basic blocks are then concatenated to them, yielding embedding vectors that carry both semantics and strings. The embedding of the anchor should be closer to the embedding of the positive sample than to that of the negative sample, so a margin-based triplet loss function is used:

$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$

where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample, respectively, $\gamma > 0$ is a margin parameter, and $D$ is the Euclidean distance. The loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$. This yields the final embedding model, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors.
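The margin-based triplet loss can be evaluated directly from three embedding vectors. The sketch below only computes the loss value for given vectors; the margin of 1.0 and the sample vectors are arbitrary illustrative choices, and no network training is shown.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max{D(E1, E2) - D(E1, E3) + margin, 0} with Euclidean distance D."""
    anchor = np.asarray(anchor, dtype=float)
    d_pos = np.linalg.norm(anchor - np.asarray(positive, dtype=float))
    d_neg = np.linalg.norm(anchor - np.asarray(negative, dtype=float))
    return float(max(d_pos - d_neg + margin, 0.0))


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor

# Negative far away (distance 5): the margin is already satisfied.
easy_loss = triplet_loss(anchor, positive, [3.0, 4.0])

# Negative at distance 1.5: closer than positive distance + margin.
hard_loss = triplet_loss(anchor, positive, [0.0, 1.5])
print(easy_loss, hard_loss)
```

Triplets that already satisfy the margin contribute zero loss, so only negatives that lie too close to the anchor drive the update.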

Step seven: final binary code similarity comparison.

Given two binary codes of different instruction architectures, their respective assembly code is obtained by disassembly. Using the embedding model obtained in the preceding six steps, the semantic embedding vectors of the assembly code basic blocks are obtained, and the Euclidean distance between the embedding vectors is computed to judge whether the binary codes of the two instruction architectures are similar.
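The final comparison can be sketched once the basic block embeddings are available. The source does not state how distances are turned into a decision, so the threshold of 0.5 and the nearest-block matching below are assumptions; the embeddings are toy values rather than model outputs.

```python
import numpy as np


def nearest_arm_blocks(x86_embs, arm_embs):
    """For each x86 block, the index and distance of the nearest ARM block."""
    diffs = x86_embs[:, None, :] - arm_embs[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # pairwise Euclidean distances
    nearest = dists.argmin(axis=1)
    return [(int(j), float(dists[i, j])) for i, j in enumerate(nearest)]


def is_similar(e_x86, e_arm, threshold=0.5):
    """Assumed decision rule: similar if the distance is within a threshold."""
    diff = np.asarray(e_x86, dtype=float) - np.asarray(e_arm, dtype=float)
    return bool(np.linalg.norm(diff) <= threshold)


# Toy embeddings of the basic blocks of an x86 binary and an ARM binary.
x86_embs = np.array([[0.0, 0.0], [1.0, 1.0]])
arm_embs = np.array([[1.0, 1.1], [0.1, 0.0]])

matches = nearest_arm_blocks(x86_embs, arm_embs)
print(matches)
print(is_similar(x86_embs[0], arm_embs[1]), is_similar(x86_embs[0], arm_embs[0]))
```

In practice the threshold would have to be chosen on held-out data, since the scale of the distances depends on the trained embedding model.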

The above is only a preferred embodiment of the semantic-based cross-instruction architecture binary code similarity detection method. The protection scope of the method is not limited to the above embodiments, and all technical solutions under the same idea fall within the protection scope of the present invention. It should be noted that modifications and variations that those skilled in the art may make without departing from the gist of the invention are also intended to fall within the scope of the invention.

Claims

1. A semantic-based cross-instruction architecture binary code similarity detection method, characterized by comprising the following steps:

step 2: normalizing the assembly code basic block pair;

the step 4 specifically comprises the following steps:

on the basis of the pre-trained model, the similarity of two x86 basic blocks is judged by the Euclidean distance between their embedding vectors $E_1$ and $E_2$, expressed by the following formula:

$$D(E_1, E_2) = \sqrt{\sum_{i=1}^{d} (e_{1i} - e_{2i})^2}$$

where $d$ is the embedding dimension, $e_{1i} \in E_1$ and $e_{2i} \in E_2$;

when the embedding network is trained, triplets are required, each comprising an anchor, a positive sample and a negative sample; an x86-ARM basic block pair A is selected at random to provide the anchor and the positive sample, and another x86-ARM basic block pair B, whose embedding vector is similar to that of pair A but whose semantics differ, is found by computing the distance; when the anchor is the ARM basic block of pair A, the positive sample is the x86 basic block of pair A and the negative sample is the x86 basic block of pair B; when the anchor is the x86 basic block of pair A, the positive sample is the ARM basic block of pair A and the negative sample is the ARM basic block of pair B;

the step 6 specifically comprises the following steps:

$$L = \max\{D(E_1, E_2) - D(E_1, E_3) + \gamma,\ 0\}$$

where $E_1$, $E_2$ and $E_3$ are the embeddings of the anchor, the positive sample and the negative sample, respectively, $\gamma > 0$ is a margin parameter, and $D$ is the Euclidean distance; the loss encourages $D(E_1, E_2)$ to be smaller than $D(E_1, E_3)$ by at least the margin $\gamma$, yielding the final embedding model, which takes x86 and ARM assembly code basic blocks as input and outputs x86 and ARM embedding vectors;

2. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 1 specifically comprises the following steps:

3. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 2 specifically comprises the following steps:

4. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 3 specifically comprises the following steps:

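the loss function L is the cross-entropy between the predicted probability distribution $\mathbf{p}_k$ of the next character and the target character, represented by the following formula:

$$L = -\sum_{k=1}^{m} \mathbf{y}_k^{\top} \log \mathbf{p}_k$$

where $\mathbf{y}_k$ is the one-hot encoding of $t_k$;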

5. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 5 specifically comprises the following steps:

6. The semantic-based cross-instruction architecture binary code similarity detection method according to claim 1, characterized by: the step 7 specifically comprises the following steps: