CN113010209A - Binary code similarity comparison technology for resisting compiling difference - Google Patents


Info

Publication number
CN113010209A
CN113010209A (application CN202011117765.5A)
Authority
CN
China
Prior art keywords
graph
vector
function
embedding
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011117765.5A
Other languages
Chinese (zh)
Inventor
刘嘉勇
王炎
贾鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011117765.5A priority Critical patent/CN113010209A/en
Publication of CN113010209A publication Critical patent/CN113010209A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/75 - Structural analysis for program understanding
    • G06F 8/751 - Code clone detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G06F 8/42 - Syntactic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G06F 8/43 - Checking; Contextual analysis
    • G06F 8/436 - Semantic checking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of binary code similarity detection and to graph convolutional neural networks, and aims to provide a binary code similarity comparison technique that resists compilation differences. The core of the technique is to convert a binary function into a graph network and to learn the semantic, syntactic and structural information of the function with a graph embedding network, so as to generate a graph embedding vector representation of the function. The workflow is to convert a binary function into an attribute control flow graph and then extract the syntactic and semantic features of each node. A graph embedding network is trained separately on the syntax attribute control flow graph and the semantic attribute control flow graph of the function, finally generating a graph embedding representation of the function. An attention mechanism is applied during embedding vector aggregation at the instruction, basic block and function levels, respectively. Finally, the generated embedding vectors are used for similarity comparison across compilers, optimization levels, program versions and obfuscated code. The technique provides a new solution for binary code similarity detection.

Description

Binary code similarity comparison technology for resisting compiling difference
Technical Field
The invention relates to the technical field of binary code similarity detection and to graph convolutional neural networks. Its core is to combine the syntactic, semantic and structural features of a function: an attention-based graph convolutional neural network produces a syntax graph embedding vector and a semantic graph embedding vector of the function, a multi-layer attention mechanism aggregates them into the final function graph embedding vector, and similarity detection across compilers, optimization levels, program versions and obfuscated code is performed on the basis of this vector.
Background
Code similarity detection compares two or more code segments to determine whether similar code exists between them, and can be divided into source code similarity detection and binary code similarity detection. Binary code lacks much of the symbolic information present in source code, and the use of different compilers, different optimization levels, different program versions and obfuscation techniques makes binary code similarity detection considerably harder. Binary code similarity detection can be applied in many fields, such as code clone detection, cryptographic function identification, malicious code detection, malicious code family classification, vulnerability search and security patch analysis. Most traditional methods are based on fuzzy hash matching, minimum subgraph matching, or techniques such as symbolic execution and taint analysis. In recent years, machine learning-based methods have been proposed for similarity detection, which mainly learn the syntactic, semantic and structural information of a function to construct a high-dimensional vector representation. However, existing detection techniques still have the following problems.
Firstly, the information extracted from a binary function is not comprehensive enough; detection that uses only the syntactic information or only the semantic information of a function yields low precision.
Secondly, existing assembly instruction normalization methods suffer from a serious Out-of-Vocabulary (OOV) problem.
Thirdly, existing methods have low robustness and cannot effectively resist the interference introduced by cross-version, cross-compiler, cross-optimization-level, cross-program and obfuscation differences.
At present, binary code checks that assume the same compiler version, the same optimization level, the same program version and no obfuscation cannot meet current similarity detection requirements; in particular, the detection precision of prior techniques is low when the code is protected by obfuscation. A new method is needed that can overcome these differences under the many influencing factors present today, so as to effectively improve the precision of binary code similarity detection.
Disclosure of Invention
The invention provides a binary code similarity comparison technique that resists compilation differences, devised to solve the problems encountered by the prior art in binary code similarity detection. The invention addresses the low detection precision of existing methods under the large differences introduced by different compilers, different optimization levels, different versions and obfuscation techniques, and provides a binary function embedding generation technique based on a hierarchical attention graph neural network that effectively resists the influence of these difference factors and improves detection precision. The method offers a new detection approach: a new word vector generation model is constructed by combining the multi-dimensional features of a binary function, and graph convolutional neural network technology preserves more of the function's hidden information, so that the influence of the various compilation differences on detection accuracy is effectively resisted. The method can be widely used in various binary code similarity detection scenarios, and compared with traditional methods its detection precision under large differences is significantly improved.
To achieve the above object, the invention provides a binary code similarity comparison technique that resists compilation differences. It extracts syntactic, semantic and structural features from a given binary function, generates basic block embedding vectors based on a pre-trained word vector model, trains a graph convolutional neural network on the attribute control flow graph of the function to generate the final function embedding vector representation, and uses that embedding vector for function similarity detection. The invention proposes a hierarchical attention graph embedding network, reflecting the fact that different instructions, different basic blocks, and the syntactic and semantic information of a function have different importance for similarity comparison. The network contains three levels of attention: instruction level, basic block level and function level. The technical framework comprises five modules: feature extraction, an instruction embedding generation model, a basic block embedding vector generation model, a graph embedding network, and a similarity comparison model.
The feature extraction module extracts the syntactic features, semantic features and control flow graph of the function and normalizes the extracted features. The instruction embedding generation model trains an embedding vector for each word of the normalized assembly instructions and generates an embedding vector representation of each instruction. The basic block embedding vector generation module uses the syntactic and semantic features of each basic block to generate a basic block syntax embedding vector and a basic block semantic embedding vector, respectively, where the semantic embedding representation of a basic block is generated by aggregating instruction embedding vectors with an attention mechanism. The graph embedding network module trains on the function's syntax attribute control flow graph and semantic attribute control flow graph, generates the function's syntax embedding vector and semantic embedding vector using a basic-block-level attention mechanism, and finally generates the function embedding vector by aggregating the syntax and semantic embedding vectors with a function-level attention mechanism. The similarity comparison module computes the similarity of the two generated function graph embeddings to obtain the final similarity result.
Drawings
The objects, implementations, advantages and features of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.
FIG. 1 is an architectural diagram showing the overall structure of the detection technique of the present invention.
FIG. 2 is a diagram illustrating an example of instruction normalization in the detection technique of the invention.
FIG. 3 is a flow diagram illustrating a word embedding training model in the detection technique of the present invention.
FIG. 4 is a flow chart illustrating the basic block embedding vector generation process in the detection technique of the present invention.
FIG. 5 is an architectural diagram illustrating a graph embedding a network model in the detection technique of the present invention.
FIG. 6 is a flow chart illustrating attention-based embedded vector aggregation in the detection technique of the present invention.
FIG. 7 is a flow chart illustrating a comparison of binary function similarity in the detection technique of the present invention.
Detailed Description
The similarity detection technique can be used for cross-compiler, cross-optimization-level and cross-version code clone checking, and can also be applied in scenarios where different code obfuscation techniques are used. The invention is further described below with reference to the accompanying drawings. The invention aims to provide a binary function similarity comparison technique for a variety of complex compilation scenarios: based on the information contained in a binary function, graph embedding network training yields a high-precision vector representation that can be used effectively for scenarios such as code clone detection, vulnerability search and malicious code family classification.
FIG. 1 is an architectural diagram illustrating the technique of the present invention.
As shown in fig. 1, the detection technique first extracts the attribute control flow graph of a function: each basic block is a node, the jump relationships between basic blocks are the edges between nodes, and the attributes are the assembly instructions of the basic blocks. After the attribute control flow graph is extracted, two types of features are extracted for each basic block: syntactic features and semantic features. The syntactic features mainly count statistical and structural properties of the instructions in each basic block. For the semantic features, the assembly instructions are treated as text, and a vector representation of each instruction is obtained using natural language processing techniques. The extracted syntactic and semantic features are then input into the basic block embedding generation model to obtain the syntax embedding representation and the semantic embedding representation of each basic block in the attribute control flow graph. At this point each function has a syntax attribute control flow graph and a semantic attribute control flow graph; these two graphs are input into an attention-based graph embedding network and trained to obtain the function's semantic graph embedding vector and syntax graph embedding vector, and finally an attention mechanism aggregates the two vectors into the final function graph embedding representation. The detection technique computes the distance between function embedding representations with a Siamese network to represent the similarity between functions.
FIG. 2 is a schematic diagram illustrating the normalization process of an assembly instruction.
As shown in fig. 2, an assembly instruction is mainly composed of an opcode and operands. The number of opcodes on each platform is not large, but the number of possible operand combinations is large. Even with the same compiler, the same optimization level and the same source code, compiling at different times can result in operands using different registers. To avoid these differences more effectively, the operands of assembly instructions must be normalized. The normalization scheme adopted by the invention covers three cases. If the operand is of register type, it is normalized to 'reg1', 'reg2', 'reg3' or 'reg4' according to the size of the register. If the operand is of immediate type, it is checked whether the immediate is a string: if so, it is represented by 'STR', otherwise by 'HIMM'. If the operand is of memory type, it is checked whether it is addressed via a base register: if not, it is replaced by '[MEM]'; if it is addressed via a base register without an index, it is represented as '[normalized register + HIMM]'; otherwise it is represented as '[normalized register + index + HIMM]'.
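The three rules above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the register table is a small assumed x86-64 subset, and `string_imms` (the set of immediates known to be strings) is a hypothetical input.

```python
import re

# Assumed, illustrative register-size table (x86-64 subset).
REG_SIZES = {"rax": 8, "rbx": 8, "rbp": 8, "rsp": 8,
             "eax": 4, "ebx": 4, "ax": 2, "al": 1}
SIZE_TO_TOKEN = {8: "reg1", 4: "reg2", 2: "reg3", 1: "reg4"}

def normalize_operand(op: str, string_imms: set) -> str:
    """Normalize one operand following the three rules described above."""
    op = op.strip()
    if op in REG_SIZES:                          # rule 1: register type
        return SIZE_TO_TOKEN[REG_SIZES[op]]
    if op.startswith("[") and op.endswith("]"):  # rule 3: memory type
        parts = [p.strip() for p in re.split(r"[+*]", op[1:-1])]
        regs = [SIZE_TO_TOKEN[REG_SIZES[p]] for p in parts if p in REG_SIZES]
        if not regs:
            return "[MEM]"                       # no base register
        if len(regs) == 1:
            return f"[{regs[0]}+HIMM]"           # base register only
        return f"[{regs[0]}+{regs[1]}+HIMM]"     # base register + index
    # rule 2: immediate type
    return "STR" if op in string_imms else "HIMM"
```

For example, `normalize_operand("rax", set())` yields `"reg1"` and `normalize_operand("[rbp+8]", set())` yields `"[reg1+HIMM]"` under this assumed register table.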
Fig. 3 is a diagram describing the word embedding model.
After the assembly instructions are normalized, the opcodes and operands are used as tokens for word vector training. To construct a training corpus whose tokens carry more semantic information, the invention performs random walks on the inter-procedural function call graph and combines the tokens along each walk path into a sentence. The entire binary file is then used as a document for word vector training; the training process is illustrated in FIG. 3. When the vector of a target token is trained, the previous and next instructions of the target instruction are taken as context. If the target instruction is the first instruction, only the next instruction is taken; if it is the last instruction, only the previous instruction is taken. Each instruction is processed as follows: every token has an initialization vector; the average of the token vectors in the instruction is taken first, then the opcode token vector is divided by the number of tokens in the instruction and subtracted from that average. The result is concatenated with the opcode token vector to obtain the vector representation of the instruction. The vectors of the previous and next instructions of the target instruction are averaged, and NCE loss is then used to compute the vector representation of the tokens in the target instruction. Repeating these steps finally yields a vector representation of each token.
Considering that opcode tokens differ in importance, the invention obtains the importance of each opcode token with TF-IDF, multiplies the opcode token vector by this importance parameter, and concatenates the result with the average of the operand token vectors to obtain the vector representation of each assembly instruction.
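A sketch of this weighting step, assuming the weighted opcode vector is concatenated with the mean of the operand vectors (the TF-IDF weight itself would come from a separately fitted model):

```python
import numpy as np

def instruction_vector(opcode_vec, operand_vecs, tfidf_weight):
    """Combine the TF-IDF-weighted opcode embedding with the mean of the
    operand embeddings to form one instruction vector (sketch)."""
    weighted = tfidf_weight * np.asarray(opcode_vec)
    if operand_vecs:
        operand_part = np.mean(np.stack(operand_vecs), axis=0)
    else:
        operand_part = np.zeros_like(weighted)   # e.g. `ret` has no operands
    return np.concatenate([weighted, operand_part])
```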
Fig. 4 is a flow chart describing the basic block embedding vector generation process.
As shown in fig. 4, the two types of features extracted from a basic block are processed in different ways to obtain the corresponding basic block embedding vector representations. The syntactic features count, for each basic block, the number of constants, the number of strings, the number of transfer instructions, the number of call instructions, the number of instructions in the basic block, the number of arithmetic instructions, the number of successor nodes, and the betweenness centrality. The values of these 8 features are combined into a 1×8 array, which is used directly as the syntax embedding representation of the basic block. For the semantic features, each instruction is first normalized according to the normalization rules to obtain normalized tokens. The normalized tokens of the basic block are input into the instruction embedding model to obtain the embedding vector representation of each instruction. The instructions in a basic block differ in importance; to highlight the effect of important instructions and weaken the contribution of others, the invention uses an attention mechanism to aggregate the instruction vectors of the basic block. The aggregation yields the semantic embedding representation of the basic block.
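The attention-based aggregation of instruction vectors can be sketched as a weighted sum. The patent names the mechanism but not its exact form, so the dot-product scoring against a learned `query` vector below is an assumption:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

def attention_aggregate(vectors, query):
    """Weighted sum of instruction vectors; each instruction's weight is its
    softmax-normalized dot product with a learned query vector (assumed form)."""
    V = np.stack(vectors)          # (n_instructions, d)
    weights = softmax(V @ query)   # (n_instructions,)
    return weights @ V             # (d,) basic-block semantic embedding
```

The same routine serves at any level of the hierarchy: instructions into a basic block here, and (with different learned parameters) basic blocks into a function later.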
FIG. 5 is an architectural diagram depicting the graph embedding network model.
As shown in FIG. 5, the invention uses a struc2vec graph embedding network as the training network, but improves on the basic network. After T rounds of iteration, the basic struc2vec network aggregates node embeddings by summation. The invention instead takes into account the difference in importance between basic blocks and introduces an attention mechanism: in the improved struc2vec network, aggregation after T rounds of iteration is performed through attention, so that important basic blocks receive more weight in the aggregate.
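The iterative part of a structure2vec-style network can be sketched as below. The weight matrices `W1`, `W2`, the `tanh` nonlinearity and the iteration count are assumptions for illustration; the patent's improvement replaces the final summation over rows with the attention aggregation described above.

```python
import numpy as np

def struc2vec_embed(X, adj, W1, W2, T=5):
    """structure2vec-style iteration (sketch): for T rounds, each node's
    embedding is refreshed from its own features and the sum of its
    neighbours' current embeddings.
    X:   (n_nodes, d_in) basic-block feature vectors
    adj: (n_nodes, n_nodes) adjacency matrix of the control flow graph
    W1:  (d_out, d_in), W2: (d_out, d_out) assumed learned weights."""
    mu = np.zeros((X.shape[0], W1.shape[0]))   # node embeddings start at zero
    for _ in range(T):
        mu = np.tanh(X @ W1.T + adj @ mu @ W2.T)
    return mu  # per-node embeddings, to be aggregated (here: by attention)
```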
FIG. 6 is a flow chart describing embedded vector aggregation based on an attention mechanism.
As shown in FIG. 6, after graph embedding network training, each attribute control flow graph can be represented by a graph embedding vector. In the invention, a syntax attribute control flow graph and a semantic attribute control flow graph are extracted from each function, and a function syntax graph embedding vector and a function semantic graph embedding vector are generated correspondingly. To aggregate the two vector representations more effectively and reflect their different contributions to the function graph embedding, the invention aggregates them with an attention mechanism. First, the function semantic graph embedding is concatenated with the function syntax graph embedding, and the semantic graph embedding is also concatenated with itself; parameters are then introduced to compute the attention weights. The attention weights are multiplied by the corresponding embedding vectors and summed to obtain the final function graph embedding representation.
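This two-vector fusion can be sketched as follows, mirroring the pairing in the text. The scoring vector `w` stands in for the introduced parameters and is an assumption; the patent does not give the exact scoring function.

```python
import numpy as np

def fuse_graph_embeddings(sem, syn, w):
    """Attention fusion of the semantic and syntax graph embeddings:
    score [sem; sem] and [sem; syn] with an assumed learned vector w,
    softmax the scores, and take the weighted sum of the two embeddings."""
    scores = np.array([np.concatenate([sem, sem]) @ w,
                       np.concatenate([sem, syn]) @ w])
    e = np.exp(scores - scores.max())
    a = e / e.sum()                 # attention weights for (semantic, syntax)
    return a[0] * sem + a[1] * syn  # final function graph embedding
```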
FIG. 7 is a flow chart of binary function similarity comparison.
As shown in fig. 7, the invention adopts a Siamese architecture for similarity detection. To use this architecture, the function graph embedding generation network is duplicated into two copies that share parameters, and a pair of binary functions is given as input. After passing through the graph embedding network, each function is represented by a function graph embedding vector, and the cosine distance between the two vectors is computed to obtain the final similarity.
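The final comparison reduces to cosine similarity between the two embeddings produced by the twin branches:

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity score between two function graph embedding vectors:
    1.0 for identical directions, 0.0 for orthogonal vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```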
As described above, the invention performs binary code similarity detection by generating a graph embedding representation of a binary function, with the following advantages: 1. Features of different dimensions of the function are extracted, improving detection precision. 2. Compared with existing methods, the proposed normalization method for assembly instructions is more effective at avoiding the OOV problem. 3. Attention mechanisms are introduced at three points: instruction aggregation, basic block aggregation and function embedding aggregation. This lets the more important vectors have a greater impact on the final similarity comparison during aggregation. 4. The detection technique effectively improves similarity detection precision across different compilers, different optimization levels, different compilation platforms and obfuscated code scenarios.
Although the preferred embodiments of the present invention have been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims (7)

1. A binary code similarity comparison technique against compile differences, the method comprising the steps of:
A. disassembling the binary file, constructing a binary function attribute control flow graph, and extracting the grammar and semantic information of the basic block;
B. constructing a basic block embedded vector generation model, and converting the grammar and semantic information of the extracted basic block into a numerical vector;
C. respectively training a grammatical attribute control flow graph and a semantic attribute control flow graph of a function by using a graph embedding network to generate a function grammatical graph embedding vector and a function semantic graph embedding vector;
D. embedding vectors by using a grammar graph of the attention mechanism aggregation function and embedding vectors by using a semantic graph to generate final function graph embedding;
E. and performing similarity detection between functions based on the generated graph embedding vector.
2. The binary code similarity comparison technique against compile differences according to claim 1, wherein said step a further comprises the steps of:
a1, the extracted grammatical features include 8 types: the number of constants, the number of character strings, the number of transfer instructions, the number of call instructions, the number of basic block instructions, the number of arithmetic instructions, the number of subsequent nodes and the centrality of betweenness;
a2, the extracted semantic features are assembly instructions of the functions.
3. The binary code similarity comparison technique against compile differences as claimed in claim 2, wherein the assembler instruction extracted in step a2 further needs to be normalized as follows:
a21, if the operand of the instruction is of register type, normalizing it into 'reg1', 'reg2', 'reg3' or 'reg4' according to the size of the register;
a22, if the operand of the instruction is of immediate type, judging whether the immediate is a string: if so, representing it by 'STR', otherwise by 'HIMM';
a23, if the operand is of memory type, judging whether it is addressed via a base register: if not, replacing it by '[MEM]'; if it is addressed via a base register without an index, representing it as '[normalized register + HIMM]'; otherwise representing it as '[normalized register + index + HIMM]'.
4. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step B further comprises the steps of:
b1, counting the 8 syntactic features in each basic block of the function and directly combining the values into a vector that represents the syntax vector of the basic block;
b2, performing random walks on the inter-procedural function call graph, converting the normalized assembly instructions along each walk path into a token corpus, and then performing word vector training with the entire binary program as a document;
b3, multiplying the opcode token vector by a term frequency-inverse document frequency (TF-IDF) parameter, and then combining it with the average of the operand token vectors to obtain the vector representation of each assembly instruction;
and B4, aggregating the vector of each assembly instruction obtained in the basic block through an attention mechanism to obtain a semantic vector representation of the basic block.
5. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step C further comprises the steps of:
c1, constructing a grammatical attribute control flow graph by taking the grammatical vector of each basic block generated by the B1 as the attribute of the control flow graph, taking the grammatical control flow graph as input, embedding a network training model by using a Struc2vec graph, and performing aggregation by adopting an attention mechanism during node aggregation;
and C2, constructing a semantic control flow graph by taking the semantic vector of each basic block generated by the B4 as the attribute of the control flow graph, taking the semantic control flow graph as input, embedding a network training model by using a Struc2vec graph, and performing aggregation by adopting an attention mechanism during node aggregation.
6. The binary code similarity comparison technique against compile differences as claimed in claim 1, wherein said step D further comprises the steps of:
d1, splicing the generated function grammar graph embedding vector and the function semantic graph embedding vector to train attention parameters;
and D2, respectively multiplying the trained attention parameters by the corresponding graph embedding vectors, and aggregating to generate a final function graph embedding vector.
7. The binary code similarity comparison technique against compilation difference as claimed in claim 1, wherein the step E is specifically as follows:
e1, combining the proposed graph embedding network with a Siamese network, with input given as pairs, wherein the model comprises two identical network structures that share the same parameters;
and E2, calculating the distance between the two graph embedding vectors by using cosine similarity to obtain the similarity between the functions.
CN202011117765.5A 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference Pending CN113010209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011117765.5A CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011117765.5A CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Publications (1)

Publication Number Publication Date
CN113010209A true CN113010209A (en) 2021-06-22

Family

ID=76383603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011117765.5A Pending CN113010209A (en) 2020-10-19 2020-10-19 Binary code similarity comparison technology for resisting compiling difference

Country Status (1)

Country Link
CN (1) CN113010209A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868108A (en) * 2016-03-28 2016-08-17 中国科学院信息工程研究所 Instruction-set-irrelevant binary code similarity detection method based on neural network
CN107357566A (en) * 2017-06-06 2017-11-17 上海交通大学 More framework binary system similar codes detecting systems and method
US20180285101A1 (en) * 2017-03-29 2018-10-04 Technion Research & Development Foundation Limited Similarity of binaries
CN110704103A (en) * 2019-09-04 2020-01-17 中国人民解放军战略支援部队信息工程大学 Binary file semantic similarity comparison method and device based on software genes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOJUN XU et al.: "Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection", CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security *
ZHAO PENGLEI: "Research and Implementation of a Binary Function Similarity Detection Algorithm Based on Graph Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254934A (en) * 2021-06-29 2021-08-13 湖南大学 Binary code similarity detection method and system based on graph matching network
CN113535229A (en) * 2021-06-30 2021-10-22 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113535229B (en) * 2021-06-30 2022-12-02 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113918171A (en) * 2021-10-19 2022-01-11 哈尔滨理工大学 Novel disassembling method using extended control flow graph
CN115758164A (en) * 2022-10-12 2023-03-07 清华大学 Binary code similarity detection method, model training method and device
CN116166538A (en) * 2022-12-28 2023-05-26 山东大学 Cross-version prediction variation testing method and system
CN116166538B (en) * 2022-12-28 2023-08-22 山东大学 Cross-version prediction variation testing method and system
CN115858002A (en) * 2023-02-06 2023-03-28 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN115858002B (en) * 2023-02-06 2023-04-25 湖南大学 Binary code similarity detection method and system based on graph comparison learning and storage medium
CN116578979A (en) * 2023-05-15 2023-08-11 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN116578979B (en) * 2023-05-15 2024-05-31 软安科技有限公司 Cross-platform binary code matching method and system based on code features
CN117608539A (en) * 2023-11-02 2024-02-27 清华大学 Binary code representation vector generation method, binary code representation vector generation device, binary code representation vector generation equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
Ashizawa et al. Eth2vec: learning contract-wide code representations for vulnerability detection on ethereum smart contracts
CN112733137B (en) Binary code similarity analysis method for vulnerability detection
CN110581864B (en) Method and device for detecting SQL injection attack
CN114065199B (en) Cross-platform malicious code detection method and system
Gui et al. Cross-language binary-source code matching with intermediate representations
Shakya et al. Smartmixmodel: machine learning-based vulnerability detection of solidity smart contracts
Liang et al. Neutron: an attention-based neural decompiler
CN110990058A (en) Software similarity measurement method and device
Zhao et al. Semantics-aware obfuscation scheme prediction for binary
Wang et al. Gvd-net: Graph embedding-based machine learning model for smart contract vulnerability detection
CN117032717A (en) Java compiler security risk detection method based on byte code similarity
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
Artuso et al. Binbert: Binary code understanding with a fine-tunable and execution-aware transformer
Huang et al. Deep Smart Contract Intent Detection
Wang et al. Hierarchical attention graph embedding networks for binary code similarity against compilation diversity
Khodadadi et al. HyMo: Vulnerability Detection in Smart Contracts using a Novel Multi-Modal Hybrid Model
Li et al. A simple function embedding approach for binary similarity detection
Li et al. Adabot: Fault-tolerant java decompiler
He et al. Enhancing smart contract security: Leveraging pre‐trained language models for advanced vulnerability detection
Jia et al. FuncFooler: A Practical Black-box Attack Against Learning-based Binary Code Similarity Detection Methods
Song et al. Milo: Attacking Deep Pre-trained Model for Programming Languages Tasks with Anti-analysis Code Obfuscation
Lian et al. A Universal and Efficient Multi-modal Smart Contract Vulnerability Detection Framework for Big Data
Xing et al. HGE-BVHD: Heterogeneous Graph Embedding Scheme of Complex Structure Functions for Binary Vulnerability Homology Discrimination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210622)