CN111078227B

CN111078227B - Binary code and source code similarity analysis method and device based on code characteristics

Info

Publication number: CN111078227B
Application number: CN201911282875.4A
Authority: CN
Inventors: 霍玮; 袁子牧; 冯牧玥; 李丰; 班固; 肖扬
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2021-08-31
Anticipated expiration: 2039-12-13
Also published as: CN111078227A

Abstract

The invention relates to a method and device for analyzing the similarity between binary code and source code based on code features. The method comprises the following steps: extracting code features that are resistant to compilation optimization and are present in both the source code and the binary code; calculating the similarity between the source code and the binary code by matching the extracted code features; and determining, from the calculated similarity, whether the binary code of the software reuses the source code. The extracted code features include: strings, exported functions, string arrays, global constant arrays, global enumeration arrays, complex switch/case structures, and complex if/else structures. The invention provides accurate code similarity detection, is broadly applicable to comparisons between source code and binary code, and overcomes the limitation of existing methods that rely solely on specific strings or symbol information.

Description

Binary code and source code similarity analysis method and device based on code characteristics

Technical Field

The invention belongs to the field of program analysis and relates to a static analysis technique based on feature comparison. Specifically, it checks whether a piece of binary code and a piece of source code contain overlapping feature sets, in order to confirm whether the binary code reuses the source code.

Background

At present, most commercial software is closed-source, that is, its source code is not disclosed. As software complexity has grown in recent years, security auditing of complex closed-source software has become difficult. The invention takes the component reuse that is ubiquitous in complex software as its entry point and identifies the source code reused in binary code. Updates to popular source code libraries often describe the security fixes they contain, so confirming the reuse relationship between binary code and source code makes it possible to further uncover security problems in complex software.

Code reuse comparison can be divided into three categories according to the analysis targets: source code against source code, binary code against binary code, and binary code against source code. Because source code and binary code carry different code information at different granularities, the analysis techniques for each differ greatly, and so do the techniques used for the three categories of comparison.

Feature extraction and feature comparison are the main steps of comparison based on static analysis. Comparison between source codes relies on the code text, from which information such as variable names, data structures, and expressions can be obtained directly, so text analysis can be applied directly; in addition, the source code can be analyzed syntactically and semantically, and information such as the abstract syntax tree, control flow structure, and function call relationships can be extracted for more accurate comparison. Binary code is the lowest-level representation of code in a computer: many semantic features of the code are lost, while low-level features such as stack frame structures are added. When binary codes are compared, it therefore cannot be determined whether a difference stems from a difference in the source code or from a difference in the low-level implementation. For this reason, complex features such as the I/O behavior of basic blocks, execution sequences, and semantic hashes, together with comparison algorithms based on control flow graphs and call graphs, have been designed for comparison between binary codes.

The invention mainly addresses comparison between source code and binary code. Related research results are relatively few, and most can only analyze and compare binary code that retains its symbol information. In particular, many complex software products strip symbol information to make reverse analysis harder, so some of these results cannot be applied well to comparison between source code and binary code. Binary Analysis Tool (BAT) is the first systematic tool proposed in recent years that can identify the reuse relationship between binary code and source code; it identifies open source library code in firmware binaries mainly through constant strings. For executable programs under Windows, however, the strings of the open source library code are not necessarily retained, and some open source libraries released as components may contain no strings usable for comparison, so comparison based on string features alone is unreliable. OSSPolice is a tool for identifying reuse in Android C++/Java code. Part of its work performs feature matching based on strings and exported functions and scales to large analyses through hierarchical indexing. However, whereas Android C++ code exposes its exported functions, under Windows the exported functions of open source components are wrapped into internal function calls, so OSSPolice is not applicable there. FIBER generates semantic features of patches by analyzing the effect of open source library security patches on the code, and matches them against binary code to determine whether the software has been patched; but FIBER assumes that all symbol information in the binary program is known, which is inconsistent with the fact that symbol information is generally unavailable in the binary code of closed-source software.

In summary, existing research results rely on symbol information or string information and cannot broadly handle comparison between source code and binary code; in particular, they apply poorly to executable files under Windows.

Disclosure of Invention

To enable reuse comparison between source code and binary code, and to overcome the limitation of existing methods that rely solely on specific strings or symbol information, the invention selects code features that can be extracted from both the source code and the binary code and that are resistant to compilation optimization, calculates code similarity by comparing and matching these features, and determines whether the binary code of the software reuses an open source code library.

The invention provides a method for extracting and comparing features of source code and binary code that is resistant to compilation optimization.

The technical scheme adopted by the invention is as follows:

A method for analyzing the similarity between binary code and source code based on code features comprises the following steps:

extracting code features that are resistant to compilation optimization and are present in both the source code and the binary code;

calculating the similarity between the source code and the binary code by matching the extracted code features;

and determining, from the calculated similarity, whether the binary code of the software reuses the source code.

Further, two evaluation criteria are used to select the comparison features: whether a feature reflects the core function or algorithm of an open source component, and whether a feature is resistant to the effects of compilation optimization. Because binary code is a low-level language, much of the semantic information in the source code is lost: variable names no longer exist, data structure information is difficult to recover, structures are flattened, and only length information is retained in the instructions that access the data. Meanwhile, the compiler introduces additional compiler behavior and features into the binary code, such as stack protection functions and system exception handling mechanisms. These changes make it harder to compare source code with binary code; the two criteria help to overcome these effects and improve the success rate of the comparison. The selected features are shown in Table 1, where V indicates that the feature is used.

TABLE 1. Selected features and the degree to which they are affected by compilation optimization

a) General features: exported functions and strings. Exported functions and strings are hardly affected by compilation optimization, except where a function is not exported or the strings have been removed.

b) Constant and control flow features: string arrays, global constant arrays, global enumeration arrays, complex switch/case structures, and complex if/else structures. String arrays, global constant arrays, and global enumeration arrays may be collectively referred to as global constant arrays. Unlike an integer constant, a global constant array is not an immediate operand in an assembly instruction but a fixed, contiguous piece of binary data in the data sections of the binary file (the .data and .rdata sections), and it is unaffected by compilation optimization. The two control flow features, the complex switch/case structure and the complex if/else structure, are relatively stable in the code. Here "complex" means that the feature value is longer than that of an ordinary switch/case or if/else structure, so the feature is more distinctive in reuse comparison. Although the specific instructions may be affected by compilation optimization, the core code logic is hardly affected: the distribution of case values over branches in a switch/case structure, and the constant sequence of consecutively nested constant comparisons in an if/else structure.

The invention thus selects 7 features: strings, exported functions, string arrays, global constant arrays, global enumeration arrays, complex switch/case structures, and complex if/else structures.

Further, the 7 compilation-optimization-resistant features extracted from both the source code and the binary code are compared. The 7 code features fall into three types: string-type features (strings, exported functions, and string arrays), numeric features (global constant arrays and global enumeration arrays), and control flow features (complex switch/case structures and complex if/else structures). The three types are represented in completely different ways, so different matching algorithms are required. String-type features use an exact matching algorithm, numeric features use a matching algorithm based on binary segment retrieval, and control flow features use a matching algorithm based on semantic equivalence judgment. Each is described below:

a) String-type features use an exact matching algorithm: the algorithm checks whether the binary code contains a feature identical to the source code feature. It does not consider possible transformations of the feature, which allows fast matching over massive numbers of string-type features;

b) Numeric features use a matching algorithm based on binary segment retrieval: the numeric global array features extracted from the source code are converted into a binary data stream, which is searched for in the data sections of the binary file (the .data and .rdata sections) to determine whether the source code feature exists in the binary file;

c) Control flow features use a matching algorithm based on semantic equivalence judgment: the core semantic information of a switch/case feature is the number of cases, the number of branches, and the distribution of cases over branches within a single switch/case structure. When the source code and binary code switch/case features have the same number of cases, the same number of branches, and the same distribution, they are considered semantically identical and the match succeeds; a source code switch/case feature without its default branch may additionally be matched against the binary code switch/case feature with any one branch removed. The core semantics of an if/else feature is the longest constant comparison path in the function. Under the influence of equivalent instruction replacement during compilation optimization, a large number of constants 0 that are unrelated to the source code are introduced into the if/else features of the binary code; after these are removed, the longest constant comparison paths in the source code and the binary code are matched to obtain the matching result.
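The switch/case rule in item c) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: a feature is modeled as a list of branches, each branch being the list of case values that reach it, and the names canon, rebase, same_distribution, and switch_matches are hypothetical. The sketch also rebases source case values to start from 0, a compiler effect that the detailed description mentions, and the extra branch in the sample binary feature is invented for the example.

```python
from typing import FrozenSet, List

Branches = List[List[int]]  # one inner list of case values per branch


def canon(branches: Branches) -> FrozenSet[FrozenSet[int]]:
    """Order-independent form; equal forms imply equal case and branch counts."""
    return frozenset(frozenset(b) for b in branches)


def rebase(branches: Branches) -> Branches:
    """Shift case values so the smallest becomes 0 (e.g. 100,101,102 -> 0,1,2)."""
    values = [v for b in branches for v in b]
    if not values:
        return branches
    lo = min(values)
    return [[v - lo for v in b] for b in branches]


def same_distribution(src: Branches, binary: Branches) -> bool:
    """Identical distributions, or identical after rebasing the source values."""
    return canon(src) == canon(binary) or canon(rebase(src)) == canon(binary)


def switch_matches(src: Branches, binary: Branches) -> bool:
    """Match directly, or after removing any one branch from the binary feature
    (tolerating a default branch that the source case list does not expose)."""
    if same_distribution(src, binary):
        return True
    return any(
        same_distribution(src, binary[:k] + binary[k + 1:])
        for k in range(len(binary))
    )


source = [[8], [0, 3, 4, 5, 10]]
# Hypothetical binary feature: the first branch plays the role of "default".
binary = [[1, 2, 6, 7, 9], [8], [0, 3, 4, 5, 10]]
print(switch_matches(source, binary))
```

Using sets makes the comparison independent of the order in which branches and case values are listed, which seems a reasonable choice because a compiler is free to reorder jump table targets.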

Further, after feature matching is complete, the similarity between the source code and the binary code can be calculated from the matching result. The core idea is to divide the sum of the weighted scores of the matched code features by the sum of the weighted scores of all features of the matched object. The calculation formula is as follows:

$$Similarity(X, Y) = \frac{\sum_{i \in MF(X, Y)} score(i)}{\sum_{j \in F(Y)} score(j)}$$

where X represents a piece of binary code, Y represents a piece of source code, F(Y) is the set of all code features contained in the source code Y, MF(X, Y) is the set of features matched between the binary code X and the source code Y, i and j are individual code features, and score(i) and score(j) are the weights of the code features i and j.
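The ratio described above (the weight of the matched features divided by the total weight of the source code features) can be sketched in Python as follows. This is a minimal illustration: the feature names, the score values, and the threshold of 0.5 are invented for the example, and the function name similarity is hypothetical.

```python
from typing import Dict, Set


def similarity(matched: Set[str], source_features: Set[str],
               score: Dict[str, float]) -> float:
    """Sum of scores over matched features divided by the sum over F(Y)."""
    total = sum(score[j] for j in source_features)
    if total == 0:
        return 0.0  # no weighted source features: treat as no evidence of reuse
    # Only features that belong to the source library count as matched.
    return sum(score[i] for i in matched & source_features) / total


# Made-up weights for three features of a hypothetical source library Y.
score = {"str:inflate": 1.0, "func:deflateInit": 3.0, "array:crc_table": 4.0}
f_y = set(score)
mf_xy = {"str:inflate", "func:deflateInit"}

sim = similarity(mf_xy, f_y, score)
threshold = 0.5  # user-chosen value, purely illustrative
print(sim, sim >= threshold)
```

Dividing by the weight of the source library's features rather than the binary's means that a large binary that embeds a small library in full can still reach a similarity close to 1.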

Further, the calculation formula of the weight of the code feature is as follows:

In the above formula, x refers to a single code feature, N is the number of source code libraries, and N(x) is the number of open source libraries that contain the code feature x. W(x) is the information weight of the code feature and is calculated as follows:

where GetStrWeight(x) calculates the information weight of a string-type feature from the special characters and special character combinations in the string, taking into account the frequency with which the string occurs across all open source code libraries using the existing TF-IDF algorithm, because when the scale of the compared code is large, the very numerous code features show strong repetition; l_i is the length of the i-th run of consecutive 0 or 1 bits in the binary stream; and GetLength(x) is the length of the feature content of a control flow feature.

Further, the formula Similarity(X, Y) yields the similarity between the binary code segment X and the source code segment Y. The user can set a similarity threshold to decide whether X reuses Y.

Based on the same inventive concept, the invention also provides a device for analyzing the similarity between binary code and source code based on code features, which comprises:

a feature extraction module, responsible for extracting code features that are resistant to compilation optimization and are present in both the source code and the binary code;

a feature matching module, responsible for matching the extracted code features;

and a similarity calculation module, responsible for calculating the similarity between the source code and the binary code and for determining, from the calculated similarity, whether the binary code of the software reuses the source code.

The invention has the beneficial effects that:

The invention designs a compilation-resistant feature extraction and comparison method for source code and binary code. Binary code is machine language and source code is a high-level language; the two differ vastly in how code is expressed and in the code information they contain. The method selects and extracts corresponding compilation-resistant code features, matches them, and derives the reuse relationship between the binary code and the source code by calculating their similarity, providing accurate code similarity detection. The method is broadly applicable to comparison between source code and binary code and overcomes the limitation of existing methods that rely solely on specific strings or symbol information.

Drawings

FIG. 1 is a flow chart of the steps of the method of the present invention.

FIG. 2 is an example of switch/case and if/else structured code fragments.

FIG. 3 is an example of calculating W(x) for global constant arrays.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

The overall flow of the method for analyzing the similarity between binary code and source code based on code features is shown in FIG. 1. The method first extracts code features that are resistant to compilation optimization and are present in both the source code and the binary code, then calculates the similarity between the source code and the binary code by matching the extracted features, and finally determines from the calculated similarity whether the binary code of the software reuses the source code.

In this embodiment, 7 compilation-optimization-resistant features are extracted from both the source code and the binary code and compared; feature extraction is the first and most critical implementation step. The steps for extracting source code features and binary code features are described below.

(1) Source code feature extraction: LLVM and Clang form a mature open source C/C++ compiler toolchain that provides many source code analysis interfaces, and the invention obtains source code features through these interfaces. Source code feature extraction is implemented with tools developed on the open source compiler framework Clang and LLVM. The source code contains all 7 types of code features: strings, exported functions, string arrays, global constant arrays, global enumeration arrays, complex switch/case structures, and complex if/else structures. The first 5 types can be extracted from the abstract syntax tree that Clang generates after parsing the source code. A string is a token (Token) of type LITERAL in the abstract syntax tree, and a function name is a cursor (Cursor) of type FUNCTION_DECL or CXX_METHOD. The three array features have a relatively complex structure in the abstract syntax tree and, as shown in Table 2, must be identified through a multi-layer structure.

TABLE 2. Corresponding structure of source code features in the abstract syntax tree

(2) Binary code feature extraction: the features extracted from binary code differ from those extracted from source code. The three array features cannot be recovered from binary code, because binary code loses information such as data structures. For these three types of features, therefore, the method does not extract them directly from the binary code; instead, it converts the features extracted from the source code into a binary data stream and searches the data sections of the binary file directly to determine whether the features exist. The remaining features (exported functions, strings, switch/case structures, and if/else structures) are extracted as follows. String features are extracted through IDAPython interfaces such as get_strlist_qty. Exported function features are extracted by a PE file structure parsing tool. Switch/case features are extracted from the control flow graph provided by IDA through functions such as get_switch_info. If/else features are obtained by analyzing the basic blocks and jump relationships in the control flow graph: for each function, backtracking starts from the return basic blocks (Return Block). If all predecessors of a basic block have already been analyzed and their longest if/else comparison constant sequences obtained, then the block's longest sequence is the longest constant sequence among all its predecessors plus the constants in the comparison instructions that the block itself contains. If some predecessors have not yet been analyzed, they are analyzed first. After all basic blocks have been analyzed, the longest comparison constant sequence of the function is the longest among the sequences of its return basic blocks.
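The backtracking rule for if/else features can be sketched on a toy control flow graph. The Python below is a minimal illustration under several assumptions: the graph is supplied as plain dictionaries rather than obtained from IDA, back edges of loops are simply skipped (the text does not say how loops are handled), and the block names and the function name longest_constant_sequence are hypothetical.

```python
from typing import Dict, List


def longest_constant_sequence(preds: Dict[str, List[str]],
                              consts: Dict[str, List[int]],
                              return_blocks: List[str]) -> List[int]:
    """Longest sequence of comparison constants ending at any return block."""
    memo: Dict[str, List[int]] = {}
    on_stack = set()

    def longest(block: str) -> List[int]:
        if block in memo:
            return memo[block]
        on_stack.add(block)
        best: List[int] = []
        for p in preds.get(block, []):
            if p in on_stack:  # back edge of a loop: skipped in this sketch
                continue
            cand = longest(p)  # analyze unanalyzed predecessors first
            if len(cand) > len(best):
                best = cand
        on_stack.discard(block)
        # Longest predecessor sequence plus this block's own constants.
        memo[block] = best + consts.get(block, [])
        return memo[block]

    return max((longest(r) for r in return_blocks), key=len, default=[])


# Toy graph: four nested comparisons, each of which may also jump to "ret".
preds = {"entry": [], "b1": ["entry"], "b2": ["b1"], "b3": ["b2"],
         "ret": ["entry", "b1", "b2", "b3"]}
consts = {"entry": [0x80], "b1": [0x800], "b2": [0x10000], "b3": [0x200000]}
path = longest_constant_sequence(preds, consts, ["ret"])
print([hex(c) for c in path])
```

Memoizing the result per block means each basic block is analyzed once, which matches the description's requirement that a block is resolved only after its predecessors.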

(3) Comparison of source code and binary code features: code feature matching is implemented in three ways: exact matching, matching based on binary segment retrieval, and matching based on semantic equivalence judgment.

a) String-type features use an exact matching algorithm: the strings in the source code and the binary code are checked for complete agreement;

b) Numeric features use a matching algorithm based on binary segment retrieval: the two numeric features in the source code, global constant arrays and global enumeration arrays, are converted into a binary data stream, which is searched for directly in the data sections of the binary file to determine whether they match;

c) Control flow features use a matching algorithm based on semantic equivalence judgment: the method is as described in the Disclosure of Invention above.
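The binary segment retrieval in item b) amounts to serializing an array and searching for the resulting bytes. The Python sketch below illustrates the idea under stated assumptions: elements are unsigned, 4 bytes wide, and little-endian (typical for x86 Windows binaries, but not stated in the text), and the section contents are fabricated sample bytes rather than data read from a real PE file. The function names array_to_bytes and array_in_sections are hypothetical.

```python
import struct
from typing import Dict, List


def array_to_bytes(values: List[int], width: int = 4,
                   little_endian: bool = True) -> bytes:
    """Serialize an array of unsigned integers as it would be laid out in memory."""
    code = {1: "B", 2: "H", 4: "I", 8: "Q"}[width]
    prefix = "<" if little_endian else ">"
    return struct.pack(f"{prefix}{len(values)}{code}", *values)


def array_in_sections(values: List[int], sections: Dict[str, bytes],
                      width: int = 4) -> bool:
    """True if the serialized array occurs in any of the given data sections."""
    needle = array_to_bytes(values, width)
    return any(needle in data for data in sections.values())


table = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A]
# Fabricated section contents standing in for a parsed binary file.
sections = {
    ".data": b"\x00" * 16,
    ".rdata": b"\x00" * 8 + array_to_bytes(table) + b"\xff" * 4,
}
print(array_in_sections(table, sections))
print(array_in_sections([1, 2, 3], sections))
```

In practice the element width and signedness would come from the array's declared type in the source code; trying several widths for the same array is one possible way to cope with uncertainty about the compiled layout.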

FIG. 2 is an example of switch/case and if/else structured code fragments. In the switch/case structure, the number of cases is 6, the number of branches is 2, and the distribution is [[8], [0, 3, 4, 5, 10]]; that is, the value 8 enters the first branch and the values 0, 3, 4, 5, and 10 enter the second branch. Because the source code may contain a default branch whose values cannot be obtained from the case list of the source code, the source code switch/case feature, which lacks the default branch, is matched against the binary code switch/case feature with any one branch removed. In addition, during compilation the case values may be optimized to start from 0; for example, the case value sequence [100, 101, 102] may be optimized to [0, 1, 2]. Therefore, the branch distributions are considered to match when they are identical, or when the source code distribution becomes identical to the binary code distribution after its values are rebased to start from 0. In summary, when the source code and binary code switch/case features have the same number of cases, the same number of branches, and matching distributions, allowing for one branch to be removed from the binary code feature when the source code feature lacks the default branch, the two features are considered semantically identical and the match succeeds. For the if/else feature, the longest constant comparison path of the code fragment in FIG. 2 is [0x80, 0x800, 0x10000, 0x200000]. The if/else features of binary code are affected by equivalent instruction replacement during compilation optimization, which can introduce a large number of constants 0 that are unrelated to the source code. For example, the source code test if (a == b), which checks whether two values are equal, may be converted into the binary instructions sub r1, r2 and cmp r1, 0, which check whether the difference between the two values is 0, thereby introducing the constant 0. The constants 0 therefore need to be removed from the longest constant comparison path of the binary code. After the binary code features have been processed in this way, the longest constant comparison paths in the source code and the binary code can be matched to obtain the matching result.
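The if/else comparison step can be sketched as follows, assuming the longest constant comparison paths have already been extracted on both sides. The use of exact equality after zero removal is an assumption (the text does not say whether partial matches are accepted), and the function names strip_zeros and ifelse_matches are hypothetical.

```python
from typing import List


def strip_zeros(path: List[int]) -> List[int]:
    """Drop the constants 0 that instruction replacement may have introduced."""
    return [c for c in path if c != 0]


def ifelse_matches(source_path: List[int], binary_path: List[int]) -> bool:
    """Compare the source path with the zero-stripped binary path.

    Zeros are removed from the binary side only, as the text describes, so a
    source path that genuinely compares against 0 would need extra handling.
    """
    return bool(source_path) and source_path == strip_zeros(binary_path)


source_path = [0x80, 0x800, 0x10000, 0x200000]
# Hypothetical binary path with compiler-introduced zeros interleaved.
binary_path = [0x80, 0, 0x800, 0, 0x10000, 0x200000, 0]
print(ifelse_matches(source_path, binary_path))
```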

(4) Similarity calculation: the main step is calculating W(x). For string-type features, GetStrWeight(x) calculates an information weight based on frequency of occurrence using the existing TF-IDF method; for control flow features, GetLength(x) is the length of the feature content. For numeric features, FIG. 3 shows the calculation for two example global constant arrays, [0x1, 0x10, 0x100, 0x1000] and [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a]. The more frequently the bits switch between 0 and 1, that is, the shorter the runs of consecutive 0s or 1s, the lower the probability that the data occurs and the greater the amount of information it contains.
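The run lengths l_i that feed the numeric weight can be made concrete with a short Python sketch. The bit layout is an assumption: each element is written as a 32-bit value, most significant bit first, and the elements are concatenated in array order. The sketch stops at the run lengths and does not attempt to compute W(x) itself; the function names bit_string and run_lengths are hypothetical.

```python
from itertools import groupby
from typing import List


def bit_string(values: List[int], width_bits: int = 32) -> str:
    """Concatenate the fixed-width binary representation of each element."""
    return "".join(format(v, f"0{width_bits}b") for v in values)


def run_lengths(bits: str) -> List[int]:
    """Lengths l_i of the maximal runs of identical bits, in order."""
    return [len(list(group)) for _, group in groupby(bits)]


simple = run_lengths(bit_string([0x1, 0x10, 0x100, 0x1000]))
dense = run_lengths(bit_string([0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A]))

print(simple)
print(len(simple), len(dense))
```

Under this layout the first array yields a handful of long runs, while the second yields many short ones, which is consistent with the statement that frequent switching between 0 and 1 indicates higher information content.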

After the similarity is calculated, a similarity threshold can be set to decide whether the binary code segment reuses the source code segment.

To illustrate the effect of the method, 1000 commercial software products and 264 common open source library projects were selected. Feature extraction and comparison of the source code and binary code yielded the following 10 most frequently reused open source library projects.

TABLE 3. The 10 most frequently reused open source library projects

Open source library name	Reuse count	Open source library name	Reuse count
zlib	384	tinyxml	106
libjpeg	257	libpng	94
sqlite	138	ffmpeg	94
openssl	124	libtiff	84
qt	112	libjpeg	77

Based on the same inventive concept, another embodiment of the present invention provides a device for analyzing similarity between binary code and source code based on code characteristics, comprising:

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory storing a computer program configured to be executed by the processor and a processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A binary code and source code similarity analysis method based on code characteristics is characterized by comprising the following steps:

extracting code features which have anti-compilation optimization characteristics and exist together in the source code and the binary code; binary code feature extraction, which is to convert the features extracted from the source code into a binary data stream, and directly search the data segment of a binary file to judge whether the binary code features exist;

calculating the similarity between the source code and the binary code by performing feature matching on the extracted code features; the feature matching of the extracted code features comprises the following steps: the character string type features adopt an accurate matching algorithm, the digital type features adopt a matching algorithm based on binary segment retrieval, and the control flow type features adopt a matching algorithm based on semantic equivalence judgment;

2. The method of claim 1, wherein, in extracting the code features that are resistant to compilation optimization and are present in both the source code and the binary code, the comparison features are selected using two evaluation criteria: whether a feature reflects the core function or algorithm of an open source component, and whether a feature is resistant to the effects of compilation optimization.

3. The method of claim 1, wherein the extracted code features comprise: strings, exported functions, string arrays, global constant arrays, global enumeration arrays, complex switch/case structures, and complex if/else structures.

4. The method of claim 1, wherein the matching algorithm based on binary segment search converts the digital features extracted from the source code into a binary data stream, and searches in a data segment of the binary file to determine whether the source code features exist in the binary file.

5. The method according to claim 1, wherein the matching algorithm based on semantic equivalence determination comprises:

for the Switch/case characteristics, when the source code Switch/case characteristics are consistent with the case quantity of the binary code Switch/case characteristics, the branch quantity is the same, and the source code Switch/case characteristics without default branches can be matched with any binary code Switch/case characteristic minus one branch, the binary code and the Switch/case characteristics of the source code are considered to have the same semantics, and the matching is successful;

and for the If/else characteristics, matching the longest constant comparison path in the source code and the binary code to obtain a matching result.

6. The method of claim 1, wherein the similarity between the source code and the binary code is calculated by dividing the sum of the weighted scores of the code feature sets on the match by the sum of the weighted scores of all feature sets of the matched object; the calculation formula of the similarity is as follows:

7. The method of claim 6, wherein the weight of the code feature is calculated by:

where x refers to a single code feature, N is the number of source code libraries, and N(x) is the number of open source libraries containing the code feature x;

W(x) is the information weight of the code feature, and the calculation formula is:

wherein GetStrWeight(x) calculates the information weight of a string-type feature from the special characters and special character combinations in the string, with the frequency of occurrence of the string across all open source code libraries calculated on the basis of the existing TF-IDF algorithm; l_i is the length of the i-th run of consecutive 0 or 1 bits in the binary stream; and GetLength(x) is the length of the feature content of a control flow feature.

8. A device for analyzing similarity between binary code based on code characteristics and source code by using the method of any one of claims 1 to 7, comprising:

9. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.