CN117473494A - Method and device for determining homologous binary files, electronic equipment and storage medium

Info

Publication number: CN117473494A
Application number: CN202310666471.5A
Authority: CN
Inventors: 袁明亮; 马乐霖
Original assignee: Ardsec Beijing Technology Co ltd
Current assignee: Ardsec Beijing Technology Co ltd
Priority date: 2023-06-06
Filing date: 2023-06-06
Publication date: 2024-01-30
Anticipated expiration: 2043-06-06
Also published as: CN117473494B

Abstract

The application provides a method and a device for determining homologous binary files, an electronic device, and a storage medium. The method for determining homologous binary files comprises: acquiring a plurality of binary files; converting the code in each binary file into a preset low-level language to obtain a first intermediate file corresponding to each binary file, wherein instructions in the preset low-level language have a mapping relation with instructions in a plurality of different types of high-level languages; and determining, according to the similarity of the code in two first intermediate files, whether the binary files corresponding to those two first intermediate files belong to the same source code. In this way, whether two binary files are homologous can be determined accurately, which improves the accuracy of homology analysis of binary files.

Description

Method and device for determining homologous binary files, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer security technologies, and in particular, to a method and apparatus for determining a homologous binary file, an electronic device, and a storage medium.

Background

With the development of information technology, more and more malicious programs are spread in embedded devices with ARM, MIPS and other architectures. That is, for the same malicious source code, binary programs of different architectures are compiled, and then spread in devices of corresponding architectures, so as to realize malicious behaviors. In order to combat such malicious behavior, similarity analysis needs to be performed on the binary programs to determine whether the binary programs belong to the same malicious source code, so as to implement homology tracking of the binary programs, thereby more effectively avoiding the spread of the same malicious source code in various embedded devices.

At present, the main method for similarity analysis of cross-architecture binary programs is as follows. For each binary file to be analyzed, the code in the binary file is first converted into a VEX-IR representation using PyVEX, resulting in a VEX-IR file. VEX-IR is an architecture-independent, high-level intermediate representation into which machine code can be converted. Then, each function in the VEX-IR file is executed in simulation, and semantic feature sequences comprising input and output values, condition comparison values and library function call records are extracted. Finally, the longest common subsequence (LCS) algorithm is applied to the semantic feature sequences of two binary files to calculate their similarity, from which the similarity between the two binary files is obtained and it is determined whether the two binary files belong to the same source code.

However, when the code in the binary files is converted into the VEX-IR representation using PyVEX, binary files of different architectures call different instruction sets. As a result, the specific instructions that appear in the VEX-IR files differ even for the same behavior. These instruction differences lead to differences among the semantic features extracted from the VEX-IR files, so that homologous binary files may fail to be judged homologous, which reduces the accuracy of homology tracking of binary files.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for determining homologous binary files, electronic equipment and a storage medium, so as to improve accuracy of similarity comparison between binary files and further improve accuracy of homologous tracking of the binary files.

In order to solve the technical problems, the embodiment of the application provides the following technical scheme:

A first aspect of the present application provides a method for determining a homologous binary file, the method comprising: acquiring a plurality of binary files; converting the code in each binary file into a preset low-level language to obtain a first intermediate file corresponding to each binary file, wherein instructions in the preset low-level language have a mapping relation with instructions in a plurality of different types of high-level languages; and determining, according to the similarity of the code in two first intermediate files, whether the binary files corresponding to those two first intermediate files belong to the same source code.

A second aspect of the present application provides a device for determining a homologous binary file, the device comprising: the acquisition module is used for acquiring a plurality of binary files; the conversion module is used for converting codes in each binary file into a preset low-level language to obtain a first intermediate file corresponding to each binary file, wherein the instructions in the preset low-level language have a mapping relation with the instructions in a plurality of different types of high-level languages; and the determining module is used for determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of codes in the two corresponding first intermediate files.

A third aspect of the present application provides an electronic device, including: a processor, a memory, a bus; the processor and the memory complete communication with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the method of the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium, the storage medium comprising: a stored program; wherein the program, when run, controls a device in which the storage medium is located to perform the method in the first aspect.

Compared with the prior art, in the method for determining homologous binary files provided in the first aspect of the application, after the binary files to be subjected to homology judgment are obtained, the binary files are first converted into a preset low-level language, similarity is then calculated between the code of two files in the preset low-level language, and whether the two files are homologous is determined according to the similarity calculation result. Once the binary files are converted into the preset low-level language, their content is no longer purely machine-readable, so similarity can be calculated between the two files, and the calculation result accurately represents the degree of similarity between the two corresponding binary files. Whether the two binary files are homologous can therefore be determined accurately, which improves the accuracy of homology analysis of binary files.

The determining apparatus for a homologous binary file provided in the second aspect of the present application, the electronic device provided in the third aspect, and the computer-readable storage medium provided in the fourth aspect have the same or similar advantageous effects as the determining method for a homologous binary file provided in the first aspect.

Drawings

The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts and in which:

FIG. 1 is a flowchart illustrating a method for determining a homologous binary file according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an overall architecture of a method for determining a homologous binary file according to an embodiment of the present application;

FIG. 3 is a second flow chart of a method for determining a homologous binary file according to an embodiment of the present application;

FIG. 4 is a schematic diagram showing the effect of a method for determining a homologous binary file according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a determining device for a homologous binary file according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.

At present, homology analysis of binary files mainly converts the binary files into a VEX-IR representation, obtains semantic feature sequences of the binary files under that representation, and determines whether the binary files to be analyzed are homologous from the similarity of the semantic feature sequences. However, because binary files of different architectures call different instruction sets, the specific instructions that appear under the VEX-IR representation differ even for the same behavior. These instruction differences lead to differences between the semantic features extracted from the files, so that homologous binary files may fail to be judged homologous, which reduces the accuracy of binary file homology tracking.

The inventor finds that, under the VEX-IR representation, the difference in the semantic features of the files for the same behavior is a main reason for the low accuracy of binary file homology analysis. If the binary files can instead be converted into an expression closer to assembly language, calculating the similarity between the files under that expression can yield a more accurate similarity result, which in turn improves the accuracy of homology analysis of the binary files.

In view of this, the embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for determining a homologous binary file. After the binary files to be subjected to homology judgment are obtained, the binary files are first converted into a preset low-level language, similarity is then calculated between the code of two files in the preset low-level language, and whether the two files are homologous is determined according to the similarity calculation result. Once the binary files are converted into the preset low-level language, their content is no longer purely machine-readable, so similarity can be calculated between the two files, and the calculation result accurately represents the degree of similarity between the two corresponding binary files. Whether the two binary files are homologous can therefore be determined accurately, which improves the accuracy of homology analysis of binary files.

First, a detailed description is given of a method for determining a homologous binary file provided in the embodiment of the present application.

FIG. 1 is a schematic flow chart of a method for determining a homologous binary file according to an embodiment of the present application. Referring to FIG. 1, the method may include:

S101: A plurality of binary files is obtained.

The plurality of binary files here are binary files awaiting homology analysis. "Plurality" may refer to two, three, four, and so on; the specific number is not limited herein. When there are two binary files, it is analyzed whether the two binary files are homologous. When there are three or more, it may be analyzed whether the binary files are homologous with one another, that is, whether they belong to the same source code.

S102: The code in each binary file is converted into a preset low-level language to obtain a first intermediate file corresponding to each binary file.

Here, instructions in the preset low-level language have a mapping relation with instructions in a plurality of different types of high-level languages.

The preset low-level language may be any existing non-binary language. For example: LLVM IR language, VEX-IR language, etc. Specific content of the preset low-level language is not limited herein.

It should be noted here that LLVM is an open-source, cross-platform compiler infrastructure. It contains a series of modular compiler components and toolchains used to develop compiler front ends and back ends, and it provides a well-specified intermediate representation of code that can serve as a back end for multiple languages. LLVM IR is short for LLVM Intermediate Representation. It is a low-level language with an instruction set similar to that of a reduced instruction set computer (RISC); different high-level languages can be mapped onto LLVM IR, and it is close to assembly language.

After a plurality of binary files are acquired, each binary file needs to be converted into a preset low-level language to obtain a corresponding first intermediate file. For example: after the binary file a and the binary file b are obtained, converting the binary file a into a preset low-level language to obtain a first intermediate file a, and converting the binary file b into the preset low-level language to obtain the first intermediate file b. And after the binary files are converted into the preset low-level language, correspondingly obtaining the first intermediate files.

S103: Whether the binary files corresponding to two first intermediate files belong to the same source code is determined according to the similarity of the code in those two first intermediate files.

After each binary file is converted into a preset low-level language to obtain each first intermediate file, similarity calculation can be performed between codes in each first intermediate file, and whether the corresponding two binary files are homologous is determined according to a similarity calculation result.

For example, the binary file a is converted into the first intermediate file a through a preset low-level language, and the binary file b is converted into the first intermediate file b through a preset low-level language. Then, the similarity of the codes in the first intermediate file a and the first intermediate file b is calculated. When the similarity reaches the preset similarity, the first intermediate file a and the first intermediate file b are similar, and the binary file a and the binary file b are similar correspondingly, so that the binary file a and the binary file b are determined to belong to the same source code. And when the similarity does not reach the preset similarity, the first intermediate file a and the first intermediate file b are not similar, and the binary file a and the binary file b are not similar correspondingly, so that the binary file a and the binary file b are determined not to belong to the same source code.
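The threshold decision described above can be sketched as follows. This is a minimal illustration rather than the method of the application: the function names, the `threshold` value, and the placeholder `line_overlap` similarity are all assumptions, since the text does not fix a particular similarity measure at this point.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def find_homologous_pairs(
    intermediate_files: Dict[str, str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Return every pair of files whose similarity reaches the threshold."""
    pairs = []
    # Compare each unordered pair of first intermediate files exactly once.
    for (name_a, code_a), (name_b, code_b) in combinations(
        intermediate_files.items(), 2
    ):
        score = similarity(code_a, code_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


def line_overlap(code_a: str, code_b: str) -> float:
    """Placeholder similarity (an assumption): shared distinct lines
    divided by all distinct lines of the two files."""
    lines_a, lines_b = set(code_a.splitlines()), set(code_b.splitlines())
    if not lines_a and not lines_b:
        return 1.0
    return len(lines_a & lines_b) / len(lines_a | lines_b)
```

Any other similarity function with the same signature could be passed in instead of `line_overlap`; the preset similarity corresponds to the `threshold` argument.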

As can be seen from the foregoing, in the method for determining the homologous binary file according to the embodiment of the present application, after the binary files to be subjected to homology judgment are obtained, the binary files are first converted into a preset low-level language, similarity is then calculated between the code of two files in the preset low-level language, and whether the two files are homologous is determined according to the similarity calculation result. Once the binary files are converted into the preset low-level language, their content is no longer purely machine-readable, so similarity can be calculated between the two files, and the calculation result accurately represents the degree of similarity between the two corresponding binary files. Whether the two binary files are homologous can therefore be determined accurately, which improves the accuracy of homology analysis of binary files.

Further, as an extension to the above step S103, after language conversion, the codes in each file still have differences on specific instructions for the same behavior, and these differences are mainly codes in the files that are irrelevant to the running logic, and these codes may be referred to as redundant codes. The redundant codes in the files are deleted or modified into a unified format, so that subsequent similarity calculation can be facilitated.

Specifically, before the above step S103, the method may include:

Step A1: Redundant code in the code of each first intermediate file is determined.

Here, the redundant code is code that is irrelevant to the actual running logic of the corresponding binary file.

Because the binary files have different architectures, the instruction sets they call differ. Even after the binary files are converted into the preset low-level language, the specific instructions that carry the same behavior differ between the first intermediate files, and these differences are generally code that is irrelevant to the actual running logic of the binary files, namely redundant code. In practical applications, the redundant code may be code that is never executed while the binary file runs, specific running parameters, and the like. The details of the redundant code are not limited herein.

Step A2: deleting or modifying the redundant codes in each first intermediate file into a preset format to obtain a second intermediate file corresponding to each first intermediate file.

After determining the redundant codes in each first intermediate file, the redundant codes in each first intermediate file can be selected to be deleted directly because the redundant codes are irrelevant to the actual running logic of the binary file, but affect the comparison of the similarity of the subsequent files. Of course, in order to ensure the integrity of the content in the first intermediate files, the redundant codes in each first intermediate file may also be modified to a preset format. The preset format may be a format other than the format used by each first intermediate file, or may be a format used by a certain first intermediate file. Specific content of the preset format is not limited herein.

Accordingly, the step S103 may include: and determining whether binary files corresponding to the two corresponding second intermediate files belong to the same source code according to the similarity of codes in the two corresponding second intermediate files.

According to the above, the redundant codes irrelevant to the actual running logic of the corresponding binary files are determined from the first intermediate files, and deleted or modified into the preset format, so that the codes in the first intermediate files can be unified quickly, the unified efficiency of the code expression of the first intermediate files is quickened under the condition that the substantial content of the first intermediate files is not influenced, and the efficiency of the homologous analysis of the binary files is further improved.

Further, as a refinement of the above step A2, in practical applications the redundant code may be at least one of: an instruction that is never executed, a bitwise operation instruction that is irrelevant to the computation, a Switch instruction, a memory reference instruction, a debug instruction that does not change the execution flow, and an unused global variable. Depending on its specific type, the redundant code may be deleted from the first intermediate file or rewritten.

Specifically, the above step A2 may include at least one of the following steps (which steps are included is determined according to the specific content of the redundant code):

Step A21: The instructions that are never executed in each first intermediate file are deleted.

When the redundant code includes instructions that are never executed, the first intermediate file may contain instructions that are not executed while the binary file runs; these may be referred to as dead code. Dead code serves no practical purpose in the operation of the binary file, but it differs between first intermediate files and so affects the comparison of file similarity. The dead code in the first intermediate file can therefore be deleted, which also reduces the data volume of the first intermediate file.

Step A22: The bitwise operation instructions that are irrelevant to the computation in each first intermediate file are deleted.

When the redundant code includes bitwise operation instructions that are irrelevant to the computation, the first intermediate file may contain bitwise instructions that do not determine the actual running logic of the binary file but differ between first intermediate files and so affect the comparison of file similarity. These instructions may be referred to as bit-tracking dead code. To reduce the data volume of the first intermediate file, the bit-tracking dead code that appears in it may be deleted.

Step A23: the Switch instruction in each first intermediate file is rewritten as a branch instruction.

When the redundant code includes a Switch instruction, the Switch instruction is related to the actual running logic of the binary file. However, in different first intermediate files the same behavior is sometimes expressed as a Switch instruction and sometimes as a branch instruction (e.g., if/else), which affects the accuracy of the subsequent similarity comparison. Because branch instructions are easier to compare for similarity than Switch instructions, the Switch instructions in the first intermediate file are rewritten as branch instructions, which eliminates the Switch structures in the code and the differences caused by different jump instructions.

Step A24: the memory reference instruction in each first intermediate file is rewritten as a register reference instruction.

When the redundant code includes memory reference instructions, the memory reference instructions are related to the actual running logic of the binary file. However, in different first intermediate files the same behavior is sometimes expressed with memory reference instructions and sometimes with register reference instructions, which affects the accuracy of the subsequent similarity comparison. Because register reference instructions are easier to compare for similarity than memory reference instructions, the memory reference instructions in the first intermediate file are rewritten as register reference instructions, for example for Alloca instructions that are only loaded from and stored to.

Step A25: The debug instructions that do not change the execution flow in each first intermediate file are deleted.

When the redundant code includes debug instructions that do not change the execution flow, these instructions affect neither the execution flow of the file itself nor the judgment of the similarity of execution logic between files. They do, however, burden the file similarity analysis and increase the amount of similarity computation, so the debug instructions that do not change the execution flow in the first intermediate file can be deleted.

Step A26: the unused global variables in each first intermediate file are deleted.

When the redundant code includes an unused global variable, the global variable is not used while the binary file runs, that is, it has no effect on the actual running of the binary file. Therefore, the unused global variables in the first intermediate file can be deleted, which reduces the data volume of the first intermediate file.

As described above, deleting or rewriting the redundant code in the first intermediate file according to its specific type reduces the data volume of the first intermediate file without affecting its running logic, which improves the efficiency of similarity comparison between intermediate files and, in turn, the efficiency of homology analysis of binary files.
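Two of the simpler deletions, steps A25 and A26, can be sketched as plain text filters. This is an illustration only: it assumes the first intermediate file is textual LLVM IR in which debug information appears as `llvm.dbg.*` intrinsic calls and global variables are defined on lines of the form `@name = ... global ...`. The function names are hypothetical, and a real pipeline would more likely rely on the compiler infrastructure's own optimization passes than on regular expressions.

```python
import re
from typing import List

# Matches a global variable definition such as "@counter = global i32 0".
GLOBAL_DEF = re.compile(r"^@([\w.$-]+)\s*=.*\bglobal\b")


def strip_debug_calls(ir_lines: List[str]) -> List[str]:
    """Drop lines that mention an llvm.dbg.* intrinsic (calls and the
    intrinsic declarations themselves), as in step A25."""
    return [line for line in ir_lines if "@llvm.dbg." not in line]


def strip_unused_globals(ir_lines: List[str]) -> List[str]:
    """Drop global variable definitions that nothing refers to (step A26)."""
    text = "\n".join(ir_lines)
    kept = []
    for line in ir_lines:
        match = GLOBAL_DEF.match(line)
        if match:
            name = re.escape(match.group(1))
            # Count every mention of @name, including the definition itself.
            mentions = len(re.findall(r"@" + name + r"(?![\w.$-])", text))
            if mentions == 1:
                continue  # only the definition mentions it, so it is unused
        kept.append(line)
    return kept
```

A single pass does not catch a global that is referenced only by another unused global; repeating the filter until nothing changes would cover that case.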

Further, as a refinement of the above step S103, when performing similarity calculation on the corresponding two files, since the core logic of file operation may be embodied by a function, the calculation of the similarity of the corresponding two files can be quickly implemented by the similarity calculation of the function in the corresponding two files.

Specifically, the step S103 may include:

Step B: Whether the binary files corresponding to two first intermediate files belong to the same source code is determined according to the similarity of the functions in those two first intermediate files.

Because the functions in the file can directly reflect the running logic of the file, semantic feature extraction is not needed, and errors caused by the semantic feature extraction are avoided. Therefore, the similarity of the functions in the corresponding two first intermediate files can be calculated, the similarity of the corresponding two first intermediate files can be determined, the similarity of the corresponding two binary files can be further determined, and whether the corresponding two binary files are homologous or not can be further determined.

It should be noted that, if only one function exists in each of the two corresponding first intermediate files, the similarity calculation is directly performed on the functions in each of the two corresponding first intermediate files. And if a plurality of functions exist in the corresponding two first intermediate files, each function in a certain first intermediate file and each function in other first intermediate files need to be subjected to pairwise comparison calculation. And if the corresponding two functions are highly similar in the corresponding two first intermediate files, determining the homology of the binary files corresponding to the corresponding two first intermediate files.

Taking two first intermediate files, and two functions exist in the two first intermediate files as an example, it is assumed that a function a1 and a function a2 exist in the first intermediate file a, and a function b1 and a function b2 exist in the first intermediate file b. First, similarity calculation is performed on the function a1 and the function b1, and a similarity 1 is obtained. Then, similarity calculation is performed on the function a1 and the function b2, and similarity 2 is obtained. Then, similarity calculation is performed on the function a2 and the function b1, and similarity 3 is obtained. And finally, carrying out similarity calculation on the function a2 and the function b2 to obtain similarity 4. When the similarity 1 between the function a1 and the function b1 and the similarity 4 between the function a2 and the function b2 are higher, the first intermediate file a and the first intermediate file b are highly similar, and the same behavior is executed to a great extent, so that the two corresponding binary files are determined to belong to the same source code.
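The pairwise comparison in this example can be sketched as follows. The names `pairwise_scores`, `best_matches` and `files_match` are hypothetical, and the final rule (every function of file a must have a sufficiently similar counterpart in file b) is one possible reading of "highly similar", not a rule stated in the text.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[Tuple[str, str], float]


def pairwise_scores(
    functions_a: Dict[str, str],
    functions_b: Dict[str, str],
    similarity: Callable[[str, str], float],
) -> Scores:
    """Compare every function of file a with every function of file b."""
    return {
        (name_a, name_b): similarity(body_a, body_b)
        for name_a, body_a in functions_a.items()
        for name_b, body_b in functions_b.items()
    }


def best_matches(scores: Scores) -> Dict[str, Tuple[str, float]]:
    """For each function of file a, keep its most similar function in file b."""
    best: Dict[str, Tuple[str, float]] = {}
    for (name_a, name_b), score in scores.items():
        if name_a not in best or score > best[name_a][1]:
            best[name_a] = (name_b, score)
    return best


def files_match(scores: Scores, threshold: float) -> bool:
    """Assumed rule: every function of file a has a counterpart in file b
    whose similarity reaches the threshold."""
    best = best_matches(scores)
    return bool(best) and all(score >= threshold for _, score in best.values())
```

With two functions per file, `pairwise_scores` produces the four values called similarity 1 to similarity 4 above, keyed by the pair of function names.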

As described above, the functions in the first intermediate file represent the running logic of the binary file simply and clearly and can be extracted directly from the corresponding file. The efficiency of file similarity calculation can therefore be improved while its accuracy is ensured, which further improves the efficiency of binary file homology analysis.

Further, as a refinement of the step B, when performing similarity calculation on the functions in the corresponding two first intermediate files, the contents in the two functions are not directly compared one by one from beginning to end, because in such a comparison method, once one more character or one less character is in the middle of one function, the comparison of the subsequent characters is completely misplaced, and the accuracy of function similarity calculation is reduced. Thus, the function needs to be divided into sections, and pairwise comparisons are made by each section.

Specifically, the step B may include:

Step B1: A plurality of character strings are extracted from each function of each first intermediate file using a sliding window with a preset number of characters.

For each function in each first intermediate file, a sliding window with a preset number of characters is moved one character at a time from the first character of the function to the last character of the function. The characters selected by the window at each position form one character string (which can be regarded as a feature vector). The whole sliding process thus yields a plurality of character strings.

In practical applications, a function may include a plurality of instructions, each of which occupies one line in the code of the file. Therefore, when extracting the character strings of a function, a sliding window operation of size 3 may be performed on each function in units of instructions. Of course, the size of the sliding window, i.e., the number of characters it contains, may be any other specific value and is not limited herein.

For example, assume a certain function as follows:

aabbccddaa

abcdef123

Performing a sliding window operation of size 3 yields the following character strings: "aab", "abb", "bbc", "bcc", "ccd", "cdd", "dda", "daa", "abc", "bcd", "cde", "def", "ef1", "f12", "123".

Step B2: for each function in each first intermediate file, the plurality of character strings extracted from the function are aggregated to obtain a feature sequence corresponding to each function in each first intermediate file.

After a plurality of character strings are derived from a function in a certain first intermediate file, the character strings may be gathered into one set, i.e. a feature sequence, for similarity comparison with the feature sequences of functions in other first intermediate files.

Continuing with the above example, for a function in a first intermediate file, integrating a plurality of strings extracted therefrom, the resulting feature sequence is: "aab", "abb", "bbc", "bcc", "ccd", "cdd", "dda", "daa", "abc", "bcd", "cde", "def", "ef1", "f12", "123".

Step B3: whether the binary files corresponding to the two corresponding first intermediate files belong to the same source code is determined according to the similarity of the feature sequences between the two functions in the two corresponding first intermediate files.

When one function has one more or one fewer character than another function, comparing the two functions character by character from beginning to end misaligns every character after the extra or missing one, so that the calculated similarity of the two functions is low even though the two functions are in fact highly similar. For example: function 1 is "abcde" and function 2 is "abbcde". From the third character onward, the characters of function 1 and function 2 at each position are different, so that the similarity calculation result is low. In practice, the two functions are similar: function 2 merely has one more character than function 1, and the similarity calculation result of the two functions should be high.

If the method of steps B1-B3 is adopted, the feature sequence of function 1 is: "abc", "bcd", "cde", and the feature sequence of function 2 is: "abb", "bbc", "bcd", "cde". The two feature sequences share the character strings "bcd" and "cde" and differ only in the character strings that cover the extra character, so that the obtained similarity calculation result of the two functions is higher and accords with reality.
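The contrast between position-by-position comparison and comparison through sliding-window character strings can be sketched as follows. The helper names `positional_matches` and `kgrams` are hypothetical and serve only to reproduce the "abcde" / "abbcde" example.

```python
def positional_matches(a, b):
    """Count positions at which the two strings hold the same character."""
    return sum(1 for x, y in zip(a, b) if x == y)


def kgrams(text, k=3):
    """Return the list of length-k substrings of `text`."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


f1, f2 = "abcde", "abbcde"
matches = positional_matches(f1, f2)        # only the leading "ab" lines up
shared = set(kgrams(f1)) & set(kgrams(f2))  # strings present in both sequences
print(matches, sorted(shared))
```

In this sketch only two positions match in the direct comparison, whereas two of the three character strings of function 1 also occur in the feature sequence of function 2.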

According to the above, the sliding window is adopted to extract a plurality of character strings from the functions, and the similarity calculation is performed between the two functions through the plurality of character strings, so that the problem of low similarity calculation result caused by small difference between the two functions can be avoided, the accuracy of similarity calculation between the functions is improved, and the accuracy of homology analysis of the binary file is further improved.

Further, as a refinement of the above step B3, when similarity calculation is performed on the feature sequences of the two functions, the character strings in the two feature sequences do not need to be compared with each other in order; instead, the ratio of the number of identical character strings in the two feature sequences to the total number of character strings in the feature sequences can be determined.

Specifically, the step B3 may include:

Step B31: the total number of character strings in the feature sequence corresponding to each function of each first intermediate file, and the number of identical character strings in the corresponding two feature sequences, are determined.

In each first intermediate file, there may be a plurality of functions. Each function has one feature sequence, and each feature sequence includes a plurality of character strings. For the feature sequence of each function in each first intermediate file, the number of character strings contained therein, that is, the total number of character strings in the feature sequence, needs to be determined. The character strings in the feature sequences corresponding to the two functions are then compared to determine the number of identical character strings.

For example, assume that the feature sequence of function 1 is: "abc", "bcd", "cde", "def", and the feature sequence of function 2 is: "abb", "bbc", "bcd", "cde". In the feature sequence of function 1, the number of character strings is 4. In the feature sequence of function 2, the number of character strings is 4. The number of identical character strings in the two feature sequences is 2.

Step B32: the average number of character strings in the two feature sequences is determined according to the total number of character strings in each feature sequence.

Continuing with the example above, the total number of strings in the feature sequence of function 1 is 4, and the total number of strings in the feature sequence of function 2 is 4. The average number of character strings in the two feature sequences is (4+4)/2=4.

Step B33: the similarity of the feature sequences between every two functions in the corresponding two first intermediate files is obtained based on the quotient of the number of identical character strings and the average number.

Continuing with the above example, in the feature sequences of function 1 and function 2, the number of identical strings is 2 and the average number of strings is 4. The similarity of the feature sequences of function 1 and function 2 is 2/4=50%.
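Steps B31 to B33 can be sketched in Python as follows. The function name `feature_similarity` is hypothetical. The description does not state how repeated character strings are counted; counting them as a multiset through `collections.Counter`, and returning 0 for two empty sequences, are assumptions of this sketch.

```python
from collections import Counter


def feature_similarity(seq_a, seq_b):
    """Number of identical strings divided by the average sequence length."""
    if not seq_a and not seq_b:
        return 0.0  # assumption: two empty sequences are treated as dissimilar
    # Multiset intersection: a repeated string counts as often as it occurs in both.
    shared = sum((Counter(seq_a) & Counter(seq_b)).values())
    average = (len(seq_a) + len(seq_b)) / 2
    return shared / average


function_1 = ["abc", "bcd", "cde", "def"]
function_2 = ["abb", "bbc", "bcd", "cde"]
similarity = feature_similarity(function_1, function_2)
print(f"{similarity:.0%}")
```

For the example sequences the sketch gives 2 / ((4 + 4) / 2), which agrees with the value of 50% stated above.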

Step B34: judging whether the similarity is larger than a preset similarity. If yes, go to step B35, if no, go to step B36.

After the similarity of the feature sequences of the functions between the first intermediate files is obtained, the similarity can be compared with a preset similarity so as to determine whether the corresponding binary files are homologous.

The preset similarity can be any value, and can be determined according to the actual homology judgment requirement. When higher accuracy of binary homology determination is required, the preset similarity may be set higher, for example: 90%, 99%, 100%, etc. When higher comprehensiveness of binary homology determination is required, that is, when as many homologous files as possible should be found, the preset similarity may be set lower, for example: 50%, 60%, 70%, etc.

Step B35: determining that the binary files corresponding to the two corresponding first intermediate files belong to the same source code.

When the feature sequences of the functions in the two first intermediate files are similar to each other, the running logic of the two first intermediate files is similar, so that the two binary files corresponding to the two first intermediate files can be determined to be homologous, namely, from the same source code.

In the case where the number of the first intermediate files is three or more, the first intermediate files are judged two by two.

Step B36: determining that the binary files corresponding to the two corresponding first intermediate files do not belong to the same source code.

When the similarity of the feature sequences of the functions in the two first intermediate files is low, the running logic of the two first intermediate files is dissimilar, so that it can be determined that the two binary files corresponding to the two first intermediate files are not homologous, namely, that they come from different source codes.
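The decision of steps B34 to B36, applied pairwise when three or more first intermediate files are present, can be sketched as follows. The names `judge_pairs` and `toy_similarity`, the dictionary of files and the preset similarity of 0.9 are all assumptions for illustration. How function-level similarities are combined into one score per pair of files is not specified here, so each file is represented by a single feature sequence in this sketch.

```python
from itertools import combinations


def toy_similarity(seq_a, seq_b):
    """Stand-in score: shared distinct strings over the average length."""
    shared = len(set(seq_a) & set(seq_b))
    return shared / ((len(seq_a) + len(seq_b)) / 2)


def judge_pairs(files, similarity, preset_similarity=0.9):
    """Return {(name_a, name_b): True/False} for every pair of files."""
    verdicts = {}
    for name_a, name_b in combinations(sorted(files), 2):
        score = similarity(files[name_a], files[name_b])
        # Homologous only if the score is larger than the preset similarity.
        verdicts[(name_a, name_b)] = score > preset_similarity
    return verdicts


files = {
    "a": ["abc", "bcd", "cde"],
    "b": ["abc", "bcd", "cde"],
    "c": ["xyz", "yzz", "zzz"],
}
verdicts = judge_pairs(files, toy_similarity)
print(verdicts)
```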

From the above, it can be seen that the comparison efficiency of the feature sequences can be improved by calculating the similarity between the two feature sequences according to the number of the character strings in the feature sequences of the function, so that the efficiency of calculating the similarity of the feature sequences can be improved, and the efficiency of homologous analysis of the binary file can be improved.

Further, as a refinement of the above step B, the difference between two functions is sometimes so obvious that the similarity between them can be judged to be low without performing the similarity calculation. Therefore, when similarity calculation is performed between functions, pairs of functions with a large difference may be filtered out in advance.

Specifically, the step B may include:

Step B01: an objective function pair for similarity comparison is determined from the corresponding two first intermediate files.

The parameter quantity difference and the jump instruction quantity difference of the two functions in the objective function pair are both smaller than or equal to a preset difference value.

Each function includes one or more parameters, as well as one or more jump instructions. If the two functions differ greatly in these quantities, the two functions are unlikely to be the same, and continuing the similarity calculation between them only increases the calculation burden without changing the final result. Therefore, when the similarity calculation is performed between every two functions in the first intermediate files, the function pairs with a large difference in parameter quantity or jump instruction quantity are removed from the similarity calculation queue, and only the objective function pairs that may be the same are left for the similarity calculation.

The preset difference here may be 0, 1, 2, etc. Specific values of the preset difference are not limited herein.

For example, assume that the preset difference is 1, the number of parameters of function 1 in the first intermediate file a is 2 and its number of jump instructions is 1, the number of parameters of function 2 in the first intermediate file b is 3 and its number of jump instructions is 2, and the number of parameters of function 3 in the first intermediate file c is 5 and its number of jump instructions is 1. For functions 1 and 2, the parameter number differs by 1 and the jump instruction number differs by 1, both of which are equal to the preset difference, so the function 1 in the first intermediate file a and the function 2 in the first intermediate file b form an objective function pair. For functions 1 and 3, the parameter number differs by 3 and the jump instruction number differs by 0. Since the parameter number difference is larger than the preset difference, the function pair formed by the function 1 in the first intermediate file a and the function 3 in the first intermediate file c is not an objective function pair, and the similarity calculation is not carried out on this function pair. Function 2 and function 3 are judged in the same manner. This determination is required between the functions of every two files.
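The screening in the above example can be sketched as follows. The class `FunctionStats` and the function `is_target_pair` are hypothetical names introduced for illustration; the counts are taken from the example, with a preset difference of 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStats:
    name: str
    param_count: int
    jump_count: int


def is_target_pair(f, g, preset_difference=1):
    """True if both count differences are within the preset difference."""
    return (abs(f.param_count - g.param_count) <= preset_difference
            and abs(f.jump_count - g.jump_count) <= preset_difference)


function_1 = FunctionStats("function 1", param_count=2, jump_count=1)
function_2 = FunctionStats("function 2", param_count=3, jump_count=2)
function_3 = FunctionStats("function 3", param_count=5, jump_count=1)

print(is_target_pair(function_1, function_2))  # compared further
print(is_target_pair(function_1, function_3))  # filtered out
print(is_target_pair(function_2, function_3))  # filtered out
```

Only pairs for which `is_target_pair` returns true would be passed on to the similarity calculation.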

Step B02: whether the binary files corresponding to the two corresponding first intermediate files belong to the same source code is determined according to the similarity of the objective function pairs.

After the objective function pairs are obtained, similarity calculation can be performed for each objective function pair. The two functions may be compared directly, or a plurality of character strings may be obtained with a sliding window so that the similarity of the objective function pair is determined according to the number of corresponding character strings. For details, refer to the foregoing embodiments, which are not repeated herein.

According to the above, whether the two functions are similar or not can be rapidly determined through the parameter quantity difference and the jump instruction quantity difference in the two functions, so that similarity calculation is only performed on the function pairs with the determined similarity, the calculated amount of the similarity is reduced, and further the efficiency of binary file homology analysis is improved.

Further, as an extension to the above step B, a function generally includes parameters. The parameters in a function differ when the function runs in different environments, but they do not affect the actual running logic of the function. Therefore, when similarity calculation is performed between functions, the influence of the parameters in the functions on the similarity calculation needs to be eliminated first.

Specifically, before the step B, the method may further include:

Step C: constant replacement is carried out on the parameters of the functions in each first intermediate file to obtain a third intermediate file corresponding to each first intermediate file.

For binary files under different architectures, the specific parameter content may also vary. After the binary files are subjected to language conversion and unification to obtain first intermediate files, the parameters in the corresponding two first intermediate files still differ. The differences in these parameters do not determine the running logic of the files, but they affect the calculation of the similarity between the files. If these parameters were deleted directly, the running logic of the files would be affected. Therefore, the parameters in the corresponding two first intermediate files need to be replaced by constants.

In practical applications, the parameters may be jump addresses, global variable names, local variable names, etc. The parameters are replaced by constant characters, so that the running logic of the first intermediate files is not affected, and the first intermediate files can be unified on specific parameters. The constant character to be replaced may be any character having the same number of digits as the parameter, and is not limited thereto.
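A minimal sketch of such constant replacement on LLVM-IR-like instruction text is given below. The regular expressions, the placeholders `%V` and `@G`, and the function name `normalize_instruction` are assumptions for illustration. The sketch uses fixed placeholders rather than constants with the same number of characters as the replaced parameter, which is a simplification.

```python
import re

# Assumed patterns: "%name" covers local variables and jump labels,
# "@name" covers global variables.
LOCAL_OR_LABEL = re.compile(r"%[-\w$.]+")
GLOBAL = re.compile(r"@[-\w$.]+")


def normalize_instruction(instruction):
    """Replace local names, labels and global names with constant characters."""
    instruction = LOCAL_OR_LABEL.sub("%V", instruction)
    return GLOBAL.sub("@G", instruction)


from_arm = "%7 = add i32 %5, 1"
from_mips = "%12 = add i32 %9, 1"
print(normalize_instruction(from_arm))
print(normalize_instruction(from_mips))
print(normalize_instruction("br i1 %cmp, label %bb1, label %bb2"))
```

After replacement, two instructions that differ only in variable numbering become identical text, while the opcode and operand types that carry the running logic are preserved.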

Accordingly, the step B may include: and determining whether binary files corresponding to the corresponding two third intermediate files belong to the same source code according to the similarity of the functions in the corresponding two third intermediate files.

According to the method, before similarity calculation is performed on the functions, constant replacement is performed on parameters in each function, so that influence of different parameters on accuracy of similarity calculation results of the functions can be avoided, normal operation of the functions can be ensured, accuracy of similarity calculation of the functions is improved, and accuracy of homology analysis of binary files is improved.

Finally, a complete embodiment is used to explain the method for determining the homologous binary file provided by the embodiment of the application again.

Fig. 2 is a schematic overall architecture of a method for determining a homologous binary file according to an embodiment of the present application, and referring to fig. 2, the architecture may include: an LLVM IR language processing optimization module 201, a function feature extraction module 202, and a similarity comparison module 203.

The LLVM IR language processing optimization module 201 includes: LLVM IR language conversion sub-module 201a and LLVM IR language optimization sub-module 201b. The LLVM IR language conversion sub-module 201a is configured to convert each binary file into LLVM IR language. The LLVM IR language optimizing submodule 201b is used for optimizing the LLVM IR language through various optimizing strategies, eliminating redundant codes and assimilating execution flow logic.

The function feature extraction module 202 includes: a k-gram feature sequence extraction submodule 202a and a function feature screening extraction submodule 202b. The k-gram feature sequence extraction submodule 202a is used for carrying out unified processing on the functions in the LLVM IR language after redundant codes are eliminated, and, for each function, extracting k-gram vector features of size 3 with a sliding window in units of instructions to form a k-gram feature sequence. The function feature screening extraction submodule 202b is used for counting the parameter number and the jump instruction number in the functions, so that, according to the counting result, similarity calculation is not performed on functions whose parameter number difference or jump instruction number difference is larger than 1.

The similarity comparison module 203 is configured to divide the number of k-grams shared by two functions by the average number of k-grams of the two functions to obtain the similarity between the functions, and finally output a homology comparison result between the binary files.

Fig. 3 is a second flow chart of a method for determining a homologous binary file according to an embodiment of the present application, and referring to fig. 3, the method may include:

S301: the binary file is processed with RetDec to obtain the LLVM IR language file of the binary file.

S302: the LLVM IR language file is optimized through the optimization tool OPT in the LLVM compiling tool chain.

The OPT tool is an optimizer and analyzer of LLVM modularization, takes LLVM IR language as input, performs specified optimization or analysis on the LLVM IR language, and then outputs an optimization file or analysis result.

The optimization mainly comprises the following passes:

(1) Aggressive Dead Code Elimination: dead code is eliminated, and instructions that are never executed are deleted.

(2) Bit-Tracking Dead Code Elimination: bit-tracking dead code is eliminated, and useless bitwise operation instructions are deleted.

(3) Lower Switch: the Switch instruction is rewritten by using branch instructions, the Switch structure in the code is eliminated, and the difference caused by the jump instruction is removed.

(4) Promote Memory to Register: memory references are replaced with register references for Alloca instructions that are used only by load and store instructions.

(5) Redundant Dbg Instruction Elimination: eliminating redundant debug instructions.

(6) Dead Global Elimination: the unused global variables are eliminated.

S303: the features of each function in the optimized LLVM IR language file are extracted. This mainly comprises the following steps:

(1) Unified processing is carried out on each function, and the jump addresses, global variable names and local variable names are replaced with constant characters.

(2) A sliding window operation of size 3 in units of instructions is performed on each function to obtain a k-gram feature sequence of the LLVM IR instructions of the function, namely the instruction feature sequence of the function. At the same time, the number of parameters and the number of jump instructions in the function are recorded for the subsequent function screening.
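The instruction-level extraction in S303 can be sketched as follows. The names `instruction_kgrams`, `count_jumps` and `JUMP_OPCODES` are hypothetical. Treating the LLVM IR terminators `br`, `switch` and `indirectbr` as the jump instructions is an assumption of this sketch, and the function body shown is an invented, already normalized example.

```python
# Assumption: these LLVM IR opcodes are the ones counted as jump instructions.
JUMP_OPCODES = ("br", "switch", "indirectbr")


def instruction_kgrams(instructions, k=3):
    """Slide a window of k instructions over the function body."""
    return [tuple(instructions[i:i + k])
            for i in range(len(instructions) - k + 1)]


def count_jumps(instructions):
    """Count instructions whose first token is a jump opcode."""
    count = 0
    for instruction in instructions:
        tokens = instruction.split()
        if tokens and tokens[0] in JUMP_OPCODES:
            count += 1
    return count


body = [
    "%V = load i32, i32* @G",
    "%V = icmp eq i32 %V, 0",
    "br i1 %V, label %V, label %V",
    "ret i32 %V",
]
grams = instruction_kgrams(body)
jumps = count_jumps(body)
print(len(grams), jumps)
```

Each k-gram is kept as a tuple of three consecutive instructions, so it can be counted or compared in the same way as the character strings of the earlier examples.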

S304: the similarity of the k-gram feature sequences of two functions is calculated. This mainly comprises the following steps:

(1) Function filtering is performed according to the number of function parameters and the number of jump instructions, and similarity comparison calculation is not performed between functions whose difference is larger than 1.

(2) For the two functions on which the similarity comparison is made, the number of k-grams that are common to both k-gram feature sequences is calculated.

(3) The number of k-grams shared by two k-gram feature sequences is divided by the number of k-grams averaged over the two k-gram feature sequences to obtain a similarity between the two functions, with a larger value indicating that the two feature sequences are more similar.

Here, the k-gram is an algorithm based on a statistical language model. The basic idea is to perform a sliding window operation of size k on the content of a text byte by byte, so as to form a sequence of byte fragments of length k. Each fragment is called a gram. The occurrence frequency of all the grams is counted to form a key gram list, namely the vector feature space of the text. The k-gram can be used to evaluate the degree of difference between two texts.
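The byte-level idea described here can be sketched as follows. The function name `byte_kgram_counts` and the sample input are assumptions for illustration; the frequency table plays the role of the gram list.

```python
from collections import Counter


def byte_kgram_counts(data: bytes, k: int = 3) -> Counter:
    """Count every length-k byte fragment produced by a sliding window."""
    return Counter(data[i:i + k] for i in range(len(data) - k + 1))


counts = byte_kgram_counts(b"abcabcab")
for gram, frequency in sorted(counts.items()):
    print(gram, frequency)
```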

S305: based on the calculated similarities, it is determined whether the corresponding two binary files are homologous.

From the above, it can be seen that by adopting the LLVM IR language which is close to assembly language and can be optimized as the intermediate expression of the cross-architecture binary file and optimizing the LLVM IR language, the influence on homology judgment caused by redundancy and different execution flow structures after the binary file is converted into the intermediate language can be reduced. And performing feature extraction on the function of the LLVM IR language by adopting a k-gram algorithm, and simultaneously considering the parameter number and the jump instruction number of the function, screening and matching the function, so that two similar functions can be more accurately identified, and further the accuracy of homology analysis of two binary files is improved.

The method for determining the homologous binary file can improve the accuracy of the homologous analysis of the binary file, and is fully verified in practical experiments.

Taking BusyBox v1.25.1, wget v1.15 and ptunel v1.42 as experimental objects, samples for the ARM and MIPS architectures were compiled in an Ubuntu 18.04 environment by using the cross compiling tools arm-linux-gnueabi and mips-linux-gnu. The samples were tested with the method provided by the embodiment of the application. Fig. 4 is a schematic diagram showing the effect of the method for determining the homologous binary file in the embodiment of the present application. Referring to fig. 4, for the binary file samples of the three tools BusyBox v1.25.1, wget v1.15 and ptunel v1.42 under the ARM and MIPS architectures, the accuracy of homology determination after LLVM IR optimization is improved compared with that before optimization. Compared with the existing binary comparison algorithm using an intermediate expression (Multi-MH), in the BusyBox (mips-arm) experiment, the accuracy of Multi-MH is 80.0%, and the accuracy of the present application is 91.07%. Therefore, by adopting the method for determining the homologous binary file provided by the embodiment of the application, the accuracy of the homology analysis of binary files can be improved in practice.

Based on the same inventive concept, as an implementation of the method, the embodiment of the application also provides a device for determining the homologous binary file. Fig. 5 is a schematic structural diagram of a device for determining a homologous binary file according to an embodiment of the present application, and referring to fig. 5, the device may include:

an obtaining module 501, configured to obtain a plurality of binary files;

the conversion module 502 is configured to convert the code in each binary file into a preset low-level language, so as to obtain a first intermediate file corresponding to each binary file, where an instruction in the preset low-level language has a mapping relationship with an instruction in a plurality of different types of high-level languages;

the determining module 503 is configured to determine whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of codes in the two corresponding first intermediate files.

Further, the apparatus further comprises: the processing module is used for determining redundant codes in the codes of each first intermediate file, wherein the redundant codes are codes irrelevant to the actual running logic of the corresponding binary file; deleting or modifying the redundant codes in each first intermediate file into a preset format to obtain a second intermediate file corresponding to each first intermediate file;

the determining module is specifically configured to determine whether binary files corresponding to the two corresponding second intermediate files belong to the same source code according to the similarity of codes in the two corresponding second intermediate files.

Further, the redundant code includes at least one of: an unexecuted instruction, a bitwise operation instruction irrelevant to the operation, a Switch instruction, a memory reference instruction, a debug instruction that does not change the execution flow, and an unused global variable;

the processing module is specifically configured to delete an instruction that is not executed in each first intermediate file; and/or deleting the bitwise operation instruction which is irrelevant to the operation in each first intermediate file; and/or, rewriting the Switch instruction in each first intermediate file into a branch instruction; and/or, rewriting the memory reference instruction in each first intermediate file into a register reference instruction; and/or deleting the debugging instructions which do not change the execution flow in each first intermediate file; and/or deleting the unused global variable in each first intermediate file.

Further, the determining module is specifically configured to determine, according to the similarity of the functions in the two corresponding first intermediate files, whether the binary files corresponding to the two corresponding first intermediate files belong to the same source code.

Further, the determining module is specifically configured to extract a plurality of character strings from each function of each first intermediate file by using a sliding window with a preset number of characters; for each function in each first intermediate file, aggregate the plurality of character strings extracted from the function to obtain a feature sequence corresponding to each function in each first intermediate file; and determine whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of feature sequences between the two functions in the two corresponding first intermediate files.

Further, the determining module is specifically configured to determine a total number of character strings in the feature sequence corresponding to each function of each first intermediate file, and a number of identical character strings in the corresponding two feature sequences; determining the average number of the character strings in the feature sequences according to the total number of the character strings in each feature sequence; obtaining the similarity of feature sequences between every two functions in the corresponding two first intermediate files based on the quotient of the number of the same character strings and the average number; judging whether the similarity is larger than a preset similarity or not; if yes, determining that binary files corresponding to the two corresponding first intermediate files belong to the same source code; if not, determining that the binary files corresponding to the two corresponding first intermediate files do not belong to the same source code.

Further, the determining module is specifically configured to determine, from the two corresponding first intermediate files, an objective function pair for performing similarity comparison, where a parameter number difference and a jump instruction number difference of two functions in the objective function pair are smaller than or equal to a preset difference; and determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of the objective function pairs.

Further, the determining module is further configured to perform constant replacement on parameters of the function in each first intermediate file to obtain a third intermediate file corresponding to each first intermediate file;

the determining module is specifically configured to determine, according to the similarity of the functions in the corresponding two third intermediate files, whether binary files corresponding to the corresponding two third intermediate files belong to the same source code.

Further, the preset low-level language is LLVM IR language, and the conversion module is specifically configured to convert codes in each binary file into LLVM IR language, so as to obtain LLVM IR language files corresponding to each binary file.

It should be noted here that the description of the above device embodiments is similar to the description of the method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.

Based on the same inventive concept, the embodiment of the application also provides electronic equipment. Fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application, and referring to fig. 6, the electronic device may include: a processor 601, a memory 602, a bus 603; wherein, the processor 601 and the memory 602 complete communication with each other through the bus 603; the processor 601 is operative to invoke program instructions in the memory 602 to perform the methods in one or more embodiments described above.

It should be noted here that the description of the above embodiments of the electronic device is similar to the description of the above embodiments of the method, with similar advantageous effects as the embodiments of the method. For technical details not disclosed in the embodiments of the electronic device of the present application, please refer to the description of the method embodiments of the present application for understanding.

Based on the same inventive concept, embodiments of the present application also provide a computer-readable storage medium, which may include: a stored program; wherein the program, when executed, controls a device in which the storage medium resides to perform the methods of one or more of the embodiments described above.

It should be noted here that the description of the above embodiments of the storage medium is similar to the description of the above embodiments of the method, with similar advantageous effects as the embodiments of the method. For technical details not disclosed in the storage medium embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.

The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for determining a homologous binary file, the method comprising:

acquiring a plurality of binary files;

converting codes in each binary file into a preset low-level language to obtain a first intermediate file corresponding to each binary file, wherein the mapping relation exists between instructions in the preset low-level language and instructions in a plurality of different types of high-level languages;

and determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of codes in the two corresponding first intermediate files.

2. The method of claim 1, wherein before determining whether binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of codes in the respective two first intermediate files, the method further comprises:

determining redundant codes in codes of each first intermediate file, wherein the redundant codes are codes irrelevant to the actual running logic of the corresponding binary file;

deleting or modifying the redundant codes in each first intermediate file into a preset format to obtain a second intermediate file corresponding to each first intermediate file;

wherein determining whether the binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of codes in the respective two first intermediate files comprises:

and determining whether binary files corresponding to the two corresponding second intermediate files belong to the same source code according to the similarity of codes in the two corresponding second intermediate files.

3. The method of claim 2, wherein the redundant code comprises at least one of: an un-executed instruction, a bitwise operation instruction which is irrelevant to operation, a Switch instruction, a memory reference instruction, a debug instruction which does not change an execution flow, and an unused global variable, and wherein deleting or modifying the redundant code in each first intermediate file into a preset format comprises:

deleting the un-executed instructions in each first intermediate file; and/or

deleting the bitwise operation instructions which are irrelevant to operation in each first intermediate file; and/or

rewriting the Switch instructions in each first intermediate file into branch instructions; and/or

rewriting the memory reference instructions in each first intermediate file into register reference instructions; and/or

deleting the debug instructions which do not change the execution flow in each first intermediate file; and/or

deleting the unused global variables in each first intermediate file.
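Two of the clean-up steps listed in claim 3 can be illustrated with a minimal sketch. It assumes the first intermediate file is textual LLVM IR (as in claim 9) and handles only debug intrinsic calls and unused global variables; the function name `strip_redundant` and the line-based matching are hypothetical simplifications, not the applicant's implementation.

```python
import re

# Matches a textual LLVM IR global variable definition such as "@g = global i32 0".
GLOBAL_DEF = re.compile(r"^(@[\w.$-]+)\s*=.*\bglobal\b")


def strip_redundant(ir_text: str) -> str:
    """Remove debug intrinsic lines and unused global variables (sketch)."""
    # Debug intrinsics (llvm.dbg.declare, llvm.dbg.value) do not change
    # the execution flow, so any line mentioning them is dropped.
    lines = [ln for ln in ir_text.splitlines() if "@llvm.dbg." not in ln]

    kept = []
    for i, ln in enumerate(lines):
        match = GLOBAL_DEF.match(ln)
        if match:
            # A global is treated as unused when no other line refers to it.
            ref = re.compile(re.escape(match.group(1)) + r"(?![\w.$-])")
            if not any(ref.search(o) for j, o in enumerate(lines) if j != i):
                continue
        kept.append(ln)
    return "\n".join(kept)
```

A production pass would operate on the parsed IR rather than on text lines; the text-based form is used here only to keep the example self-contained.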

4. The method according to claim 1, wherein determining whether the binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of the codes in the respective two first intermediate files comprises:

and determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of the functions in the two corresponding first intermediate files.

5. The method according to claim 4, wherein determining whether the binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of the functions in the respective two first intermediate files comprises:

extracting a plurality of character strings from each function of each first intermediate file by using a sliding window with a preset number of characters;

for each function in each first intermediate file, aggregating the plurality of character strings extracted from the function to obtain a feature sequence corresponding to each function in each first intermediate file;

and determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of feature sequences between the two functions in the two corresponding first intermediate files.
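The sliding-window extraction of claim 5 can be sketched as follows. The function name `window_features`, the default window size, and the removal of whitespace before windowing are assumptions made for illustration; the claim itself only specifies a window with a preset number of characters.

```python
from typing import List


def window_features(function_text: str, window: int = 4) -> List[str]:
    """Return the feature sequence of one function (sketch).

    A window of `window` characters slides over the function text one
    character at a time; the extracted strings are aggregated in order.
    """
    # Assumption: whitespace is dropped so layout does not affect features.
    text = "".join(function_text.split())
    if len(text) < window:
        # Text shorter than the window yields a single (or no) string.
        return [text] if text else []
    return [text[i:i + window] for i in range(len(text) - window + 1)]
```

For example, `window_features("abcde", 3)` yields the three strings `abc`, `bcd`, and `cde`.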

6. The method according to claim 5, wherein determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of feature sequences between the two functions in the two corresponding first intermediate files comprises:

determining the total number of character strings in the feature sequences corresponding to each function of each first intermediate file and the number of identical character strings in the corresponding two feature sequences;

determining the average number of the character strings in the feature sequences according to the total number of the character strings in each feature sequence;

obtaining the similarity of feature sequences between every two functions in the corresponding two first intermediate files based on the quotient of the number of the same character strings and the average number;

determining whether the similarity is greater than a preset similarity;

if so, determining that the binary files corresponding to the two corresponding first intermediate files belong to the same source code;

if not, determining that the binary files corresponding to the two corresponding first intermediate files do not belong to the same source code.
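The similarity computation of claim 6 can be sketched for a single pair of functions as the number of identical character strings divided by the average number of strings in the two feature sequences. The names `feature_similarity` and `exceeds_preset_similarity`, the threshold value 0.8, and the multiset treatment of repeated strings are assumptions; the claim does not state how repeated strings are counted or how per-function results are combined into a file-level decision.

```python
from collections import Counter
from typing import Sequence


def feature_similarity(seq_a: Sequence[str], seq_b: Sequence[str]) -> float:
    """Quotient of the identical-string count and the average string count."""
    if not seq_a and not seq_b:
        return 0.0
    # Assumption: repeated strings are matched as a multiset intersection.
    identical = sum((Counter(seq_a) & Counter(seq_b)).values())
    average = (len(seq_a) + len(seq_b)) / 2
    return identical / average


def exceeds_preset_similarity(seq_a: Sequence[str], seq_b: Sequence[str],
                              preset: float = 0.8) -> bool:
    """True when the similarity is strictly greater than the preset value."""
    return feature_similarity(seq_a, seq_b) > preset
```

With sequences of 3 and 4 strings sharing 2 strings, the average is 3.5 and the similarity is 2 / 3.5 = 4/7.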

7. The method according to claim 4, wherein determining whether the binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of the functions in the respective two first intermediate files comprises:

determining a target function pair for similarity comparison from the corresponding two first intermediate files, wherein the difference in the number of parameters and the difference in the number of jump instructions between the two functions in the target function pair are smaller than or equal to a preset difference value;

and determining whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of the target function pair.
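The pre-filtering of claim 7 can be sketched as a selection over all cross-file function pairs. The `FunctionSummary` type, the name `candidate_pairs`, and the use of one shared `max_diff` for both the parameter-count difference and the jump-instruction-count difference are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class FunctionSummary:
    """Hypothetical per-function statistics taken from an intermediate file."""
    name: str
    param_count: int
    jump_count: int


def candidate_pairs(funcs_a: List[FunctionSummary],
                    funcs_b: List[FunctionSummary],
                    max_diff: int = 1
                    ) -> List[Tuple[FunctionSummary, FunctionSummary]]:
    """Keep only pairs whose count differences are within max_diff."""
    return [
        (fa, fb)
        for fa, fb in product(funcs_a, funcs_b)
        if abs(fa.param_count - fb.param_count) <= max_diff
        and abs(fa.jump_count - fb.jump_count) <= max_diff
    ]
```

Filtering in this way limits the more expensive feature-sequence comparison to pairs that are structurally plausible matches.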

8. The method of claim 4, wherein before determining whether binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of functions in the respective two first intermediate files, the method further comprises:

performing constant replacement on the parameters of the functions in each first intermediate file to obtain a third intermediate file corresponding to each first intermediate file;

wherein determining whether the binary files corresponding to the respective two first intermediate files belong to the same source code according to the similarity of the functions in the respective two first intermediate files comprises:

and determining whether binary files corresponding to the corresponding two third intermediate files belong to the same source code according to the similarity of the functions in the corresponding two third intermediate files.
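One possible reading of the constant replacement in claim 8 is that every parameter reference in a function is rewritten to the same fixed token, so that differing parameter names do not lower the similarity. The sketch below follows that reading only; the name `replace_parameters`, the placeholder `%P`, and the requirement that the caller supplies the parameter names are all assumptions, and the claim does not define the replacement in this detail.

```python
import re
from typing import Iterable


def replace_parameters(function_text: str, param_names: Iterable[str],
                       placeholder: str = "%P") -> str:
    """Rewrite each listed parameter name to one constant placeholder."""
    for name in param_names:
        # The lookahead prevents "%a" from matching inside "%ab".
        pattern = re.compile(re.escape(name) + r"(?![\w.$-])")
        function_text = pattern.sub(lambda _m: placeholder, function_text)
    return function_text
```

Under this reading, two functions that differ only in parameter naming produce identical text after replacement.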

9. The method according to any one of claims 1 to 8, wherein the preset low-level language is the LLVM IR language, and converting the codes in each binary file into the preset low-level language to obtain a first intermediate file corresponding to each binary file comprises:

and converting codes in each binary file into LLVM IR language to obtain LLVM IR language files corresponding to each binary file.

10. A device for determining a homologous binary file, the device comprising:

an acquisition module, configured to acquire a plurality of binary files;

a conversion module, configured to convert codes in each binary file into a preset low-level language to obtain a first intermediate file corresponding to each binary file, wherein the instructions in the preset low-level language have a mapping relation with the instructions in a plurality of different types of high-level languages;

and a determining module, configured to determine whether binary files corresponding to the two corresponding first intermediate files belong to the same source code according to the similarity of codes in the two corresponding first intermediate files.

11. An electronic device, the electronic device comprising: a processor, a memory, and a bus; wherein the processor and the memory communicate with each other through the bus; and the processor is configured to invoke program instructions in the memory to perform the method of any one of claims 1 to 9.

12. A computer-readable storage medium, the storage medium comprising: a stored program; wherein the program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 9.