CN114492365A - Method for determining similarity between binary files, computing device and storage medium

Info

Publication number: CN114492365A
Application number: CN202210022271.1A
Authority: CN
Inventors: 杨晋
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-01-10
Filing date: 2022-01-10
Publication date: 2022-05-13

Abstract

The embodiments of the present application provide a method for determining similarity between binary files, a computing device, and a storage medium. In the embodiments, functions in a plurality of binary files are acquired; function features of each function are determined based on the unchangeable information in the function; the file feature corresponding to each binary file is determined according to the function features; and the similarity between the binary files is determined according to the file feature values corresponding to the file features. Because the function features are determined based on the unchangeable information in the function, information that causes interference is eliminated and only the interference-free unchangeable information is retained, so the function features, and therefore the file features, can be determined more accurately. Determining the similarity between the binary files according to the file feature values corresponding to the file features further enables automatic, rapid, and more accurate identification of binary files.

Description

Method for determining similarity between binary files, computing device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, a computing device, and a storage medium for determining similarity between binary files.

Background

For code detection, the source code cannot be acquired in many scenarios, so the corresponding binary file needs to be acquired instead. However, binary files compiled from the same source code may differ widely even when the source is only slightly modified; for example, there may be address changes, address offsets, symbol changes, and the like. These factors affect the similarity result of the binary files corresponding to the source code, and in turn prevent the meaning of the corresponding source code from being accurately identified.

Disclosure of Invention

Aspects of the present disclosure provide a method, a computing device, and a storage medium for determining similarity between binary files, so that the similarity between binary files corresponding to a source code can be determined more accurately and quickly.

The embodiment of the application provides a method for determining similarity between binary files, which comprises the following steps: acquiring functions in a plurality of binary files; determining function characteristics of the function based on the unchangeable information in the function; determining the file characteristics corresponding to the binary file according to the function characteristics; and determining the similarity among the plurality of binary files according to the file characteristic value corresponding to the file characteristic.

An embodiment of the present application further provides a computing device, including: a memory, a processor; the memory for storing a computer program; the processor executing the computer program to: acquiring functions in a plurality of binary files; determining function characteristics of the function based on the unchangeable information in the function; determining the file characteristics corresponding to the binary file according to the function characteristics; and determining the similarity among the plurality of binary files according to the file characteristic value corresponding to the file characteristic.

Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which when executed by one or more processors causes the one or more processors to implement the steps of the above-mentioned method.

In the embodiment of the application, functions in a plurality of binary files are obtained; determining function characteristics of the function based on the unchangeable information in the function; determining the file characteristics corresponding to the binary file according to the function characteristics; and determining the similarity among the binary files according to the file characteristic value corresponding to the file characteristic.

Because the function characteristics of the function are determined based on the unchangeable information in the function, information that causes interference can be eliminated and the interference-free unchangeable information is retained. The corresponding function characteristics can therefore be determined more accurately, and so can the file characteristics. Determining the similarity among the binary files according to the file characteristic values corresponding to the file characteristics further enables automatic, rapid, and more accurate identification of the binary files.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flowchart illustrating a method for determining similarity between binary files according to an exemplary embodiment of the present application;

FIG. 2 is a diagram illustrating a process of determining a file feature value according to an exemplary embodiment of the present application;

FIG. 3 is a diagram illustrating function feature extraction according to an exemplary embodiment of the present application;

FIG. 4 is a schematic structural diagram of a system for determining similarity between binary files according to an exemplary embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus for determining similarity between binary files according to an exemplary embodiment of the present application;

FIG. 6 is a schematic structural diagram of a computing device according to an exemplary embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

According to the foregoing, even if the same source code is slightly modified, the compiled binary files may have large differences, and factors such as address change, address misalignment, symbol change, etc. all affect the recognition result, thereby causing a failure in accurately determining the similarity of the binary files.

Based on this, the embodiment of the application provides a method, a computing device and a storage medium for determining similarity between binary files, extracts corresponding feature data, removes noise data which generates interference, and finally determines the similarity of the binary files according to the feature data. The embodiment of the application has the advantages of good universality, high efficiency, high accuracy and the like.

The following describes a determination process of similarity between binary files in detail with reference to the method embodiment.

Fig. 1 is a flowchart illustrating a method for determining similarity between binary files according to an exemplary embodiment of the present application. The method 100 provided by the embodiment of the present application is executed by a computing device, such as a server (specifically, a cloud server). The method 100 comprises the steps of:

101: functions in a plurality of binary files are obtained.

102: function characteristics of each function are determined based on the unchangeable information in the function.

103: the file characteristics corresponding to each binary file are determined according to the function characteristics.

104: the similarity among the binary files is determined according to the file characteristic values corresponding to the file characteristics.

The following is set forth in detail with respect to the above steps:

101: functions in a plurality of binary files are obtained.

The binary file may be an unknown binary file or a known binary file. A binary file is a file generated by compiling source code, and a plurality of functions (or instructions) may be recorded or contained in it. As required, the file may be stored in binary form or in another form, such as a decimal representation. A plurality of functions or instructions can thus be acquired from the file.

The function or the instruction may include a prefix code (also referred to as a prefix instruction, such as a lock prefix instruction), an operation code, a string constant, and the like.

For example, a plurality of unknown binary files may be collected automatically from other platforms, such as other servers, and sent automatically to the server. Alternatively, the unknown binary files may be collected manually and then sent to the server in a manually triggered manner.

After the server acquires the binary file, a plurality of unknown functions can be acquired from each unknown binary file.

Specifically, obtaining functions in a plurality of binary files includes: acquiring a binary file, and identifying a function boundary in the binary file; and dividing the corresponding functions in the binary file according to the identified function boundaries to obtain the functions.

The function boundary can be identified by a conventional method, such as a linear scanning algorithm or a recursive descent algorithm, or by a model-based method.

Model-based identification may rely on a trained neural network model. For example, a convolutional neural network (CNN) model is trained on a large number of function samples, and the trained model can then identify functions, that is, determine the boundaries of the functions.

For example, according to the above description, the trained neural network model or function recognition tool is used to recognize the unknown function in the unknown binary file, determine the function boundary, and thus partition the function in the binary file.
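As an illustration of the conventional, scan-based approach, the following is a minimal sketch only. It assumes that the binary has already been disassembled into a list of `(address, mnemonic)` pairs (a hypothetical format chosen for this example) and that every function ends with a single `ret`; real binaries require considerably more care.

```python
from typing import List, Tuple

# Hypothetical pre-decoded form: (address, mnemonic)
Instruction = Tuple[int, str]


def split_functions(instructions: List[Instruction]) -> List[List[Instruction]]:
    """Toy linear sweep: close the current function after every `ret`."""
    functions: List[List[Instruction]] = []
    current: List[Instruction] = []
    for ins in instructions:
        current.append(ins)
        if ins[1] == "ret":
            functions.append(current)
            current = []
    if current:  # trailing instructions that never reach a return
        functions.append(current)
    return functions


listing = [
    (0x1000, "push"), (0x1001, "mov"), (0x1004, "ret"),
    (0x1005, "push"), (0x1006, "call"), (0x100B, "ret"),
]
funcs = split_functions(listing)
# First and last instruction address of each recovered function
boundaries = [(f[0][0], f[-1][0]) for f in funcs]
```

Here `boundaries` holds one start and end address pair per recovered function.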

Specifically, identifying function boundaries in a binary file includes: obtaining a plurality of function samples, and determining a function structure according to the plurality of function samples; and dividing function boundaries in the binary file according to the function structure.

For example, according to the foregoing, the server may also determine structures corresponding to different types of functions, such as a structural form of the function, by collecting a plurality of function samples and then analyzing the function samples. And then, according to the analyzed function structure, identifying the function in the unknown binary file, and determining the boundary of the function, thereby dividing the corresponding function.

102: function characteristics of the function are determined based on the unchangeable information in the function.

Unchangeable information refers to information that is not changed or modified. Its counterpart is changeable information, that is, information that can be changed or modified. For example, symbol information can be changed or deleted; user global variable information, including configuration items, may vary; and address information changes depending on the platform, the compiler, and the compilation options.

For example, as described above, the server may determine the functional characteristics of the function based on the non-variable information, e.g., remove the variable information, and then determine the functional characteristics based on the remaining information.

It should be noted that, even if the same source code is only slightly modified, the compiled binary files may differ greatly, and factors such as address change, address misalignment, and symbol change all affect the subsequent recognition result for the binary file. Removing the information that causes these changes reduces this influence.

For example, two malicious samples of the same family have the same logic function: after the program runs, a reverse shell is established (that is, the control end monitors a TCP (Transmission Control Protocol)/UDP (User Datagram Protocol) port, and the controlled end initiates a request to that port and transfers the input and output of its command line to the control end). The two malicious samples differ in only two respects. The first is that the configuration items are different, such as the C&C domain name, online packet information, and the like. The second is a symbol difference: one sample has its symbols removed and the other does not.

Therefore, the changeable information can be removed through the embodiment of the application, so that the malicious sample can still be identified.

Determining the function characteristics of the function based on the unchangeable information in the function includes: removing the changeable information in the function and taking the other information in the function (i.e., the unchangeable information) as the function feature. In other words, the method 100 may remove the changeable information in the function and retain the unchangeable information in the function to determine the function characteristic.

For example, as described above, the server may remove the variable information in the function, such as symbols, and then use other information in the function directly as the function feature.

In order to better identify the binary file, the following method can be used to extract the feature of the function.

The determining the function characteristics of the function based on the unchangeable information in the function comprises the following steps: the feature of the function is generated according to the code segment corresponding to the function (i.e. no data segment in the binary file participates).

It should be noted that the variable information is contained in a data segment in the binary file. The data segment may be used to store data.

For example, as described above, the server generates the feature of the function from the code segment corresponding to the function, so that the data segments in the binary file are excluded.

A code segment is a segment used to store logic code.

In addition, the embodiment of the application uses the code logic in the binary file as the main feature for distinguishing binary files, so the data segment can be regarded as noise data; once the content of the data segment is removed, its influence on the recognition result is reduced.
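The effect of discarding the data segment can be illustrated with a small sketch. It assumes, purely for illustration, that each sample is available as a dictionary mapping section names to raw bytes and that the logic code lives in a section named `.text`; the sample contents and domain strings are made up.

```python
from typing import Dict

# Assumption for this sketch: logic code lives only in ".text"
CODE_SECTIONS = {".text"}


def code_bytes(sections: Dict[str, bytes]) -> bytes:
    """Keep only the code sections; data sections are treated as noise."""
    return b"".join(
        data for name, data in sorted(sections.items()) if name in CODE_SECTIONS
    )


# Two made-up samples with identical logic but different configuration data
sample_a = {".text": b"\x55\x48\x89\xe5\xc3", ".data": b"cc.example-a.test\x00"}
sample_b = {".text": b"\x55\x48\x89\xe5\xc3", ".data": b"cc.example-b.test\x00"}

same_logic = code_bytes(sample_a) == code_bytes(sample_b)
```

Because only the code sections are retained, the differing configuration strings in `.data` no longer influence the comparison.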

Changeable information exists not only in the data segment described above; for finer-grained processing, the changeable information in the code segment may also be removed.

Specifically, the generating of the function feature according to the code segment corresponding to the function includes: removing operands in the code segment and taking other information in the code segment as a function characteristic.

The operand may contain a user variable, and the content of the user variable is variable. The operands may also contain address values, which may vary depending on the source code, platform, compiler, compilation options. Operands may also include registers, which may vary depending on architecture, platform, and compiler.

For example, according to the foregoing, the server may not only remove the changeable information in the data segment in the binary file, i.e., directly select the code segment for feature extraction, but also further remove the operand in the code segment. The characteristics of the function are determined from the remaining information in the code segment or the remaining information may be used directly as the characteristics of the function.

In the feature extraction stage, the operands can also be regarded as noise data, and the influence on the recognition result can be further reduced by removing the operands.

In order to determine the function features more quickly and conveniently, they may be determined in the following manner.

Specifically, taking other information in the code segment as a function characteristic includes: extracting the prefix code and the operation code in the code segment as the function characteristics.

For example, the server may remove the operands in the code segment corresponding to each function, as described above. And then, for the rest information in the code segment, acquiring a prefix code and an operation code of the code segment in the function as the function characteristics.

In order to express the function features more accurately, the operation code here may be represented by its corresponding bytecode, that is, the byte value that natively encodes the operation code. By acquiring features from the code in this way, the similarity between binary files can be calculated at the level of code logic.

As shown in FIG. 3, in an unknown function 300, the bytecode "55" 301 corresponding to the operation code "push" 304 needs to be extracted, as do the prefix code "48" 302 and the bytecode "89" 305 corresponding to the operation code "mov". The operand "E5" 303 is not extracted and is removed.
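The extraction just described can be sketched as follows. The `DecodedInsn` record is a hypothetical stand-in for the output of a disassembler that has already separated each instruction into prefix, operation code, and operand bytes; it is not part of any particular library.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DecodedInsn:
    """Hypothetical disassembler output, already split into fields."""
    prefix: bytes    # prefix bytes (for example REX or lock), may be empty
    opcode: bytes    # operation code bytes
    operands: bytes  # operand encoding bytes, discarded as noise


def function_feature(insns: Iterable[DecodedInsn]) -> bytes:
    """Concatenate prefix and opcode bytes; operand bytes are dropped."""
    return b"".join(i.prefix + i.opcode for i in insns)


prologue = [
    DecodedInsn(prefix=b"", opcode=b"\x55", operands=b""),              # push rbp
    DecodedInsn(prefix=b"\x48", opcode=b"\x89", operands=b"\xe5"),      # mov rbp, rsp
]
feature = function_feature(prologue)  # b"\x55\x48\x89"
```

The two instructions contribute the bytes 55, 48, and 89 to the feature, while the operand byte E5 is left out.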

According to this method, the function characteristics of each function in each binary file can be extracted. For example, the function characteristic of one function may be determined to be:

\x55\x48\x89\x53\xbb\x48\x83\x48\x8b\x48\x83\x74\x0f\x1f\x48\x83\xff\x48\x8b\x48\x83\x75\x48\x83\x5b\x5d\xc3

As shown in FIG. 2, in the above manner the server may obtain the binary file 201 and then traverse the functions 202, that is, the functions identified in the binary file 201, performing feature extraction 203 to determine the function features of each function.

In addition, before feature extraction, the assembly code obtained from the binary file can be converted into a uniform intermediate language (IR), so that the influence of different architectures is eliminated. For example, x86, ARM, and MIPS assembly code can be uniformly converted into an intermediate language such as VEX IR or LLVM IR. To account for compiler optimization, the instructions can also be reconstructed so that the influence of differences between compilers and compilation options is eliminated; for example, the instructions are reconstructed according to a preset rule, such as a rule concerning the number of instructions, into instructions that satisfy that rule.
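The idea of architecture normalization can be hinted at with a deliberately simplified sketch. The mapping table below is hand-written and hypothetical; it is not VEX IR or LLVM IR, and a real system would lift instructions with a dedicated framework rather than rename mnemonics.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-written mapping from (architecture, mnemonic) to a
# uniform token; this only stands in for lifting to a real IR.
UNIFIED_OPS: Dict[Tuple[str, str], str] = {
    ("x86", "mov"): "MOVE", ("arm", "mov"): "MOVE", ("mips", "move"): "MOVE",
    ("x86", "add"): "ADD", ("arm", "add"): "ADD", ("mips", "addu"): "ADD",
}


def normalize(arch: str, mnemonics: List[str]) -> List[str]:
    """Map architecture-specific mnemonics to uniform tokens."""
    return [UNIFIED_OPS.get((arch, m), "OTHER") for m in mnemonics]


x86_tokens = normalize("x86", ["mov", "add"])
mips_tokens = normalize("mips", ["move", "addu"])
same_after_normalization = x86_tokens == mips_tokens
```

After normalization, the two sequences compare equal even though the original mnemonics differ by architecture.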

Disassembly can then be performed to extract the features.

103: the file characteristics corresponding to the binary file are determined according to the function characteristics.

Specifically, the function features in the binary file may be combined in a certain order to obtain the file features corresponding to the binary file.

Specifically, determining the file characteristics corresponding to the binary file according to the function characteristics includes: combining the function characteristics corresponding to the functions in the binary file according to the address sequence of the functions in the binary file to obtain the file characteristics.

The address sequence may refer to the sequence of addresses from small to large, or from large to small. Wherein, the address refers to the position for storing the corresponding function.

According to the foregoing, the function features of the functions are merged according to the address order, such as the order from small to large, to obtain corresponding merged function features, which are used as corresponding file features.
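A minimal sketch of this merging step is shown below. It assumes the function features are held in a dictionary keyed by function start address; the addresses and feature bytes are illustrative values only.

```python
from typing import Dict


def file_feature(function_features: Dict[int, bytes], descending: bool = False) -> bytes:
    """Concatenate function features in address order (ascending by default)."""
    ordered = sorted(function_features.items(), key=lambda kv: kv[0], reverse=descending)
    return b"".join(feat for _, feat in ordered)


# Illustrative input: start address -> function feature
features = {
    0x401200: b"\x55\x48\x83\xc3",
    0x401000: b"\x55\x48\x89\x5d\xc3",
}
merged = file_feature(features)
```

Sorting by address makes the result independent of the order in which the functions were discovered.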

Besides the above sequence, the function features can be combined according to the execution sequence of the functions, which is not described again.

As shown in FIG. 2, feature merging 204 is performed according to the scheme described above.

The method 100 further comprises: determining a file characteristic value corresponding to the file characteristic by applying a fuzzy hash algorithm to the file characteristic, so that step 104 can be executed.

The fuzzy hash algorithm is also called a content-based piecewise hash algorithm (commonly known as context triggered piecewise hashing). Fuzzy hashing uses a weak hash to compute over the local content of a file and slices the file wherever a specific condition is met; it then uses a strong hash to compute the hash value of each slice, takes part of each value and concatenates these parts, and combines the result with the slicing condition to form the fuzzy hash result.
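The slicing idea can be sketched in a few lines. This is a toy illustration and not the algorithm used by ssdeep or any other tool: the weak hash is simply the byte sum of a 7-byte sliding window, the strong hash is 32-bit FNV-1a, and the window size, block size, and alphabet are all choices made for this example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
WINDOW = 7  # size of the sliding window for the weak hash (assumed value)


def fnv1a(data: bytes) -> int:
    """32-bit FNV-1a, used here as the per-slice 'strong' hash."""
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h


def toy_fuzzy_hash(data: bytes, block_size: int = 8) -> str:
    """Slice where the weak hash hits a trigger value, then hash each slice."""
    digest = []
    start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        weak = sum(window)  # weak hash over local content only
        if weak % block_size == block_size - 1:  # slicing condition
            digest.append(ALPHABET[fnv1a(data[start:i + 1]) % 64])
            start = i + 1
    if start < len(data):  # remaining tail forms the last slice
        digest.append(ALPHABET[fnv1a(data[start:]) % 64])
    # The slicing condition (block size) is stored with the digest
    return f"{block_size}:{''.join(digest)}"


feature = bytes.fromhex("55488953bb4883488b4883740f1f4883ff488b48837548835b5dc3")
value = toy_fuzzy_hash(feature)
```

Because slice boundaries depend only on nearby bytes, a local change to the input alters only the digest characters of the affected slices, which is what makes the resulting values comparable by similarity rather than by equality.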

Alternatively, the corresponding characteristic value can be determined by the SimHash algorithm. As shown in FIG. 2, the feature value 205 is calculated from the obtained file feature.
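For the SimHash alternative, a compact sketch is given below. The choice of byte 4-grams as tokens, SHA-256 as the token hash, and a 64-bit fingerprint are assumptions made for this example and are not specified by the application.

```python
import hashlib


def simhash(data: bytes, ngram: int = 4, bits: int = 64) -> int:
    """SimHash over byte n-grams: each token votes on every fingerprint bit."""
    counts = [0] * bits
    for i in range(max(1, len(data) - ngram + 1)):
        token = data[i:i + ngram]
        h = int.from_bytes(hashlib.sha256(token).digest()[: bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is set when the majority of tokens voted for it
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar."""
    return bin(a ^ b).count("1")


fp_a = simhash(b"\x55\x48\x89\x53\xbb\x48\x83\x48\x8b\x48\x83\x74")
fp_b = simhash(b"\x55\x48\x89\x53\xbb\x48\x83\x48\x8b\x48\x83\x75")
distance = hamming(fp_a, fp_b)
```

Two fingerprints are then compared by their Hamming distance instead of by an edit distance.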

For example, as described above, the server may first determine the file feature value of the unknown binary file from its file feature by means of a fuzzy hash algorithm (more specifically, the ssdeep tool for computing fuzzy hashes). The server then determines, in the same manner, the file feature of a known malicious binary file or of another unknown binary file, and computes the corresponding file feature value with the fuzzy hash algorithm. The two feature values are compared to determine whether the unknown binary file is a malicious binary file, or whether the two files are similar.

Specifically, determining similarity between a plurality of binary files according to file feature values corresponding to file features includes: comparing the file characteristic value corresponding to the unknown binary file with the file characteristic value corresponding to the known binary file, and determining the similarity between the file characteristic values; and determining the similarity of the binary files according to the similarity between the characteristic values of the files.

The known file refers to a binary file of a known type, such as a malicious binary file, a normal binary file, and the like.

For example, according to the foregoing description, the server compares the file feature values of the two files and determines the similarity between them through an edit distance algorithm. When the similarity is greater than or equal to the threshold, the two binary files are determined to be similar; if the known file is malicious, the unknown binary file is determined to be a malicious binary file.

If the similarity is less than the threshold, the two binary files are not similar, and the unknown binary file is determined not to be a malicious binary file.
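The threshold comparison can be sketched with a plain Levenshtein distance. The normalization of the distance into a score between 0 and 1 and the threshold value of 0.8 are assumptions for this example; the application does not fix either.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


THRESHOLD = 0.8  # hypothetical value chosen for illustration


def is_similar(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    return similarity(a, b) >= threshold


close = is_similar("ABCDEFGHIJ", "ABCDEFGHIX")  # one substitution in ten
far = is_similar("ABCD", "WXYZ")                # nothing in common
```

In this sketch `close` is true and `far` is false, mirroring the two outcomes described above.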

It should be noted that whether the file characteristics of two binary files are similar can also be determined by means of a neural network model. The neural network model, such as a convolutional neural network model, may be trained on file feature samples of similar binary files and file feature samples of dissimilar binary files, and the trained model is then used to determine the similarity. When the similarity is greater than or equal to the threshold, the model may determine that the two files are similar; otherwise they are not.

It should be noted that, in the embodiment of the present application, since the data segment is removed and the operands are also removed from the code segment, the effect of compiler-related variables such as address change, address misalignment, and symbol change is eliminated, so that the fuzzy hash algorithm can be effectively applied to the similarity comparison of binary files. For the two malicious samples of the same family described above, the code of both samples has the same logic function: after the program runs, it establishes a reverse shell (that is, the control end monitors a TCP (Transmission Control Protocol)/UDP (User Datagram Protocol) port, and the controlled end initiates a request to that port and transfers the input and output of its command line to the control end). The two samples differ only in their configuration items, such as C&C domain names and online packet information, and in their symbols, which have been removed from one sample but not the other. When the fuzzy hashes calculated by the embodiment of the application are used to compare the two binary files, the similarity is high, and the two files are determined to be similar.

Therefore, malicious code detection can be performed through the embodiment of the application to determine whether the binary file corresponding to unknown code is malicious. The method can also be applied to vulnerability mining in open source code: the similarity between an unknown binary file and a binary file with a known vulnerability is determined, so as to determine whether the unknown binary file has the vulnerability. In addition, the method can be applied to electronic forensics, where the determined similarity is used to decide whether problems such as code plagiarism exist.

Furthermore, in addition to the above-described application, each function in the binary file may be determined in the manner described above, thereby identifying the respective function in the binary file and determining the composition of the binary file.

Specifically, the method 100 further includes: after the function characteristic is determined, determining a function characteristic value corresponding to the function characteristic of the function; and determining the similarity between the functions according to the function characteristic value corresponding to the unknown function and the function characteristic value corresponding to the known function.

The function feature value of each unknown function in the unknown binary file is determined in a similar manner as described above, e.g., fuzzy hashing. Then, the similarity is determined by comparing with the function characteristic value corresponding to the known function, so as to determine the type or name of each function. It will not be described in detail.

The specific comparison and identification may be as follows: the characteristic value corresponding to the unknown function is compared with the characteristic value corresponding to the known function to determine the similarity between the characteristic values, and the unknown function is identified according to that similarity.

Since similar implementations have been set forth above, they are not described in detail here. It is only noted that the similarity may also be determined by the edit distance algorithm described above.

Specifically, the method 100 further includes: and determining the function composition in the corresponding binary file according to the identified function.

Based on the identified functions, the server can learn the function composition of the binary file, that is, the components that make up the binary file.
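Function-level identification and the resulting composition can be sketched as follows. The known function names and feature values are invented for illustration, the threshold is an assumed value, and `difflib.SequenceMatcher` from the Python standard library stands in for the edit distance comparison.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional

# Hypothetical library of known functions: name -> function feature value
KNOWN: Dict[str, str] = {"memcpy": "3:AbCdEfGh", "strlen": "3:ZyXwVu"}


def identify(value: str, known: Dict[str, str], threshold: float = 0.8) -> Optional[str]:
    """Return the best matching known function name, or None below the threshold."""
    best_name, best_score = None, 0.0
    for name, known_value in known.items():
        score = SequenceMatcher(None, value, known_value).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def composition(values: Iterable[str], known: Dict[str, str]) -> Counter:
    """Count how many functions of the file map to each known function."""
    return Counter(identify(v, known) or "unknown" for v in values)


unknown_file = ["3:AbCdEfGh", "3:AbCdEfGx", "9:qqqqqqqq"]
parts = composition(unknown_file, KNOWN)
```

The resulting counter summarizes which known functions appear in the file and how many functions remain unidentified.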

Fig. 4 is a schematic structural diagram of a system for determining similarity between binary files according to an exemplary embodiment of the present application. As shown in fig. 4, the system 400 may include: a first device 401 and a second device 402.

The first device 401 may be a device that provides computing and processing services in a network virtual environment, and may determine the similarity of binary files over a network. In physical implementation, the first device 401 may be any device capable of providing a computing service, responding to a service request, and determining the similarity of binary files; for example, it may be a cloud server, a cloud host, a virtual center, or a conventional server on which a database is deployed. The first device 401 mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general computer architecture.

It should be noted that a specific implementation form of the first device 401 may be a physical device or a virtual device, which may be used for determining the similarity of the binary file.

In addition, the specific implementation form of the first device 401 may also be a distributed architecture composed of multiple physical devices or virtual devices; the distributed architecture can improve the throughput of processing tasks and makes it easy to scale out or in according to service demand.

The second device 402 may be a device with certain computing capability that can send data to the first device 401 and receive data sent by the first device 401. The basic structure of the second device 402 may include at least one processor; the number of processors depends on the configuration and type of the device. The device may also include memory, which may be volatile, such as RAM, or non-volatile, such as read-only memory (ROM) or flash memory, or may include both types. The memory typically stores an operating system (OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and the memory, the device also includes some basic components, such as a network card chip, an IO bus, a display component, and some peripheral devices. Optionally, the peripheral devices may include, for example, a keyboard and a stylus. Other peripheral devices are well known in the art and are not described in detail herein. Optionally, the second device 402 may be a smart terminal, such as a mobile phone, a desktop computer, a notebook computer, or a tablet computer.

Specifically, the first device 401 acquires a function in a plurality of binary files; determining function characteristics of the function based on the unchangeable information in the function; determining the file characteristics corresponding to the binary file according to the function characteristics; and determining the similarity among the binary files according to the file characteristic value corresponding to the file characteristic.

In addition, the second device 402 may send the binary file to the first device 401.

Specifically, the first device 401 acquires a plurality of binary files, and identifies function boundaries in the binary files; and dividing the corresponding functions in the binary file according to the identified function boundaries to obtain the functions.

In addition, the first device 401 removes changeable information in the function and takes other information in the function as unchangeable information to determine the function characteristics.

Specifically, the first device 401 generates a function feature according to a code segment corresponding to the function.

Specifically, the first device 401 removes operands in the code segment and takes other information in the code segment as a function characteristic.

Specifically, the first device 401 extracts the prefix code and the operation code in the code segment as the function feature.
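The operand-removal step can be illustrated with a minimal sketch. It assumes the code segment has already been disassembled and that each instruction has been split into prefix, opcode, and operand bytes; the `Instruction` type, the `function_feature` helper, and the x86-64 sample bytes are hypothetical and serve only as illustration. Treating the ModRM byte as part of the operands is likewise an assumption.

```python
from typing import Iterable, NamedTuple


class Instruction(NamedTuple):
    """One decoded instruction (hypothetical type; the split is assumed
    to come from a disassembler)."""
    prefix: bytes    # e.g. REX or operand-size prefixes
    opcode: bytes    # the operation code itself
    operands: bytes  # ModRM, displacement, immediate bytes (treated as changeable)


def function_feature(instructions: Iterable[Instruction]) -> bytes:
    """Drop the operands and keep prefix + opcode bytes in instruction order."""
    return b"".join(ins.prefix + ins.opcode for ins in instructions)


# Two builds of the same function that differ only in the call target.
build_a = [
    Instruction(b"", b"\x55", b""),                  # push rbp
    Instruction(b"\x48", b"\x89", b"\xe5"),          # mov rbp, rsp
    Instruction(b"", b"\xe8", b"\x10\x00\x00\x00"),  # call rel32
    Instruction(b"", b"\xc3", b""),                  # ret
]
build_b = [
    Instruction(b"", b"\x55", b""),
    Instruction(b"\x48", b"\x89", b"\xe5"),
    Instruction(b"", b"\xe8", b"\x7c\x02\x00\x00"),  # different call target
    Instruction(b"", b"\xc3", b""),
]

feature_a = function_feature(build_a)
feature_b = function_feature(build_b)
```

Because the call displacement is discarded, both builds yield the same function feature even though their raw bytes differ.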

Specifically, the first device 401 obtains a plurality of function samples and determines a function structure according to the plurality of function samples, and then divides function boundaries in the binary file according to the function structure.
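One way to picture boundary identification from function samples is sketched below. It assumes, purely for illustration, that the "function structure" is the most common leading byte pattern (prologue) among the samples; the text here does not specify how the structure is derived, and the helper names `learn_prologue` and `split_functions` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def learn_prologue(samples: List[bytes], length: int = 4) -> bytes:
    """Return the most common leading byte pattern among the function samples."""
    counts = Counter(s[:length] for s in samples if len(s) >= length)
    return counts.most_common(1)[0][0]


def split_functions(code: bytes, prologue: bytes) -> List[Tuple[int, bytes]]:
    """Treat every occurrence of the prologue as a function start and cut there."""
    starts = []
    pos = code.find(prologue)
    while pos != -1:
        starts.append(pos)
        pos = code.find(prologue, pos + 1)
    ends = starts[1:] + [len(code)]
    return [(start, code[start:end]) for start, end in zip(starts, ends)]


# Hypothetical function samples; two of the three share the same prologue.
samples = [
    bytes.fromhex("554889e5c3"),
    bytes.fromhex("554889e531c05dc3"),
    bytes.fromhex("4883ec08c3"),
]
prologue = learn_prologue(samples)

# A code section made of two functions placed back to back.
code = bytes.fromhex("554889e590c3" "554889e531c05dc3")
functions = split_functions(code, prologue)
```

A pattern scan of this kind is only a rough heuristic; a practical boundary detector would need to handle functions without a standard prologue and prologue-like bytes that occur inside data.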

Specifically, the first device 401 merges the function features corresponding to the functions in the binary file according to the address sequence of the functions in the corresponding binary file, so as to obtain the file features.
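The merging step can be sketched as follows, assuming the function features are held in a mapping from function start address to feature bytes. The `file_feature` helper and the sample addresses are hypothetical.

```python
from typing import Dict


def file_feature(function_features: Dict[int, bytes]) -> bytes:
    """Concatenate per-function features in ascending address order."""
    return b"".join(function_features[addr] for addr in sorted(function_features))


# Features listed out of order on purpose; the merge sorts them by address.
features = {
    0x401200: b"\x55\x48\x89\xc3",
    0x401000: b"\x55\xe8\xc3",
    0x401100: b"\x31\xc3",
}
merged = file_feature(features)
```

Sorting by address makes the file feature independent of the order in which the functions were discovered.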

In addition, the first device 401 determines a file feature value corresponding to the file feature through a fuzzy hash algorithm and the file feature, so that the similarity between the plurality of binary files is determined according to the file feature value corresponding to the file feature.

Specifically, the first device 401 compares the file characteristic value corresponding to an unknown binary file with the file characteristic value corresponding to a known binary file to determine the similarity between the file characteristic values, and then determines the similarity of the binary files according to the similarity between the file characteristic values.
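The following sketch shows the general shape of computing and comparing fuzzy hash values. It is a deliberately simplified stand-in and not the ssdeep algorithm: the file feature is cut into pieces wherever a small rolling sum hits a trigger value, each piece contributes one character to the signature, and two signatures are scored on a 0 to 100 scale with Python's `difflib`. The names `toy_fuzzy_hash` and `similarity`, the window and modulus parameters, and the sample data are all assumptions.

```python
import hashlib
from difflib import SequenceMatcher


def toy_fuzzy_hash(data: bytes, window: int = 4, modulus: int = 8) -> str:
    """Cut data where a rolling sum hits a trigger value, then map each
    piece to one hex character (simplified stand-in, not ssdeep)."""
    pieces, start = [], 0
    for i in range(len(data)):
        rolling = sum(data[max(0, i - window + 1): i + 1])
        if rolling % modulus == modulus - 1:
            pieces.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])
    return "".join(hashlib.sha256(p).hexdigest()[0] for p in pieces)


def similarity(sig_a: str, sig_b: str) -> int:
    """Score two signatures from 0 (unrelated) to 100 (identical)."""
    return round(100 * SequenceMatcher(None, sig_a, sig_b).ratio())


# Hypothetical file features: the "unknown" one has three bytes inserted.
known = bytes((i * 37 + 11) % 251 for i in range(512))
unknown = known[:200] + b"\x00\x01\x02" + known[200:]

sig_known = toy_fuzzy_hash(known)
sig_unknown = toy_fuzzy_hash(unknown)
score = similarity(sig_known, sig_unknown)
```

Because the cut points depend only on a few neighbouring bytes, a local change tends to alter only the pieces around it, which is what lets this family of hashes express partial similarity rather than a plain equal or not-equal result.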

In addition, after determining the function characteristic, the first device 401 determines a function characteristic value corresponding to the function characteristic of the function, and determines the similarity between functions according to the function characteristic value corresponding to an unknown function and the function characteristic value corresponding to a known function.
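Function-level matching can be sketched in the same spirit. The example below assumes, for simplicity, that the function characteristic value is a SHA-256 digest of the function feature, so the comparison is an exact match rather than a graded similarity; a fuzzy hash could be substituted where a graded score is wanted. The `feature_value` and `match_function` helpers and the label `known_sample.decrypt_payload` are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def feature_value(function_feature: bytes) -> str:
    """Map a function feature to a fixed-length value (assumed: SHA-256)."""
    return hashlib.sha256(function_feature).hexdigest()


# Library of known functions: feature value -> descriptive label.
known_functions: Dict[str, str] = {
    feature_value(bytes.fromhex("554889e8c3")): "known_sample.decrypt_payload",
}


def match_function(function_feature: bytes) -> Optional[str]:
    """Return the label of the known function with the same feature value, if any."""
    return known_functions.get(feature_value(function_feature))


hit = match_function(bytes.fromhex("554889e8c3"))
miss = match_function(b"\x90")
```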

In the binary file identification scenario, an unknown binary file may be obtained manually on a second device 402, such as a computer, and then sent to a first device 401, such as a server, in a manually triggered manner. Step 411 is executed: sending the binary file to be identified. After acquiring the binary files to be identified, the server may treat them as unknown binary files and acquire a plurality of unknown functions from each unknown binary file.

The server can remove the operands in the code segment, thereby removing the data segment in the function, and then extract the bytecode corresponding to the prefix codes and operation codes in the code segment as the function characteristic. The server merges the function characteristics of the functions in address order, for example from the smallest address to the largest, and takes the merged function characteristics as the corresponding file characteristic. The server then determines the characteristic value corresponding to the file characteristic through a fuzzy hash algorithm. For example, the server may first determine the characteristic value of the file characteristic of the unknown binary file with a fuzzy hash algorithm (more specifically, with ssdeep, a tool for computing fuzzy hashes). The server then determines the file characteristic of a known malicious binary file in the same way and determines the corresponding characteristic value through the fuzzy hash algorithm. The two characteristic values are compared to determine whether the unknown binary file is a malicious binary file. Finally, the server returns the identification result to the computer, that is, step 412 is executed: sending the identification result.
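The final comparison and the identification result returned in step 412 might look like the sketch below. It assumes the characteristic values are already available as strings, scores them with `difflib` as a stand-in for a fuzzy hash comparison (not the ssdeep scoring algorithm), and applies a threshold of 80 that is chosen only for illustration. The `identify` helper, the family names, and the sample values are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict


def identify(unknown_value: str, known_malicious: Dict[str, str],
             threshold: int = 80) -> dict:
    """Compare one unknown feature value with known malicious ones and
    build an identification result."""
    def score(a: str, b: str) -> int:
        # Stand-in scoring on a 0-100 scale; not the ssdeep comparison.
        return round(100 * SequenceMatcher(None, a, b).ratio())

    best_name, best_score = None, 0
    for name, value in known_malicious.items():
        current = score(unknown_value, value)
        if current > best_score:
            best_name, best_score = name, current
    return {"match": best_name, "score": best_score,
            "malicious": best_score >= threshold}


known_malicious = {"family_a": "3a9f7c21bd", "family_b": "e4e4e4e4e4"}
result = identify("3a9f7c21bd", known_malicious)
```

The threshold trades false positives against missed variants, so in practice it would be tuned on labelled samples rather than fixed in advance.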

For the content not described in detail herein, reference may be made to the content described above, and thus, the description thereof is omitted.

In the embodiment described above, the first device 401 and the second device 402 are connected via a network. If the first device 401 and the second device 402 are communicatively connected through a mobile network, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, 5G, and so on.

Fig. 5 is a schematic structural framework diagram of an apparatus for determining similarity between binary files according to an exemplary embodiment of the present application. The apparatus 500 may be applied to a computing device, such as a server. The apparatus 500 comprises: an obtaining module 501, a determining module 502 and an identifying module 503. The functions of the modules are described in detail below:

An obtaining module 501, configured to obtain functions in multiple binary files.

A determining module 502, configured to determine a function characteristic of the function based on the unchangeable information in the function.

The determining module 502 is configured to determine a file feature corresponding to the binary file according to the function feature.

The identifying module 503 is configured to determine similarity between the binary files according to the file feature values corresponding to the file features.

Specifically, the obtaining module 501 includes: an acquisition unit, configured to acquire a plurality of binary files and identify function boundaries in the binary files; and a dividing unit, configured to divide the corresponding functions in the binary files according to the identified function boundaries to obtain the functions.

Specifically, the determining module 502 is configured to remove changeable information in the function and use other information in the function as unchangeable information to determine the function characteristic.

Specifically, the determining module 502 is configured to generate a function feature according to a code segment corresponding to a function.

Specifically, the determining module 502 is configured to remove operands in the code segment and use other information in the code segment as the function characteristics.

Specifically, the determining module 502 is configured to extract a prefix code and an operation code in a code segment as a function feature.

Specifically, the dividing unit is configured to obtain a plurality of function samples, and determine a function structure according to the plurality of function samples; and dividing function boundaries in the binary file according to the function structure.

Specifically, the determining module 502 is configured to combine the function features corresponding to the functions in the binary file according to the address sequence of the functions in the corresponding binary file to obtain the file features.

In addition, the determining module 502 is further configured to determine a file feature value corresponding to the file feature through a fuzzy hash algorithm and the file feature, so that the similarity between the plurality of binary files is determined according to the file feature value corresponding to the file feature.

Specifically, the identifying module 503 is configured to compare a file characteristic value corresponding to an unknown binary file with a file characteristic value corresponding to a known binary file, and determine a similarity between the file characteristic values; and determining the similarity of the binary files according to the similarity between the characteristic values of the files.

In addition, the determining module 502 is further configured to determine a function characteristic value corresponding to the function characteristic of the function after determining the function characteristic; the identifying module 503 is further configured to determine similarity between the functions according to the function characteristic value corresponding to the unknown function and the function characteristic value corresponding to the known function.

For the content of the apparatus 500 that is not detailed above, reference is made to the above description, and thus, the description is not repeated.
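The division of work among the three modules can be pictured with a small wiring sketch. The `Apparatus` class, its field names, and the trivial stand-in callables are hypothetical; they only show how the roles of the obtaining module 501, the determining module 502 and the identifying module 503 could be composed, not how each step is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict

FunctionMap = Dict[int, bytes]  # function start address -> function bytes


@dataclass
class Apparatus:
    obtain: Callable[[bytes], FunctionMap]      # role of obtaining module 501
    function_feature: Callable[[bytes], bytes]  # role of determining module 502
    feature_value: Callable[[bytes], str]       # also determining module 502
    compare: Callable[[str, str], int]          # role of identifying module 503

    def file_value(self, binary: bytes) -> str:
        """Obtain functions, merge their features by address, and hash the result."""
        functions = self.obtain(binary)
        merged = b"".join(self.function_feature(functions[addr])
                          for addr in sorted(functions))
        return self.feature_value(merged)

    def similarity(self, binary_a: bytes, binary_b: bytes) -> int:
        """Compare the file feature values of two binaries."""
        return self.compare(self.file_value(binary_a), self.file_value(binary_b))


# Toy stand-ins: one function per file, every second byte treated as an operand.
apparatus = Apparatus(
    obtain=lambda binary: {0: binary},
    function_feature=lambda func: func[::2],
    feature_value=lambda merged: merged.hex(),
    compare=lambda a, b: 100 if a == b else 0,
)
same = apparatus.similarity(b"\x55\x10\xc3\x20", b"\x55\x99\xc3\x77")
different = apparatus.similarity(b"\x55\x10\xc3\x20", b"\x31\x10\xc3\x20")
```

Keeping each role behind a callable lets a real disassembler or fuzzy hash replace a stand-in without changing the overall flow.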

The internal functions and structure of the apparatus 500 shown in Fig. 5 are described above. In one possible design, the structure of the apparatus 500 shown in Fig. 5 may be implemented as a computing device, such as a server. As shown in Fig. 6, the device 600 may include: a memory 601 and a processor 602.

The memory 601 is configured to store a computer program.

The processor 602 is configured to execute the computer program to: acquire functions in a plurality of binary files; determine function characteristics of the functions based on the unchangeable information in the functions; determine the file characteristics corresponding to each binary file according to the function characteristics; and determine the similarity among the binary files according to the file characteristic values corresponding to the file characteristics.

Specifically, the processor 602 is configured to: acquire a plurality of binary files and identify function boundaries in the binary files; and divide the corresponding functions in the binary files according to the identified function boundaries to obtain the functions.

Specifically, the processor 602 is configured to: remove changeable information in the function and take the other information in the function as unchangeable information to determine the function characteristic.

Specifically, the processor 602 is configured to: generate the function characteristic according to the code segment corresponding to the function.

Specifically, the processor 602 is configured to: remove operands in the code segment and take the other information in the code segment as the function characteristic.

Specifically, the processor 602 is configured to: extract the prefix code and the operation code in the code segment as the function characteristic.

Specifically, the processor 602 is configured to: obtain a plurality of function samples and determine a function structure according to the plurality of function samples; and divide function boundaries in the binary file according to the function structure.

Specifically, the processor 602 is configured to: merge the function characteristics corresponding to the functions in the binary file according to the address order of the functions in the binary file to obtain the file characteristic.

Further, the processor 602 is further configured to: determine a file characteristic value corresponding to the file characteristic through a fuzzy hash algorithm and the file characteristic, so that the step of determining the similarity among the plurality of binary files according to the file characteristic value corresponding to the file characteristic can be executed.

Specifically, the processor 602 is configured to: compare the file characteristic value corresponding to an unknown binary file with the file characteristic value corresponding to a known binary file to determine the similarity between the file characteristic values; and determine the similarity of the binary files according to the similarity between the file characteristic values.

Further, the processor 602 is further configured to: after determining the function characteristic, determine a function characteristic value corresponding to the function characteristic of the function; and determine the similarity between functions according to the function characteristic value corresponding to an unknown function and the function characteristic value corresponding to a known function.

Embodiments of the present invention provide a computer storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to implement the steps of the method for determining similarity between binary files in the method embodiments of Figs. 1-3. Details are not repeated here.

In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 101, 102, 103, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course can also be implemented by a combination of hardware and software. With this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a computer program product, which may be implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable multimedia data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable multimedia data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable multimedia data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable multimedia data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for determining similarity between binary files is characterized by comprising the following steps:

acquiring functions in a plurality of binary files;

determining function characteristics of the function based on the unchangeable information in the function;

determining the file characteristics corresponding to the binary file according to the function characteristics;

and determining the similarity among the plurality of binary files according to the file characteristic value corresponding to the file characteristic.

2. The method of claim 1, wherein obtaining the functions in the plurality of binary files comprises:

acquiring a plurality of binary files, and identifying function boundaries in the binary files;

and dividing the corresponding functions in the corresponding binary files according to the identified function boundaries to obtain the functions.

3. The method of claim 1, further comprising:

and removing changeable information in the function and taking other information in the function as unchangeable information to determine the function characteristic.

4. The method of claim 1 or 3, wherein determining the function characteristics of the function based on the unchangeable information in the function comprises:

generating the function characteristics according to the code segment corresponding to the function.

5. The method of claim 4, wherein generating the function characteristics according to the code segment corresponding to the function comprises:

removing operands in the code segment and taking other information in the code segment as the function characteristics.

6. The method of claim 5, wherein taking other information in the code segment as the function characteristics comprises:

extracting the prefix code and the operation code in the code segment as the function characteristics.

7. The method of claim 2, wherein the identifying function boundaries in the binary file comprises:

obtaining a plurality of function samples, and determining a function structure according to the plurality of function samples;

and dividing function boundaries in the binary file according to the function structure.

8. The method of claim 1, wherein determining the file characteristics corresponding to the binary file according to the function characteristics comprises:

and combining the function characteristics corresponding to the functions in the binary file according to the address sequence of the functions in the corresponding binary file to obtain the file characteristics.

9. The method of claim 1, further comprising:

and determining a file characteristic value corresponding to the file characteristic through a fuzzy hash algorithm and the file characteristic, so that the step of determining the similarity among the plurality of binary files according to the file characteristic value corresponding to the file characteristic is executed.

10. The method according to claim 1, wherein determining the similarity between the plurality of binary files according to the file feature values corresponding to the file features comprises:

comparing the file characteristic value corresponding to the unknown binary file with the file characteristic value corresponding to the known binary file, and determining the similarity between the file characteristic values;

and determining the similarity of the binary files according to the similarity between the characteristic values of the files.

11. The method of claim 1, further comprising:

after determining the function characteristics, determining function characteristic values corresponding to the function characteristics of the function;

and determining similarity between the functions according to the function characteristic value corresponding to the unknown function and the function characteristic value corresponding to the known function.

12. A computing device, comprising: a memory, a processor;

the memory for storing a computer program;

the processor executing the computer program to:

acquiring functions in a plurality of binary files;

13. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform the steps of the method of any one of claims 1-11.