CN109977976B - Executable file similarity detection method and device and computer equipment - Google Patents

Info

Publication number
CN109977976B
Authority
CN
China
Prior art keywords
executable file
files
executable
similarity
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711460533.8A
Other languages
Chinese (zh)
Other versions
CN109977976A (en)
Inventor
罗元海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711460533.8A priority Critical patent/CN109977976B/en
Publication of CN109977976A publication Critical patent/CN109977976A/en
Application granted granted Critical
Publication of CN109977976B publication Critical patent/CN109977976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and an apparatus for detecting the similarity of executable files, and a computer device. The method comprises: acquiring a first executable file and a second executable file; disassembling the first executable file and the second executable file; extracting the operation codes from the disassembled files of the first executable file and the second executable file, and encoding the extracted operation codes; filtering the encoded operation codes to obtain corresponding bit arrays; calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file; and, if the similarity is greater than a predetermined threshold, determining that the first executable file and the second executable file are similar files. The method better resists the interference that virus authors or software plagiarists introduce by modifying the source code or the decompiled code, makes it considerably harder for a virus or plagiarized program to evade detection, and reduces false negatives and false positives.

Description

Method and device for detecting similarity of executable files and computer equipment
Technical Field
The present application relates to the field of information security technologies, and in particular, to a method and an apparatus for detecting similarity of executable files, and a computer device.
Background
Traditional file similarity detection is based on source code. For situations in which the source code is difficult to obtain, executable file comparison techniques have been proposed and are receiving increasing attention; similarity calculation for executable files plays an important role especially in software plagiarism detection, piracy detection and virus detection.
In the related art, methods for calculating the similarity of executable files are based on the raw bytes or character strings of the executable files, and the similarity is calculated by direct comparison or by a fuzzy hash algorithm.
On the one hand, directly using the raw bytes or character strings of the executable file makes the similarity algorithm too sensitive: samples may be judged dissimilar as soon as they change slightly, so virus authors or software pirates can bypass detection by slightly modifying the source code or the decompiled code. On the other hand, a fuzzy hash function compresses the information heavily, which may make the similarity algorithm too insensitive to differences between executable files, resulting in false negatives or false positives in practical applications. In addition, if the data is compared directly without compression, the time and space consumption can be very large.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a method and a device for detecting the similarity of executable files and computer equipment.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, an embodiment of the present application provides a method for detecting similarity of executable files, including: acquiring a first executable file and a second executable file; disassembling the first executable file and the second executable file to respectively obtain disassembled files of the first executable file and the second executable file; extracting operation codes in the disassembled files of the first executable file and the second executable file, and encoding the extracted operation codes; filtering the encoded operation codes to obtain bit arrays corresponding to the first executable file and the second executable file; calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file; and if the similarity is greater than a predetermined threshold, determining that the first executable file and the second executable file are similar files.
In the method for detecting the similarity of executable files, after a first executable file and a second executable file are acquired, the two files are disassembled to obtain their respective disassembled files; the operation codes in the disassembled files are then extracted and encoded, and the encoded operation codes are filtered to obtain bit arrays corresponding to the first executable file and the second executable file. Finally, the similarity of the two bit arrays is calculated, and if the similarity is greater than a predetermined threshold, the first executable file and the second executable file are determined to be similar files. On the one hand, the method takes the essential logic of the program into account, so it better resists the interference introduced when a virus author or software plagiarist modifies the source code or the decompiled code, and makes it considerably harder for a virus or plagiarized program to evade detection. On the other hand, the method makes full use of the space efficiency of the bloom filter, so that feature loss is small while computing performance remains high, and false negatives and false positives are reduced.
In a second aspect, an embodiment of the present application provides an apparatus for detecting similarity of executable files, including: an acquiring module, configured to acquire a first executable file and a second executable file; a disassembling module, configured to disassemble the first executable file and the second executable file to respectively obtain disassembled files of the first executable file and the second executable file; an extracting module, configured to extract the operation codes in the disassembled files of the first executable file and the second executable file; an encoding module, configured to encode the operation codes extracted by the extracting module; an obtaining module, configured to filter the operation codes encoded by the encoding module to obtain bit arrays corresponding to the first executable file and the second executable file; a calculating module, configured to calculate the similarity of the bit arrays corresponding to the first executable file and the second executable file; and a determining module, configured to determine that the first executable file and the second executable file are similar files when the similarity calculated by the calculating module is greater than a predetermined threshold.
In the apparatus for detecting the similarity of executable files, after the acquiring module acquires a first executable file and a second executable file, the disassembling module disassembles the two files to obtain their respective disassembled files; the extracting module then extracts the operation codes in the disassembled files, the encoding module encodes the extracted operation codes, and the obtaining module filters the encoded operation codes to obtain bit arrays corresponding to the first executable file and the second executable file. Finally, the calculating module calculates the similarity of the two bit arrays, and if the similarity is greater than a predetermined threshold, the determining module determines that the first executable file and the second executable file are similar files. On the one hand, the apparatus takes the essential logic of the program into account, so it better resists the interference introduced when a virus author or software pirate modifies the source code or the decompiled code, and makes it considerably harder for a virus or plagiarized program to evade detection. On the other hand, the apparatus makes full use of the space efficiency of the bloom filter, so that feature loss is small while computing performance remains high, and false negatives and false positives are reduced.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the method described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method as described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart of an embodiment of a method for detecting similarity of executable files according to the present application;
FIG. 2 is a diagram illustrating an embodiment of disassembling an executable file according to the method for detecting similarity of executable files of the present application;
FIG. 3 is a flowchart of another embodiment of a method for detecting similarity of executable files according to the present application;
FIG. 4 is a schematic diagram of an embodiment of a bloom filter initially constructed in the method for detecting similarity of executable files of the present application;
FIG. 5 is a diagram illustrating an embodiment of a bloom filter bit array in the method for detecting similarity of executable files according to the present application;
FIG. 6 is a flowchart illustrating a method for detecting similarity of executable files according to yet another embodiment of the present application;
FIG. 7 is a flowchart illustrating a method for detecting similarity of executable files according to still another embodiment of the present disclosure;
FIG. 8 is a flowchart illustrating a method for detecting similarity of executable files according to yet another embodiment of the present application;
FIG. 9 is a schematic structural diagram illustrating an embodiment of an apparatus for detecting similarity of executable files according to the present application;
FIG. 10 is a schematic structural diagram of another embodiment of an apparatus for detecting similarity of executable files according to the present application;
FIG. 11 is a schematic block diagram of an embodiment of a computer apparatus of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the concepts of the application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Fig. 1 is a flowchart of an embodiment of a method for detecting similarity of executable files in the present application, and as shown in fig. 1, the method for detecting similarity of executable files may include:
step 101, a first executable file and a second executable file are obtained.
Step 102, disassembling the first executable file and the second executable file to respectively obtain disassembled files of the first executable file and the second executable file.
Specifically, the format of the executable file is different according to different operating systems, for example, the executable file under the Windows operating system is in an exe format, the executable file under the Linux operating system is in an elf format, and the executable file under the Android operating system is in a dex format or an elf format. An executable file generally exists in a binary form, and an Interactive Disassembler (IDA) or other tools may be used to disassemble the executable file to obtain a disassembled file in an asm format, as shown in fig. 2, where fig. 2 is a schematic diagram of an embodiment of disassembling the executable file in the method for detecting similarity of executable files of the present application.
Step 103, extracting the operation codes (hereinafter also referred to as opcodes) in the disassembled files of the first executable file and the second executable file, and encoding the extracted operation codes.
In particular, a typical assembly instruction includes an operation code (opcode) and zero or more operands. The operation code may be represented by a mnemonic such as "MOV" or "PUSH", and an operand may be a register, a constant, a memory address, or the like. In practice, operands exhibit a certain randomness after recompilation and may change with different compilation and optimization strategies, whereas the operation code is the part that represents the semantics of the code and is relatively stable, generally remaining unchanged. Therefore, the present embodiment extracts the operation codes in the disassembled file of the first executable file and the disassembled file of the second executable file as the basis of the similarity calculation.
The extracted operation codes are then encoded. In a specific implementation, various encoding methods may be used, for example, directly using the original operation code, or calculating a hash or fuzzy hash of the operation code, as long as the codes generated for the same or similar functions of the two source codes are consistent; the encoding method adopted is not limited in this embodiment.
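The extraction and encoding described in step 103 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a simplified listing in which each instruction line starts with the mnemonic, comments start with `;`, and labels end with `:`. The helper names `extract_opcodes` and `encode_opcodes` and the choice of SHA-256 are hypothetical; the text only requires that equivalent functions produce consistent codes.

```python
import hashlib


def extract_opcodes(asm_lines):
    """Return the mnemonic of each instruction line, discarding operands."""
    opcodes = []
    for line in asm_lines:
        code = line.split(";", 1)[0].strip()  # drop trailing comments
        if not code or code.endswith(":"):    # skip blank lines and labels
            continue
        opcodes.append(code.split()[0].upper())
    return opcodes


def encode_opcodes(opcodes):
    """Encode an opcode sequence as a hash (one of the options the text lists)."""
    return hashlib.sha256(" ".join(opcodes).encode("utf-8")).hexdigest()


# Two fragments that differ only in operands yield the same encoding.
a = ["mov eax, 1", "push ebx  ; save", "loc_1:", "add eax, ebx"]
b = ["mov ecx, 7", "push edx", "add ecx, edx"]
print(extract_opcodes(a))
print(encode_opcodes(extract_opcodes(a)) == encode_opcodes(extract_opcodes(b)))
```

Because only the mnemonics are kept, a change of registers or constants after recompilation does not alter the encoding, which is the stability property the embodiment relies on.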
Step 104, filtering the encoded operation codes to obtain bit arrays corresponding to the first executable file and the second executable file.
Step 105, calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file.
Through the above steps, the respective bit arrays of the first executable file and the second executable file whose similarity is to be calculated are obtained, so that detecting the similarity of the two files is converted into calculating the similarity of the two bit arrays: the more similar the two files are, the more 1s their bit arrays have in common.
Step 106, if the similarity is greater than a predetermined threshold, determining that the first executable file and the second executable file are similar files.
If the similarity is less than or equal to the predetermined threshold, it is determined that the first executable file and the second executable file are not similar files.
The predetermined threshold may be set according to system performance and/or implementation requirements, and the size of the predetermined threshold is not limited in this embodiment.
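The threshold decision of step 106 can be sketched as below. The source does not fix how a similarity score is derived from the bit arrays, so this sketch assumes one possible choice: the fraction of bit positions on which the two arrays agree (one minus the normalized Hamming distance). The function name `is_similar` and the default threshold of 0.8 are hypothetical.

```python
def is_similar(bits_a, bits_b, m, threshold=0.8):
    """Decide similarity of two m-bit arrays stored as Python integers.

    Assumption: similarity = 1 - hamming_distance / m. The patent leaves
    both the similarity measure and the threshold open.
    """
    distance = bin(bits_a ^ bits_b).count("1")  # number of differing bits
    similarity = 1.0 - distance / m
    return similarity > threshold               # strictly greater, as in step 106


print(is_similar(0b100111, 0b100111, m=6))  # identical arrays
print(is_similar(0b100111, 0b101010, m=6))  # arrays differing in 3 of 6 bits
```

The comparison is strict, matching the text: a similarity equal to the threshold is treated as not similar.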
In the method for detecting the similarity of executable files, after a first executable file and a second executable file are acquired, the two files are disassembled to obtain their respective disassembled files; the operation codes in the disassembled files are then extracted and encoded, and the encoded operation codes are filtered to obtain bit arrays corresponding to the first executable file and the second executable file. Finally, the similarity of the two bit arrays is calculated, and if the similarity is greater than a predetermined threshold, the first executable file and the second executable file are determined to be similar files. On the one hand, the method takes the essential logic of the program into account, so it better resists the interference introduced when a virus author or software plagiarist modifies the source code or the decompiled code, and makes it considerably harder for a virus or plagiarized program to evade detection. On the other hand, the method makes full use of the space efficiency of the bloom filter, so that feature loss is small while computing performance remains high, and false negatives and false positives are reduced.
Fig. 3 is a flowchart of another embodiment of the method for detecting similarity of executable files in the present application, and as shown in fig. 3, before step 104 in the embodiment shown in fig. 1 in the present application, the method may further include:
step 301, a bloom filter is constructed.
Step 301 may be executed in parallel with steps 101 to 103 or sequentially with them; the execution order of step 301 relative to steps 101 to 103 is not limited in this embodiment, and fig. 3 shows, as an example, step 301 being executed after step 101.
The bloom filter is a space-efficient probabilistic data structure that represents a set very compactly using a bit array and can judge whether an element belongs to the set.
A bloom filter is a bit array comprising m bits. To represent a set of n elements, S = {x_1, x_2, …, x_n}, the bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1, …, m}. For any element x, the position h_i(x) mapped by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 several times, only the first time has an effect; the later ones change nothing. To judge whether y belongs to the set, the k hash functions are applied to y; if all positions h_i(y) are 1 (1 ≤ i ≤ k), y is determined to be an element of the set; otherwise, y is determined not to be an element of the set.
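The insertion and membership test just described can be sketched as follows. This is an illustration under stated assumptions: the class name `BloomFilter` is hypothetical, positions are indexed 0 to m-1 rather than 1 to m, and the k hash functions are simulated by salting SHA-256 with the index i, which only approximates the mutually independent hash functions the text assumes.

```python
import hashlib


class BloomFilter:
    """An m-bit array with k hash positions per element."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially every bit is 0

    def _positions(self, item):
        # Assumption: derive h_i(item) by hashing "i:item" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting a bit that is already 1 changes nothing

    def __contains__(self, item):
        # True if every mapped position is 1 (may be a false positive).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.add("MOV PUSH ADD")
print("MOV PUSH ADD" in bf)
```

An element that was added is always reported as present; an element that was never added is usually reported as absent, but may be reported as present when all of its positions happen to have been set by other elements.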
The false positive rate analysis and parameter selection of the bloom filter are described as follows. In the representation of the bloom filter, the probability that a given position is set to 1 by a single hash operation is $\frac{1}{m}$, and the probability that it remains 0 is $1 - \frac{1}{m}$.
The hash function is performed kn times, so at the end of the operation, the probability that a bit is still 0 is:
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \tag{1}$$
Therefore, the probability of misjudgment is:
$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \tag{2}$$
For a given n and m, differentiating equation (2) with respect to k and setting the derivative to zero shows that the probability of misjudgment is minimal when
$$k = \frac{m}{n}\ln 2$$
and at this point:
$$f = \left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \approx 0.6185^{\,m/n} \tag{3}$$
Assuming that the estimated total number of operation codes of each executable file is n and that the acceptable misjudgment rate is given, the number of bits m of the bit array of the bloom filter can be obtained according to formula (3), and the number k of hash functions can then be obtained from m and n. At this point all parameters have been calculated, and once k hash functions are selected, the bloom filter is established. Initially, every bit of the bit array of the bloom filter is set to 0, as shown in fig. 4, which is a schematic diagram of an embodiment of the bloom filter initially constructed in the method for detecting similarity of executable files of the present application.
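The parameter selection can be illustrated with the standard bloom filter sizing relations, m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2, where p is the target misjudgment rate. The function name `bloom_parameters` and the sample values of n and p are hypothetical, and rounding m up and k to the nearest integer is one reasonable choice rather than something the source prescribes.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and target false positive rate p."""
    # Solve p = (1/2)^((m/n) * ln 2) for m, then round up to whole bits.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Optimal number of hash functions for this m and n, at least one.
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = bloom_parameters(n=1000, p=0.01)
print(m, k)  # 9586 7
```

For 1000 expected elements and a 1% misjudgment rate this gives roughly 9.6 bits per element and 7 hash functions.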
Thus, in step 104, the filtering process performed on the encoded operation code may be: and filtering the coded operation codes through a pre-constructed bloom filter to obtain bloom filter bit arrays corresponding to the first executable file and the second executable file.
In a specific implementation, the filtering process performed on the encoded operation code by the pre-constructed bloom filter may be: the encoded operation codes are all added to the pre-constructed bloom filter according to the algorithm of the bloom filter, so that a bloom filter bit array corresponding to the encoded operation codes can be obtained, as shown in fig. 5, where fig. 5 is a schematic diagram of an embodiment of the bloom filter bit array in the method for detecting similarity of executable files of the present application.
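As a rough sketch of this filtering step, the function below adds every encoded operation code of one file to an initially empty filter and returns the resulting bit array as a Python integer. The name `bloom_bits`, the default sizes m=64 and k=3, and the salted SHA-256 construction of the k hash positions are all assumptions made for illustration.

```python
import hashlib


def bloom_bits(encoded_items, m=64, k=3):
    """Return the m-bit bloom filter bit array (as an int) for a collection of codes."""
    bits = 0  # all bits start at 0
    for item in encoded_items:
        for i in range(k):
            # Assumption: the i-th hash position comes from SHA-256 of "i:item".
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


first = bloom_bits(["MOV PUSH ADD", "CALL RET"])
second = bloom_bits(["CALL RET", "MOV PUSH ADD", "CALL RET"])
print(first == second)  # order and repetition do not change the bit array
```

Both files must be processed with the same m, k and hash functions; otherwise the resulting bit arrays are not comparable position by position.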
Fig. 6 is a flowchart of a further embodiment of the method for detecting similarity of executable files in the present application, as shown in fig. 6, in the embodiment shown in fig. 1 in the present application, step 103 may include:
step 601, fragmenting the disassembled files of the first executable file and the second executable file.
Specifically, fragmenting the disassembled files of the first executable file and the second executable file may be: fragmenting each disassembled file by taking its basic logical units as the units of fragmentation.
The most typical fragmentation method uses functions or basic blocks as units, because functions and basic blocks are the basic logical units of a disassembled file.
For each slice, the format of each row is: CODE: [ address ] [ opcode ] [ operand ].
Step 602, extracting an operation code in each slice of the disassembled file of the first executable file and the second executable file, and encoding the extracted operation code.
In this way, an opcode set may be obtained for each of the first executable file and the second executable file, each element of the set being the encoding of the opcode sequence of one slice of the corresponding disassembled file.
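Steps 601 and 602 can be sketched as follows, using functions as the fragmentation unit. The listing format is an assumption: each row is taken to be `CODE:address opcode operands`, and function boundaries are assumed to be marked by rows containing `proc` and `endp`; real disassembler output may differ. The helper names `slice_by_function` and `encode_slices` and the use of SHA-256 are hypothetical.

```python
import hashlib


def slice_by_function(asm_lines):
    """Group opcode mnemonics into one list per function."""
    slices, current = [], None
    for line in asm_lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        body = tokens[1:]                # drop the "CODE:address" column
        if "proc" in body:               # assumed function-start marker
            current = []
        elif "endp" in body:             # assumed function-end marker
            if current:
                slices.append(current)
            current = None
        elif current is not None:
            current.append(body[0].upper())  # keep the opcode, drop operands
    return slices


def encode_slices(slices):
    """Return the set of encodings, one per slice's opcode sequence."""
    return {hashlib.sha256(" ".join(s).encode("utf-8")).hexdigest() for s in slices}


listing = [
    "CODE:00401000 sub_401000 proc near",
    "CODE:00401000 push ebp",
    "CODE:00401001 mov ebp, esp",
    "CODE:00401003 retn",
    "CODE:00401004 sub_401000 endp",
]
print(slice_by_function(listing))
```

The resulting set of slice encodings is what would then be added to the bloom filter, so two files sharing many functions with identical opcode sequences set many of the same bits.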
Fig. 7 is a flowchart of a further embodiment of the method for detecting similarity of executable files in the present application, as shown in fig. 7, in the embodiment shown in fig. 1 in the present application, step 105 may be:
step 701, calculating a Hamming distance of a bit array corresponding to the first executable file and the second executable file.
In this embodiment, the bit arrays corresponding to the first executable file and the second executable file may be the bloom filter bit arrays corresponding to the first executable file and the second executable file. The similarity of bloom filter bit arrays may be measured by the Hamming distance (hamming_distance), or by methods such as Cosine, Overlap, Dice, or Jaccard. Here, assuming that the bloom filter bit array corresponding to the first executable file is bit array A and the bloom filter bit array corresponding to the second executable file is bit array B, the Hamming distance between bit array A and bit array B is the number of 1s in the binary result of A xor B.
Examples are as follows:
A=100111;
B=101010;
hamming_distance(A,B)=count_1(A xor B)=count_1(001101)=3;
that is, the hamming distance between bit array a and bit array B is 3.
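The worked example above can be reproduced with a few lines of Python, storing each bit array as an integer. The `jaccard` function illustrates one of the alternative measures the text mentions; both function names are chosen here for illustration only.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def jaccard(a, b):
    """Shared 1 bits divided by the 1 bits present in either array."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


A = 0b100111
B = 0b101010
print(hamming_distance(A, B))  # 3
print(jaccard(A, B))           # 0.4
```

A smaller Hamming distance means more similar files, whereas a larger Jaccard value means more similar files, so the threshold comparison has to be oriented according to the measure chosen.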
On one hand, the method for detecting the similarity of the executable file can better resist the interference introduced by the modification of a source code or a decompiler code by a virus author or a software pirate by considering the essential logic of a program, thereby greatly improving the difficulty of the virus or the pirate program in escaping detection; on the other hand, the excellent space efficiency of the bloom filter is fully utilized, so that the feature loss is small under the condition of higher calculation performance, and the missing report or false report is greatly reduced.
Fig. 8 is a flowchart of another embodiment of the method for detecting similarity of executable files according to the present application, and as shown in fig. 8, the method for detecting similarity of executable files includes:
step 801, a first executable file and a second executable file are obtained.
Step 802, a bloom filter is constructed.
Step 803, disassembling the first executable file and the second executable file to obtain disassembled files of the first executable file and the second executable file respectively.
Step 804, extracting the operation codes in the disassembled files of the first executable file and the second executable file, and encoding the extracted operation codes.
Step 805, filtering the encoded operation code through a pre-constructed bloom filter to obtain bloom filter bit arrays corresponding to the first executable file and the second executable file.
Step 806, calculating the similarity of the bit array corresponding to the first executable file and the second executable file.
Step 807, comparing the similarity with a predetermined threshold, and if the similarity is greater than the predetermined threshold, determining that the first executable file and the second executable file are similar files; and if the similarity is less than or equal to a preset threshold value, determining that the first executable file and the second executable file are not similar files.
Fig. 9 is a schematic structural diagram of an embodiment of an apparatus for detecting similarity of executable files in the present application, where the apparatus for detecting similarity of executable files in the present application may be implemented as a computer device, or a part of a computer device, to implement the method for detecting similarity of executable files provided in the present application.
The computer device may be a terminal device or a server; the form of the computer device is not limited in this embodiment. In this embodiment, the terminal device may be a personal computer (hereinafter referred to as a PC) or a notebook computer.
As shown in fig. 9, the apparatus for detecting similarity of executable files may include: an acquiring module 91, a disassembling module 92, an extracting module 93, an encoding module 94, an obtaining module 95, a calculating module 96 and a determining module 97.
The acquiring module 91 is configured to acquire a first executable file and a second executable file;
the disassembling module 92 is configured to perform disassembling processing on the first executable file and the second executable file to obtain disassembling files of the first executable file and the second executable file, respectively.
Specifically, the format of the executable file is different according to different operating systems, for example, the executable file under the Windows operating system is in an exe format, the executable file under the Linux operating system is in an elf format, and the executable file under the Android operating system is in a dex format or an elf format. The executable file typically exists in binary form, and may be disassembled using tools such as IDA, resulting in a disassembled file in asm format, as shown in fig. 2.
An extracting module 93, configured to extract the operation codes in the disassembled files of the first executable file and the second executable file;
an encoding module 94, configured to encode the operation code extracted by the extraction module 93;
Specifically, a typical assembly instruction includes an operation code (opcode) and zero or more operands. The opcode is represented by a mnemonic such as "MOV" or "PUSH", and an operand may be a register, a constant, or a memory address. In practice, operands vary somewhat after recompilation and change with the compilation and optimization strategy, whereas the opcode carries the semantics of the code and is comparatively stable, so it generally does not change. The extracting module 93 therefore extracts the opcodes in the disassembled file of the first executable file and the disassembled file of the second executable file as the basis for the similarity calculation.
Then, the encoding module 94 encodes the extracted operation codes. In a specific implementation, various encoding methods are available, for example using the original operation codes directly, or calculating a hash or fuzzy hash over the operation codes. The only requirement is that identical or similar functions in the two source codes produce consistent codes; the encoding method adopted by the encoding module 94 is not limited in this embodiment.
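The two encoding options named above can be sketched as follows. This is a minimal illustration, not the encoding used by the encoding module 94: the function names `encode_raw` and `encode_hash` are hypothetical, and SHA-256 is chosen only because it is available in the Python standard library.

```python
import hashlib

def encode_raw(opcodes):
    """Option 1: keep the original mnemonics, normalized to upper case."""
    return " ".join(op.upper() for op in opcodes)

def encode_hash(opcodes):
    """Option 2: fixed-length SHA-256 digest of the opcode sequence."""
    return hashlib.sha256(encode_raw(opcodes).encode("utf-8")).hexdigest()

a = ["push", "mov", "call", "ret"]
b = ["PUSH", "MOV", "CALL", "RET"]   # same opcodes, different letter case
c = ["push", "mov", "jmp", "ret"]    # one opcode differs

print(encode_hash(a) == encode_hash(b))  # identical sequences give the same code
print(encode_hash(a) == encode_hash(c))  # different sequences give different codes
```

An exact hash such as this one only matches identical opcode sequences; a fuzzy hash would be needed if nearly identical sequences should also map to matching codes.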
An obtaining module 95, configured to filter the operation code encoded by the encoding module 94 to obtain bit arrays corresponding to the first executable file and the second executable file;
a calculating module 96, configured to calculate the similarity between the bit arrays corresponding to the first executable file and the second executable file. Specifically, once the bit arrays of the first executable file and the second executable file have been obtained, detecting the similarity of the two files reduces to calculating the similarity of the two bit arrays: the more similar the two files are, the more 1 bits their bit arrays have in common.
A determining module 97, configured to determine that the first executable file and the second executable file are similar files when the similarity calculated by the calculating module 96 is greater than a predetermined threshold.
If the similarity is less than or equal to the predetermined threshold, the determining module 97 determines that the first executable file and the second executable file are not similar files.
The predetermined threshold may be set according to system performance and/or implementation requirements, and the size of the predetermined threshold is not limited in this embodiment.
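The decision rule of the determining module 97 can be sketched as follows. The score used here, the number of shared 1 bits divided by the number of bits set in either array (a Jaccard-style measure), is an assumption chosen for illustration; the function names and the threshold values are hypothetical.

```python
def shared_ones_similarity(a: int, b: int) -> float:
    """Shared 1 bits divided by bits set in either array (Jaccard-style)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # assumption: two empty bit arrays count as identical
    return bin(a & b).count("1") / union

def is_similar(a: int, b: int, threshold: float) -> bool:
    """Similar only if the score is strictly greater than the threshold."""
    return shared_ones_similarity(a, b) > threshold

A = 0b100111
B = 0b101010
print(shared_ones_similarity(A, B))   # 2 shared bits out of 5 set bits
print(is_similar(A, B, threshold=0.3))
```

A score equal to the threshold is treated as not similar, matching the "less than or equal to" case described above.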
In the above device for detecting similarity of executable files, after the acquiring module 91 acquires a first executable file and a second executable file, the disassembling module 92 disassembles the two files to obtain their respective disassembled files; the extracting module 93 extracts the operation codes from the disassembled files; the encoding module 94 encodes the extracted operation codes; and the obtaining module 95 filters the encoded operation codes to obtain the bit arrays corresponding to the first executable file and the second executable file. Finally, the calculating module 96 calculates the similarity of the two bit arrays, and if the similarity is greater than the predetermined threshold, the determining module 97 determines that the first executable file and the second executable file are similar files. On the one hand, the device takes the essential logic of the program into account, so it better resists the interference that virus authors or software plagiarists introduce by modifying source code or decompiled code, which makes it much harder for a virus or plagiarized program to escape detection. On the other hand, the device makes full use of the space efficiency of the bloom filter, so that little feature information is lost while computing performance stays high, which greatly reduces missed reports and false reports.
Fig. 10 is a schematic structural diagram of another embodiment of the device for detecting similarity of executable files according to the present application, which is different from the device for detecting similarity of executable files shown in fig. 9 in that the device for detecting similarity of executable files shown in fig. 10 may further include:
a building module 98 for building a bloom filter.
In this embodiment, before the obtaining module 95 performs filtering processing on the operation code encoded by the encoding module 94 to obtain the bit arrays corresponding to the first executable file and the second executable file, the constructing module 98 needs to construct a bloom filter. The bloom filter is a space-efficient probabilistic data structure that represents a set compactly with a bit array and can judge whether an element belongs to the set.
A bloom filter is a bit array of m bits. To represent a set of n elements, S = {x1, x2, …, xn}, the bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1, …, m}. For any element x, the position h_i(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 several times, only the first write has any effect; later writes change nothing. To judge whether y belongs to the set, the k hash functions are applied to y: if every position h_i(y) is 1 (1 ≤ i ≤ k), y is judged to be an element of the set; otherwise, y is not an element of the set.
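The insertion and membership test described above can be sketched as a small class. This is an illustrative sketch under stated assumptions: the class name `BloomFilter` is hypothetical, the k hash functions are simulated by salting SHA-256 with the index i, and positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m              # number of bits in the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # every bit starts at 0

    def _positions(self, item: str):
        """Yield the k positions h_i(item), one per salted hash."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        """Set every position h_i(item) to 1; repeated writes change nothing."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        """True only if all k positions are 1 (may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.add("PUSH MOV CALL RET")
print("PUSH MOV CALL RET" in bf)   # an added element is always found
```

An element that was added is always reported as present; an element that was never added is usually, but not always, reported as absent.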
The misjudgment rate analysis and parameter selection of the bloom filter are described as follows. In the representation method of the bloom filter, the probability that a given position is set to 1 by a single hash operation is
$$\frac{1}{m}$$
and the probability that it is still 0 is
$$1 - \frac{1}{m}$$
The hash function is performed kn times, so at the end of the operation, the probability that a bit is still 0 is:
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \qquad (1)$$
The probability of a false positive is:
$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \qquad (2)$$
For a given n and m, both sides of equation (2) are differentiated with respect to k, and setting
$$\frac{\partial f}{\partial k} = 0$$
gives
$$k = \frac{m}{n}\ln 2$$
at which the probability of misjudgment is minimal, and at this point:
$$f = \left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \approx 0.6185^{\,m/n} \qquad (3)$$
Assuming that the estimated total number of encoded operation codes that may appear in each executable file is n, and given a target misjudgment rate, the number of bits m of the bit array of the bloom filter can be obtained according to formula (3), and the number k of hash functions can then be obtained from m and n. At this point all parameters have been calculated, and once k hash functions have been selected, the bloom filter is established. Finally, each bit of the bit array of the bloom filter is set to 0; that is, initially, every bit of the bit array of the bloom filter has the value 0, as shown in fig. 4.
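The parameter selection can be sketched numerically. For the optimal number of hash functions, the standard bloom filter relations give m = -n·ln(f) / (ln 2)^2 and k = (m/n)·ln 2. The function names and the sample values n = 1000 and f = 0.01 are assumptions chosen for illustration.

```python
import math

def bloom_parameters(n: int, f: float):
    """Return (m, k) for n expected elements and target misjudgment rate f."""
    m = math.ceil(-n * math.log(f) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate misjudgment rate (1 - e^(-kn/m))^k for the given parameters."""
    return (1.0 - math.exp(-k * n / m)) ** k

m, k = bloom_parameters(n=1000, f=0.01)
print(m, k)                              # 9586 7
print(false_positive_rate(1000, m, k))   # close to the 0.01 target
```

Because k must be an integer, the achieved rate is only approximately equal to the target; rounding k up or down shifts it slightly.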
In this way, the obtaining module 95 is specifically configured to filter the encoded operation code through the bloom filter that is pre-constructed by the constructing module 98, so as to obtain the bloom filter bit arrays corresponding to the first executable file and the second executable file.
In a specific implementation, the obtaining module 95 may add all the encoded operation codes to the pre-constructed bloom filter according to the algorithm of the bloom filter, and may obtain a bit array of the bloom filter corresponding to the encoded operation codes, as shown in fig. 5.
In this embodiment, the extracting module 93 may include: a fragmentation sub-module 931 and an operation code extraction sub-module 932;
The fragmentation sub-module 931 is configured to fragment the disassembled files of the first executable file and the second executable file. In this embodiment, the fragmentation sub-module 931 is specifically configured to fragment each disassembled file using the basic logical unit of that disassembled file as the unit of fragmentation.
The most typical fragmentation method is in units of functions or basic blocks, because functions and basic blocks are the basic logical units of a disassembled file.
For each slice, the format of each row is: CODE: [ address ] [ opcode ] [ operand ].
An operation code extraction sub-module 932 for extracting an operation code in each slice of the disassembled file of the first executable file and the second executable file.
In this way, an opcode set is obtained for each of the first executable file and the second executable file; each element of the set is the encoding of the opcode sequence of one fragment of that file's disassembled file.
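Fragmentation and opcode extraction can be sketched on a toy listing. The listing layout is an assumption made for illustration: each line follows the pattern CODE:[address] [opcode] [operand], and blank lines stand in for function boundaries; real disassembler output needs a parser suited to the tool. All function names are hypothetical.

```python
LISTING = """\
CODE:00401000 push ebp
CODE:00401001 mov ebp, esp
CODE:00401003 call sub_401020
CODE:00401008 retn

CODE:00401020 xor eax, eax
CODE:00401022 retn
"""

def split_fragments(listing: str):
    """Split the listing into fragments at blank lines (toy function boundaries)."""
    fragments, current = [], []
    for line in listing.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments

def extract_opcodes(fragment):
    """Return the opcode (second whitespace-separated field) of each line."""
    opcodes = []
    for line in fragment:
        fields = line.split()
        if len(fields) >= 2:
            opcodes.append(fields[1].upper())
    return opcodes

def opcode_set(listing: str):
    """One element per fragment: the fragment's opcode sequence as a string."""
    return {" ".join(extract_opcodes(f)) for f in split_fragments(listing)}

print(opcode_set(LISTING))
```

Here the raw opcode sequence serves as the encoding of each fragment; a hash of the sequence could be substituted without changing the rest of the sketch.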
In this embodiment, the calculating module 96 is specifically configured to calculate a hamming distance between bit arrays corresponding to the first executable file and the second executable file.
In this embodiment, the bit arrays corresponding to the first executable file and the second executable file may be the bloom filter bit arrays corresponding to those files. The similarity of bloom filter bit arrays may be measured by the Hamming distance (hamming_distance), or by other measures such as Cosine, Overlap, Dice, or Jaccard. Here, assuming that the bloom filter bit array corresponding to the first executable file is bit array A and the bloom filter bit array corresponding to the second executable file is bit array B, the Hamming distance between bit array A and bit array B is the number of 1s in the binary result of A XOR B.
Examples are as follows:
A=100111;
B=101010;
hamming_distance(A,B)=count_1(A xor B)=count_1(001101)=3;
That is, the Hamming distance between bit array A and bit array B is 3.
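The worked example can be reproduced with a few lines of code. This is a minimal sketch; the function name `hamming_distance` mirrors the notation of the example, and representing the bit arrays as strings of 0s and 1s is an assumption made for readability.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of 1s in (a XOR b) for two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit arrays must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")

A = "100111"
B = "101010"
print(hamming_distance(A, B))  # 3
```

A smaller Hamming distance means the two bit arrays differ in fewer positions.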
On the one hand, the device for detecting the similarity of executable files takes the essential logic of the program into account, so it better resists the interference that virus authors or software plagiarists introduce by modifying source code or decompiled code, which makes it much harder for a virus or plagiarized program to escape detection. On the other hand, the device makes full use of the space efficiency of the bloom filter, so that little feature information is lost while computing performance stays high, which greatly reduces missed reports and false reports.
Fig. 11 is a schematic structural diagram of an embodiment of a computer device according to the present application, where the computer device in the embodiment may include a memory, a processor, and a computer program that is stored in the memory and is executable on the processor, and when the processor executes the computer program, the method for detecting similarity of executable files according to the embodiment of the present application may be implemented.
The computer device may be a terminal device or a server; the form of the computer device is not limited in this embodiment. In this embodiment, the terminal device may be a personal computer (hereinafter referred to as a PC) or a notebook computer.
FIG. 11 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 11 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in FIG. 11, computer device 12 is embodied in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read Only Memory (CD-ROM), a Digital Versatile Disc Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the embodiments described herein.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown in FIG. 11, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that although not shown in FIG. 11, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
The processing unit 16 executes various functional applications and data processing by running the program stored in the system memory 28, for example, implementing the executable file similarity detection method provided in the embodiment of the present application.
The embodiment of the present application provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for detecting similarity of executable files provided in the embodiment of the present application.
The non-transitory computer readable storage medium described above may take any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a flash Memory, an optical fiber, a portable compact disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. Alternatively, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), and the like.
Those skilled in the art will understand that all or part of the steps of the method in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (6)

1. A method for detecting similarity of executable files is characterized by comprising the following steps:
acquiring a first executable file and a second executable file;
disassembling the first executable file and the second executable file to respectively obtain disassembled files of the first executable file and the second executable file;
taking the basic logic units of the disassembled files of the first executable file and the second executable file as units, and fragmenting the disassembled files of the first executable file and the second executable file;
extracting operation codes in each fragment of the disassembled file of the first executable file and the second executable file, and coding the extracted operation codes;
constructing a bloom filter;
adding the coded operation codes into a pre-constructed bloom filter according to the algorithm of the bloom filter to obtain bloom filter bit arrays corresponding to the first executable file and the second executable file;
calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file;
if the similarity is larger than a preset threshold value, the first executable file and the second executable file are determined to be similar files.
2. The method of claim 1, wherein calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file comprises:
and calculating the Hamming distance of the bit array corresponding to the first executable file and the second executable file.
3. An apparatus for detecting similarity of executable files, comprising:
the acquisition module is used for acquiring a first executable file and a second executable file;
the disassembling module is used for disassembling the first executable file and the second executable file to respectively obtain disassembling files of the first executable file and the second executable file;
the fragmentation submodule is used for fragmenting the disassembled files of the first executable file and the second executable file by taking a basic logic unit of the disassembled files of the first executable file and the second executable file as a unit;
an operation code extraction sub-module, configured to extract an operation code in each slice of a disassembled file of the first executable file and the second executable file;
the coding module is used for coding the operation codes extracted by the operation code extraction submodule;
the building module is used for building the bloom filter;
an obtaining module, configured to add the encoded operation code to a pre-constructed bloom filter according to an algorithm of the bloom filter to obtain a bloom filter bit array corresponding to the first executable file and the second executable file;
the calculating module is used for calculating the similarity of the bit arrays corresponding to the first executable file and the second executable file;
a determining module, configured to determine that the first executable file and the second executable file are similar files when the similarity calculated by the calculating module is greater than a predetermined threshold.
4. The apparatus of claim 3,
the calculation module is specifically configured to calculate a hamming distance between bit arrays corresponding to the first executable file and the second executable file.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of claim 1 or 2 when executing the computer program.
6. A computer readable storage medium, characterized in that it stores at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method according to claim 1 or 2.
CN201711460533.8A 2017-12-28 2017-12-28 Executable file similarity detection method and device and computer equipment Active CN109977976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711460533.8A CN109977976B (en) 2017-12-28 2017-12-28 Executable file similarity detection method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711460533.8A CN109977976B (en) 2017-12-28 2017-12-28 Executable file similarity detection method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN109977976A CN109977976A (en) 2019-07-05
CN109977976B true CN109977976B (en) 2023-04-07

Family

ID=67074677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711460533.8A Active CN109977976B (en) 2017-12-28 2017-12-28 Executable file similarity detection method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN109977976B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant