CN112989432A - File signature extraction method and device - Google Patents

Info

Publication number: CN112989432A
Authority: CN (China)
Prior art keywords: binary, binary content, software, content, candidate
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201911295341.5A
Other languages: Chinese (zh)
Inventors: 鞠全永, 朱晓林
Current Assignee: Huawei Technologies Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Huawei Technologies Co Ltd
Application filed by: Huawei Technologies Co Ltd
Priority to: CN201911295341.5A
Publication: CN112989432A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6209 Protecting access to data via a platform, e.g. using keys or access control rules to a single file or object, e.g. in a secure envelope, encrypted and accessed using a key, or with access control rules appended to the object itself

Abstract

The application provides a file signature extraction method and device. The method includes: extracting at least two pieces of binary content from a non-malicious binary software set and a malicious binary software set; obtaining at least one piece of binary content from them as candidate binary content, according to first-type statistical indicators of the at least two pieces of binary content in each of the two sets; obtaining target binary content from the candidate binary content, according to second-type statistical indicators of the candidate binary content in each of the two sets; and obtaining a signature of the malicious binary software according to the target binary content. Partial binary content of malicious software is thereby extracted automatically as a signature for identifying the malicious software, which addresses the excessive workload of malware analysts and the low efficiency of signature extraction. Because the method does not depend on the personal experience or subjective judgment of an analyst, it also improves the accuracy of malware signature extraction.

Description

File signature extraction method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting a file signature.
Background
With the rapid development of the internet, network security problems have become increasingly prominent. Malicious software, represented by trojans, viruses, backdoor programs, advertisement software and the like, has grown rapidly in quantity, update speed, and the techniques it uses, and its impact on internet users and the losses it causes increase year by year.
To address these problems, malware is generally identified based on signatures. Typically, a malware analyst studies the malware and extracts content such as character strings and assembly instructions as a signature for identifying it.
However, the amount of malware grows very rapidly. The amount of malware that an analyst can analyze is orders of magnitude smaller than the amount that requires manual reverse engineering to identify signatures, so analysis is much slower than the growth of malware. The workload of malware analysts is too large and signature extraction is too inefficient to meet demand. Moreover, manual analysis and signature extraction depend heavily on the analyst's personal experience and concentration, so accuracy is also low.
Disclosure of Invention
The application provides a file signature extraction method and device, and aims to solve the problem of low efficiency when a manual analysis method is adopted to extract a malicious software signature.
In a first aspect, an embodiment of the present application provides a file signature extraction method, which may be executed by an analysis device. The method includes the following steps. First, at least two pieces of binary content are extracted from a first software set and a second software set, where the first software set includes a first number of non-malicious binary software and the second software set includes a second number of malicious binary software. Here, malicious software includes trojans, viruses, backdoor programs, advertisement software and the like, and non-malicious software may also be referred to as normal binary software. The first number and the second number may be set according to actual conditions and are not particularly limited in the embodiments of the present application. Second, at least one piece of binary content is obtained from the at least two pieces of binary content as candidate binary content, according to a first type of statistical indicator of each piece in the first software set and in the second software set, where the first type of statistical indicator includes at least one of an occurrence frequency and an occurrence probability. Third, at least one piece of binary content is obtained from the candidate binary content as target binary content, according to a second type of statistical indicator of each piece of candidate binary content in the first software set and in the second software set, where the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity. Finally, a signature of the corresponding malicious binary software is obtained according to the target binary content, where the signature is used to identify the malicious binary software.
In this embodiment, at least two pieces of binary content are extracted from the first software set and the second software set; candidate binary content is selected from them according to the first type of statistical indicator in each set; target binary content is selected from the candidate binary content according to the second type of statistical indicator in each set; and a signature of the corresponding malicious binary software is obtained from the target binary content. Partial binary content of malicious software is thus extracted automatically as a signature for identifying the malicious software, without relying on manual analysis. This addresses the excessive workload of malware analysts and the low efficiency of signature extraction, improves the efficiency of malware signature extraction, and meets application requirements. In addition, because the method is not influenced by the personal experience or subjective judgment of an analyst, the accuracy of malware signature extraction is improved to a certain extent.
In one possible design, extracting at least two pieces of binary content from the first software set and the second software set includes:
using a sliding window to extract binary content of a preset length from each binary software in the first software set and the second software set, where the preset length is the number of bytes covered by the sliding window.
For example, a window of k bytes may be set, where k is a natural number greater than 1. The window slides from left to right along the binary file (that is, from a low address to a high address in the storage space occupied by the binary file), moving one byte per step. For each binary file in the first software set and the second software set, the window is slid in this way and the k-byte binary content under it is extracted at each position.
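The sliding-window extraction described above can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_windows` and the sample bytes are assumptions made here, not part of the application.

```python
from typing import Iterator


def sliding_windows(data: bytes, k: int) -> Iterator[bytes]:
    """Yield every k-byte slice of data, moving one byte per step
    from the low address (offset 0) towards the high address."""
    if k <= 1:
        raise ValueError("k must be a natural number greater than 1")
    # A file shorter than k bytes yields no windows at all.
    for offset in range(len(data) - k + 1):
        yield data[offset:offset + k]


# Hypothetical 5-byte "file"; with k = 3 it produces three overlapping windows.
sample = b"\x4d\x5a\x90\x00\x03"
windows = list(sliding_windows(sample, k=3))
print(windows)
```

A file of n bytes yields n - k + 1 windows, so the amount of extracted content grows linearly with file size for a fixed k.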
In one possible design, when the first type of statistical indicator includes the occurrence frequency, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to the first type of statistical indicator of each piece in the first software set and in the second software set, includes:
obtaining, from the at least two pieces of binary content, the binary content whose occurrence frequency in the first software set is lower than a first preset frequency threshold, as first binary content;
obtaining the occurrence frequency of each piece of the first binary content in the second software set from the occurrence frequencies of the at least two pieces of binary content in the second software set;
and obtaining, from the first binary content, the binary content whose occurrence frequency in the second software set is higher than a second preset frequency threshold, as the candidate binary content, where the first preset frequency threshold is smaller than the second preset frequency threshold.
Here, if the first type of statistical indicator includes the occurrence frequency, binary content that occurs rarely in the non-malicious binary software of the first software set and often in the malicious binary software of the second software set is selected from the at least two pieces of binary content as candidate binary content.
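A minimal sketch of this frequency-based selection follows. It assumes that "occurrence frequency" means the total number of times a k-byte window appears across all files of a set; the text here does not define the term precisely. The function names, thresholds, and sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def count_occurrences(files: Iterable[bytes], k: int) -> Counter:
    """Count how often each k-byte window occurs across all files in a set."""
    counts: Counter = Counter()
    for data in files:
        for offset in range(len(data) - k + 1):
            counts[data[offset:offset + k]] += 1
    return counts


def select_by_frequency(benign: List[bytes], malicious: List[bytes], k: int,
                        low_threshold: int, high_threshold: int) -> Set[bytes]:
    """Keep windows that are rare in the benign set and common in the malicious set."""
    if not low_threshold < high_threshold:
        raise ValueError("the first threshold must be smaller than the second")
    benign_counts = count_occurrences(benign, k)
    malicious_counts = count_occurrences(malicious, k)
    all_windows = set(benign_counts) | set(malicious_counts)
    # Step 1: windows whose frequency in the benign set is below the first threshold.
    rare_in_benign = {w for w in all_windows if benign_counts[w] < low_threshold}
    # Steps 2 and 3: look up their frequency in the malicious set and keep
    # those above the second threshold.
    return {w for w in rare_in_benign if malicious_counts[w] > high_threshold}


# Hypothetical toy sets; real inputs would be the raw bytes of binary files.
benign = [b"hello world", b"hello there"]
malicious = [b"evil world", b"evil there", b"evil again"]
candidates = select_by_frequency(benign, malicious, k=4,
                                 low_threshold=1, high_threshold=2)
```

In this toy data only the windows `b"evil"` and `b"vil "` never occur in the benign set and occur more than twice in the malicious set, so they are the only candidates that survive.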
In one possible design, when the first type of statistical indicator includes the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to the first type of statistical indicator of each piece in the first software set and in the second software set, includes:
obtaining, from the at least two pieces of binary content, the binary content whose occurrence probability in the first software set is lower than a first preset probability threshold, as second binary content;
obtaining the occurrence probability of each piece of the second binary content in the second software set from the occurrence probabilities of the at least two pieces of binary content in the second software set;
and obtaining, from the second binary content, the binary content whose occurrence probability in the second software set is higher than a second preset probability threshold, as the candidate binary content, where the first preset probability threshold is smaller than the second preset probability threshold.
For example, if the first type of statistical indicator includes the occurrence probability, binary content with a low occurrence probability in the non-malicious binary software of the first software set and a high occurrence probability in the malicious binary software of the second software set is selected from the at least two pieces of binary content as candidate binary content.
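The probability-based variant can be sketched the same way. The definition used below, the number of occurrences of a window divided by the total number of windows extracted from the set, is an assumption for illustration, since the text here does not spell out how the occurrence probability is computed. All names and values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def occurrence_probabilities(files: List[bytes], k: int) -> Dict[bytes, float]:
    """Assumed definition: occurrences of a window / all windows in the set."""
    counts: Counter = Counter()
    for data in files:
        for offset in range(len(data) - k + 1):
            counts[data[offset:offset + k]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def select_by_probability(benign: List[bytes], malicious: List[bytes], k: int,
                          low_threshold: float, high_threshold: float) -> Set[bytes]:
    """Keep windows that are improbable in benign files and probable in malicious ones."""
    if not 0 <= low_threshold < high_threshold:
        raise ValueError("thresholds must satisfy 0 <= first < second")
    p_benign = occurrence_probabilities(benign, k)
    p_malicious = occurrence_probabilities(malicious, k)
    # A window absent from the malicious set has probability 0 there and can
    # never exceed the second threshold, so iterating over p_malicious suffices.
    return {w for w, p in p_malicious.items()
            if p_benign.get(w, 0.0) < low_threshold and p > high_threshold}


benign = [b"abcabc"]
malicious = [b"xyzxyz", b"xyzabc"]
candidates = select_by_probability(benign, malicious, k=3,
                                   low_threshold=0.1, high_threshold=0.3)
```

Here `b"xyz"` accounts for 3 of the 8 windows in the malicious set (0.375) and never appears in the benign set, so it is the only candidate kept.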
A possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content according to the first type of statistical indicator of each of the at least two sections of binary content in a first software set and the first type of statistical indicator in a second software set, respectively, includes:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in a first software set and the occurrence frequency of each section of binary content in a second software set, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each binary content in the first binary content to be processed in the first software set from the occurrence probability of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence probability of each binary content in the first binary content to be processed in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary content in the first binary contents to be processed in the first software set and the occurrence probability of each binary content in the second software set.
Here, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content as a first binary content to be processed according to the occurrence frequency, and further, candidate binary content may be obtained from the first binary content to be processed according to the occurrence probability.
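The two-stage selection can be sketched as two applications of one filter over precomputed statistics. The text does not state which rule each stage applies in the combined case; the sketch assumes the same "below a first threshold in the first set, above a second threshold in the second set" rule used in the single-indicator designs. The function names, thresholds, and statistics are hypothetical.

```python
from typing import Dict, Set

Stats = Dict[bytes, float]


def two_threshold_filter(contents: Set[bytes], benign: Stats, malicious: Stats,
                         low: float, high: float) -> Set[bytes]:
    """Keep contents whose benign value is below low and malicious value is above high."""
    return {c for c in contents
            if benign.get(c, 0) < low and malicious.get(c, 0) > high}


def select_candidates(contents: Set[bytes],
                      freq_benign: Stats, freq_malicious: Stats,
                      prob_benign: Stats, prob_malicious: Stats,
                      freq_low: float, freq_high: float,
                      prob_low: float, prob_high: float) -> Set[bytes]:
    # Stage 1: filter by occurrence frequency; the survivors play the role of
    # the "first binary content to be processed".
    to_process = two_threshold_filter(contents, freq_benign, freq_malicious,
                                      freq_low, freq_high)
    # Stage 2: probabilities are looked up only for the survivors, then filtered.
    return two_threshold_filter(to_process, prob_benign, prob_malicious,
                                prob_low, prob_high)


# Hypothetical precomputed statistics for three extracted contents.
contents = {b"AAAA", b"BBBB", b"CCCC"}
freq_benign = {b"AAAA": 0, b"BBBB": 50, b"CCCC": 1}
freq_malicious = {b"AAAA": 40, b"BBBB": 60, b"CCCC": 30}
prob_benign = {b"AAAA": 0.0, b"BBBB": 0.05, b"CCCC": 0.02}
prob_malicious = {b"AAAA": 0.04, b"BBBB": 0.06, b"CCCC": 0.001}

candidates = select_candidates(contents, freq_benign, freq_malicious,
                               prob_benign, prob_malicious,
                               freq_low=5, freq_high=20,
                               prob_low=0.01, prob_high=0.03)
```

In this sketch each stage is an independent predicate, so running the probability stage first would produce the same final set; the order only changes how many lookups the second stage has to perform.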
Similarly, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one section of binary content may be obtained from the at least two sections of binary content according to the occurrence probability to serve as a second binary content to be processed, and further, a candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Specifically, when the first-class statistical indicator includes the occurrence frequency and the occurrence probability, obtaining at least one segment of binary content from the at least two segments of binary content according to the first-class statistical indicator of each segment of binary content in the at least two segments of binary content in the first software set and the first-class statistical indicator in the second software set, and using the at least one segment of binary content as a candidate binary content includes:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability in the second software set, obtaining at least one binary content from the at least two binary contents as a second binary content to be processed;
obtaining the occurrence frequency of each binary content in the second binary content to be processed in the first software set from the occurrence frequency of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence frequency of each binary content in the second binary content to be processed in the second software set from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set.
A possible design, when the second type of statistical indicator includes the software coverage ratio, obtaining at least one piece of binary content from the candidate binary content as a target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, including:
obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in a first software set from the candidate binary contents as seventh binary contents;
obtaining the software coverage ratio of each binary content in the seventh binary content in the second software set from the software coverage ratio of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
Here, if the second type of statistical indicator includes the software coverage ratio, selecting binary contents with low software coverage ratio in non-malware and high software coverage ratio in malware from the candidate binary contents as target binary contents according to the software coverage ratio of the candidate binary contents in non-malware binary software of the first software set and the software coverage ratio in malware of the second software set.
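As an illustration of the coverage-based selection, the sketch below assumes that the software coverage ratio of a piece of binary content in a set is the fraction of files in that set that contain it at least once. That definition, the function names, and the thresholds are assumptions made for this example.

```python
from typing import Iterable, List, Set


def coverage_ratio(content: bytes, files: List[bytes]) -> float:
    """Assumed definition: share of files in the set that contain the content."""
    if not files:
        return 0.0
    return sum(1 for data in files if content in data) / len(files)


def select_by_coverage(candidates: Iterable[bytes],
                       benign: List[bytes], malicious: List[bytes],
                       low_threshold: float, high_threshold: float) -> Set[bytes]:
    """Keep candidates covering few benign files and many malicious files."""
    if not low_threshold < high_threshold:
        raise ValueError("the first threshold must be smaller than the second")
    # Step 1: candidates whose coverage in the benign set is below the first threshold.
    rare_in_benign = [c for c in candidates
                      if coverage_ratio(c, benign) < low_threshold]
    # Steps 2 and 3: of those, keep candidates whose coverage in the malicious
    # set is above the second threshold.
    return {c for c in rare_in_benign
            if coverage_ratio(c, malicious) > high_threshold}


benign = [b"hello world", b"hello there", b"plain text"]
malicious = [b"evil world", b"evil there", b"more evil", b"other"]
targets = select_by_coverage([b"evil", b"worl", b"ther"], benign, malicious,
                             low_threshold=0.2, high_threshold=0.5)
```

Unlike a raw occurrence count, this measure is not inflated by a single file that repeats the same bytes many times, which is one plausible reason to apply it as a second filter.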
A possible design, when the second type of statistical indicator includes the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, including:
obtaining binary contents with the set similarity lower than a first preset similarity threshold in the first software set from the candidate binary contents as eighth binary contents;
acquiring the set similarity of each binary content in the eighth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the set similarity higher than a second preset similarity threshold in the second software set from the eighth binary content as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
For example, if the second type of statistical indicator includes the set similarity, binary content with a low set similarity in the non-malicious binary software of the first software set and a high set similarity in the malicious binary software of the second software set is selected from the candidate binary content as the target binary content.
A possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, including:
obtaining at least one section of binary content from the candidate binary content as first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion of the candidate binary content in the second software set;
acquiring set similarity of each segment of binary content in the first characteristic binary content in the first software set from the set similarity of each segment of binary content in the candidate binary content in the first software set, and acquiring set similarity of each segment of binary content in the first characteristic binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and acquiring target binary contents from the first characteristic binary contents according to the set similarity of each section of binary contents in the first characteristic binary contents in the first software set and the set similarity of each section of binary contents in the second software set.
Here, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content as the first characteristic binary content according to the software coverage ratio, and further, the target binary content may be obtained from the first characteristic binary content according to the set similarity.
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content as a second characteristic binary content according to the set similarity, and further, a target binary content may be obtained from the second characteristic binary content according to the software coverage ratio.
Specifically, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one section of binary content from the candidate binary content according to the second type of statistical indicator of each section of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, as the target binary content, including:
acquiring at least one section of binary content from the candidate binary content as second characteristic binary content according to the set similarity of each section of binary content in the candidate binary content in the first software set and the set similarity of each section of binary content in the second software set;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each piece of binary content in the second characteristic binary content in the first software set and the software coverage proportion of each piece of binary content in the second software set.
In addition, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining information entropy values respectively corresponding to each section of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each section of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
The higher the information entropy value, the more random the context is; the lower the value, the more consistent the context is. If the information entropy value of a piece of candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the content appears in differing contexts across that software, and it is deleted from the candidate binary content. (Binary content that is genuinely malicious is expected to appear in the same context throughout the malicious binary software of the second software set.) The preset entropy threshold may be set according to actual conditions and is not particularly limited in this application.
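The entropy check can be sketched as follows. The text does not say how wide the "context" is, so this example assumes it is the single byte immediately following each occurrence and computes the Shannon entropy, in bits, of those bytes. The function names, the threshold, and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


def context_entropy(content: bytes, files: Iterable[bytes]) -> float:
    """Shannon entropy (bits) of the byte that follows each occurrence of content."""
    following: Counter = Counter()
    for data in files:
        start = data.find(content)
        while start != -1:
            end = start + len(content)
            if end < len(data):           # skip occurrences at the very end of a file
                following[data[end]] += 1
            start = data.find(content, start + 1)
    total = sum(following.values())
    if total == 0:
        return 0.0                        # assumption: no observed context counts as zero entropy
    return sum((n / total) * math.log2(total / n) for n in following.values())


def drop_high_entropy(candidates: List[bytes], malicious: List[bytes],
                      threshold: float) -> List[bytes]:
    """Delete candidates whose context entropy exceeds the preset threshold."""
    return [c for c in candidates if context_entropy(c, malicious) <= threshold]


malicious = [b"evilA..evilA", b"xxevilA", b"keyQ keyR keyS keyT"]
kept = drop_high_entropy([b"evil", b"key"], malicious, threshold=1.0)
```

In the sample, `b"evil"` is always followed by `A` (entropy 0) and is kept, while `b"key"` is followed by four different bytes with equal frequency (entropy 2 bits) and is dropped.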
In one possible design, if the candidate binary content includes at least two pieces of binary content, then after the candidate binary content is obtained, the method further includes:
determining an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, where the attribute information includes at least one of the following: the number of occurrences of the candidate binary content in the first software set; the number of occurrences of the candidate binary content in the second software set; the occurrence probability of the candidate binary content in the first software set; the occurrence probability of the candidate binary content in the second software set; the position of the first occurrence of the candidate binary content in each binary software of the first software set, and the mean, variance and entropy of that position; the position of the first occurrence of the candidate binary content in each binary software of the second software set, and the mean, variance and entropy of that position; and the printable characters of the candidate binary content;
according to the importance ranking rule, ranking at least two sections of binary contents in the candidate binary contents according to the importance degree from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining the sequencing result of the target binary content according to the sequencing result of the candidate binary content;
and deleting the binary contents with the sequencing sequence numbers after the preset sequence numbers from the target binary contents.
In this design, an importance ranking rule is determined from the attribute information of the candidate binary content, and the candidate binary content is ranked from high to low importance according to that rule. The ranking of the target binary content is then derived from the ranking of the candidate binary content, and binary content ranked after the preset sequence number is deleted from the target binary content. The signature of the corresponding malicious binary software is therefore generated from the more important binary content, so the generated signature identifies the malicious binary software more accurately.
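The ranking and truncation steps can be illustrated with a short sketch. The application leaves the importance ranking rule open, so the rule used here is invented purely for illustration: the surplus of malicious over benign occurrences, with the number of printable ASCII bytes as a tie-breaker. The class, function names, and data are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    content: bytes
    benign_count: int      # occurrences in the first (non-malicious) set
    malicious_count: int   # occurrences in the second (malicious) set


def rank_candidates(candidates: List[Candidate]) -> List[bytes]:
    """Order candidates from most to least important under a hypothetical rule."""
    def score(c: Candidate) -> Tuple[int, int]:
        printable = sum(32 <= b < 127 for b in c.content)
        return (c.malicious_count - c.benign_count, printable)
    return [c.content for c in sorted(candidates, key=score, reverse=True)]


def truncate_targets(ranking: List[bytes], targets: Set[bytes], keep: int) -> List[bytes]:
    """Order the targets by the candidate ranking and drop those after position keep."""
    ordered = [content for content in ranking if content in targets]
    return ordered[:keep]


candidates = [
    Candidate(b"http", 10, 12),
    Candidate(b"conf", 5, 20),
    Candidate(b"\x00\x01\x02\x03", 0, 30),
    Candidate(b"evil", 0, 30),
]
ranking = rank_candidates(candidates)
final_targets = truncate_targets(ranking, {b"evil", b"conf", b"http"}, keep=2)
```

Because the target binary content is a subset of the candidate binary content, its order can be read directly off the candidate ranking, which is what `truncate_targets` does before cutting the list at the preset position.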
In a second aspect, an embodiment of the present application provides a file signature extracting apparatus, where the apparatus includes:
an extraction module, configured to extract at least two pieces of binary content from a first software set and a second software set, where the first software set includes a first number of non-malicious binary software and the second software set includes a second number of malicious binary software;
a first obtaining module, configured to obtain at least one segment of binary content from the at least two segments of binary content as candidate binary content according to a first class statistical indicator of each of the at least two segments of binary content in a first software set and a first class statistical indicator of each of the at least two segments of binary content in a second software set, where the first class statistical indicator includes at least one of an occurrence frequency and an occurrence probability;
a second obtaining module, configured to obtain at least one piece of binary content from the candidate binary content as a target binary content according to a second type of statistical indicator of each piece of binary content in the candidate binary content in a first software set and a second type of statistical indicator in a second software set, where the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity;
and a third obtaining module, configured to obtain a signature of the corresponding malicious binary software according to the target binary content, where the signature is used to identify the malicious binary software.
In one possible design, the extraction module is specifically configured to:
use a sliding window to extract binary content of a preset length from each binary software in the first software set and the second software set, where the preset length is the number of bytes covered by the sliding window.
In one possible design, when the first type of statistical indicator includes the occurrence frequency, the first obtaining module is specifically configured to:
obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the occurrence frequency, in the second software set, of each binary content in the first binary content from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
In one possible design, when the first type of statistical indicator includes the occurrence probability, the first obtaining module is specifically configured to:
obtaining the binary content with the occurrence probability lower than a first preset probability threshold in the first software set from the at least two sections of binary content as a second binary content;
obtaining the occurrence probability, in the second software set, of each binary content in the second binary content from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold is smaller than the second preset probability threshold.
In one possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module is specifically configured to:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in a first software set and the occurrence frequency of each section of binary content in a second software set, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each binary content in the first binary content to be processed in the first software set from the occurrence probability of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence probability of each binary content in the first binary content to be processed in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary contents in the first binary contents to be processed in the first software set and the occurrence probability of each binary contents in the second software set.
In one possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module is specifically configured to:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability in the second software set, obtaining at least one binary content from the at least two binary contents as a second binary content to be processed;
obtaining the occurrence frequency of each binary content in the second binary content to be processed in the first software set from the occurrence frequency of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence frequency of each binary content in the second binary content to be processed in the second software set from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set.
In one possible design, when the second type of statistical indicator includes the software coverage ratio, the second obtaining module is specifically configured to:
obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents;
obtaining the software coverage ratio of each binary content in the seventh binary content in the second software set from the software coverage ratio of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
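The coverage-based selection above can be sketched in Python as follows. This is an illustration only: it assumes that the software coverage ratio of a binary content in a set is the fraction of binary software in that set containing the content at least once, which this passage does not define. The function names `coverage_ratio` and `select_by_coverage` and the example thresholds are hypothetical.

```python
from typing import List, Sequence


def coverage_ratio(content: bytes, software_set: Sequence[bytes]) -> float:
    """Fraction of files in software_set that contain content (assumed definition)."""
    if not software_set:
        return 0.0
    return sum(content in software for software in software_set) / len(software_set)


def select_by_coverage(
    candidates: Sequence[bytes],
    benign_set: Sequence[bytes],
    malicious_set: Sequence[bytes],
    low_threshold: float,
    high_threshold: float,
) -> List[bytes]:
    """Keep candidates with low coverage in the benign set and high coverage in the malicious set."""
    if not low_threshold < high_threshold:
        raise ValueError("the first threshold must be smaller than the second")
    # Step 1: contents whose coverage in the first (non-malicious) set is below the first threshold.
    rare_in_benign = [c for c in candidates if coverage_ratio(c, benign_set) < low_threshold]
    # Steps 2 and 3: of those, keep contents whose coverage in the second (malicious) set is above the second threshold.
    return [c for c in rare_in_benign if coverage_ratio(c, malicious_set) > high_threshold]


benign = [b"hello world", b"good file"]
malicious = [b"xx evil yy", b"evil!!", b"other"]
targets = select_by_coverage([b"evil", b"o"], benign, malicious, 0.1, 0.5)
```

Here `b"o"` is discarded in the first step because it appears in every non-malicious file, while `b"evil"` survives both steps.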
In one possible design, when the second type of statistical indicator includes the set similarity, the second obtaining module is specifically configured to:
obtaining binary contents with the set similarity lower than a first preset similarity threshold in the first software set from the candidate binary contents as eighth binary contents;
acquiring the set similarity of each binary content in the eighth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the set similarity higher than a second preset similarity threshold in the second software set from the eighth binary content as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
In one possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module is specifically configured to:
obtaining at least one section of binary content from the candidate binary content as first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion of the candidate binary content in the second software set;
acquiring set similarity of each segment of binary content in the first characteristic binary content in the first software set from the set similarity of each segment of binary content in the candidate binary content in the first software set, and acquiring set similarity of each segment of binary content in the first characteristic binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the first characteristic binary content according to the set similarity of each section of binary content in the first characteristic binary content in the first software set and the set similarity of each section of binary content in the second software set.
In one possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module is specifically configured to:
acquiring at least one section of binary content from the candidate binary content as second characteristic binary content according to the set similarity of each section of binary content in the candidate binary content in the first software set and the set similarity of each section of binary content in the second software set;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each piece of binary content in the second characteristic binary content in the first software set and the software coverage proportion in the second software set.
In one possible design, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the second obtaining module is further configured to:
determining information entropy values respectively corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each segment of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
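A minimal Python sketch of this entropy filter is given below. It rests on an assumption not stated in this passage: the "contextual characters" of a binary content are taken to be the single bytes immediately before and after each of its occurrences, and the entropy is the Shannon entropy of their frequency distribution. The names `context_entropy` and `drop_high_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def context_entropy(content: bytes, samples: Iterable[bytes]) -> float:
    """Shannon entropy (bits) of the bytes adjacent to each occurrence of content."""
    if not content:
        return 0.0
    neighbours: Counter = Counter()
    for sample in samples:
        start = sample.find(content)
        while start != -1:
            end = start + len(content)
            if start > 0:
                neighbours[sample[start - 1]] += 1  # byte just before the occurrence
            if end < len(sample):
                neighbours[sample[end]] += 1        # byte just after the occurrence
            start = sample.find(content, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in neighbours.values())


def drop_high_entropy(
    candidates: Sequence[bytes], malicious_samples: Iterable[bytes], max_entropy: float
) -> List[bytes]:
    """Delete candidates whose context entropy exceeds the preset entropy threshold."""
    samples = list(malicious_samples)
    return [c for c in candidates if context_entropy(c, samples) <= max_entropy]


malicious = [b"aevilb", b"cevild", b"XpackX", b"XpackX"]
kept = drop_high_entropy([b"evil", b"pack"], malicious, max_entropy=1.0)
```

In this toy data `b"evil"` is surrounded by four different bytes and is removed, whereas `b"pack"` always has the same neighbours and is kept.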
In one possible design, if the candidate binary content includes at least two segments of binary content, then after the candidate binary content is obtained, the first obtaining module is further configured to:
determining an importance ranking rule according to attribute information corresponding to each segment of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: the number of occurrences of the candidate binary content in the first software set; the number of occurrences of the candidate binary content in the second software set; the occurrence probability of the candidate binary content in the first software set; the occurrence probability of the candidate binary content in the second software set; the position of the first occurrence of the candidate binary content in each binary software of the first software set, together with the mean, variance and entropy of that position; the position of the first occurrence of the candidate binary content in each binary software of the second software set, together with the mean, variance and entropy of that position; and the printable characters of the candidate binary content;
according to the importance ranking rule, ranking at least two sections of binary contents in the candidate binary contents according to the importance degree from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining the sequencing result of the target binary content according to the sequencing result of the candidate binary content;
and deleting the binary contents with the sequencing sequence numbers after the preset sequence numbers from the target binary contents.
In a third aspect, the present application provides a computing device comprising a processor and a memory. The memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform the method provided by the first aspect or the various possible designs of the first aspect, or to cause the computing device to deploy the file signature extraction apparatus provided by the second aspect or the various possible designs of the second aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer instructions that instruct a computing device to perform the method provided by the first aspect or the various possible designs of the first aspect, or instruct the computing device to deploy the file signature extraction apparatus provided by the second aspect or the various possible designs of the second aspect.
In a fifth aspect, the present application provides a computer program product comprising computer instructions. Optionally, the computer instructions are stored in a computer-readable storage medium. The computer instructions may be read by a processor of a computing device from the computer-readable storage medium, the processor executing the computer instructions to cause the computing device to perform the method provided by the first aspect or the various possible designs of the first aspect, or to cause the computing device to deploy the file signature extraction apparatus provided by the second aspect or the various possible designs of the second aspect.
In a sixth aspect, an embodiment of the present application provides a chip, which includes a memory and a processor, where the memory is used to store computer instructions, and the processor is used to call and execute the computer instructions from the memory, so as to perform the method in the first aspect and any possible implementation manner of the first aspect.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
Fig. 2 is a schematic diagram of another application scenario provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of another application scenario provided in an embodiment of the present application;
Fig. 4 is a schematic flowchart of a file signature extraction method according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of another file signature extraction method according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of another file signature extraction method according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of another file signature extraction method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a file signature extraction apparatus provided in the present application;
Fig. 9 is a schematic diagram of a basic hardware architecture of a computing device provided in the present application.
Detailed Description
The main implementation principle, the specific implementation mode and the corresponding beneficial effects of the technical scheme of the embodiment of the invention are explained in detail with reference to the drawings. In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
File signature extraction in the embodiments of the application refers to using an automated technique to extract part of the binary content of malicious software as a signature for identifying that malicious software. Malicious software includes trojans, viruses, backdoor programs, adware and the like. By extracting signatures automatically, the embodiments address the excessive workload of malware analysts and the low efficiency of signature extraction, and thus meet application needs.
The file signature extraction method and device provided by the embodiment of the application can be applied to a server, a firewall, a gateway device, or a terminal such as a host.
Optionally, the file signature extraction method and apparatus provided in the embodiment of the present application may be applied to the application scenarios shown in fig. 1, fig. 2, and fig. 3. Fig. 1, fig. 2, and fig. 3 only describe, by way of example, three possible application scenarios of the file signature extraction method provided in the embodiment of the present application, and the application scenarios of the file signature extraction method provided in the embodiment of the present application are not limited to the application scenarios shown in fig. 1, fig. 2, and fig. 3.
Fig. 1 is a schematic diagram of an enterprise network architecture. In fig. 1, the enterprise network architecture includes an analysis device 101, a network access device 102 such as a firewall or a security gateway, a switch 103 connected to the network access device 102, and a plurality of hosts 104 connected to the switch 103. The analysis device 101 is connected to the network access device 102. The analysis device 101 may be, for example, an Intrusion Prevention System (IPS) device or a Unified Threat Management (UTM) device. The analysis device 101 is configured to receive malicious binary software sent by the network access device 102 (for example, a firewall or a security gateway) or by client software installed on an intranet host 104, to extract a signature of the malicious binary software, where the signature is used to identify the malicious binary software, and to output the signature.
Fig. 2 is a schematic diagram of a cloud network architecture. In fig. 2, the cloud network architecture includes an analysis device 201 located on the core network side, and a plurality of firewall devices 202 in the access network. The analysis device 201 can be used to receive malicious binary software from a firewall device 202, to extract a signature of the malicious binary software, where the signature is used to identify the malicious binary software, and to output the signature.
Fig. 3 is a schematic diagram of a terminal architecture. In fig. 3, taking a mobile phone as an example of the terminal, the mobile phone itself carries the function of the analysis device 301, and the analysis device 301 may receive an operation instruction of a user and perform the corresponding processing. Illustratively, the user may input an extraction instruction to the mobile phone, and the analysis device 301 extracts the signature of the malicious binary software according to the extraction instruction. The user may also input an output instruction to the mobile phone, and the analysis device 301 outputs the signature of the malicious binary software according to the output instruction. In this way, malicious software is identified by automatically extracted signatures, which keeps the workload low and the signature extraction efficiency high, and meets application requirements.
It should be understood that the network architectures and service scenarios described in the embodiments of the present application are intended to illustrate the technical solutions more clearly and do not limit the technical solutions provided in the embodiments of the present application. A person of ordinary skill in the art will appreciate that, as network architectures evolve and new service scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
The file signature extraction method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings. The execution subject of the method may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3. The workflow of the analysis device 101, the analysis device 201, and the analysis device 301 mainly includes an extraction phase and a selection phase. In the extraction phase, the analysis device extracts at least two pieces of binary content from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software. In the selection phase, the analysis device first obtains at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type statistical indicator of each piece of binary content in the first software set and the first type statistical indicator in the second software set, wherein the first type statistical indicator includes at least one of occurrence frequency and occurrence probability. The analysis device then obtains at least one piece of binary content from the candidate binary content as target binary content according to the second type statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type statistical indicator in the second software set, wherein the second type statistical indicator includes at least one of software coverage ratio and set similarity. Finally, the analysis device obtains a signature of the corresponding malicious binary software according to the target binary content, the signature being used for identifying the malicious binary software.
The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 4 is a schematic flowchart of a file signature extraction method provided in an embodiment of the present application, where an execution subject in this embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 4, the method may include the following steps.
S401: at least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software.
Here, the first software set may include a plurality of non-malicious binary software, and the first number may be set according to practical situations, which is not particularly limited in the embodiment of the present application. Similarly, the second software set may include a plurality of malicious binary software, and the second number may be set according to an actual situation, which is not particularly limited in this embodiment of the application.
In this embodiment of the application, the analysis device may receive a plurality of non-malicious binary software input by an external device (for example, the device with the firewall deployed therein), or may obtain the plurality of non-malicious binary software from the non-malicious software stored in the memory, and specifically how to obtain the non-malicious binary software may be determined according to an actual situation, which is not particularly limited in this embodiment of the application.
The malicious binary software may be malicious binary software for which a signature needs to be extracted. The analysis device may receive a plurality of malicious binary software input by the external device, or may obtain the plurality of malicious binary software from the malicious software that is stored in the memory and requires signature extraction. How the malicious binary software is obtained may be determined according to the actual situation and is not particularly limited in the embodiment of the present application.
In some possible embodiments, the extracting at least two pieces of binary content from the first software set and the second software set includes:
and respectively extracting binary contents with preset length for each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
The preset length may be set according to actual needs, and may be a fixed length or a non-fixed length, which is not particularly limited in the embodiments of the present application.
Illustratively, for each binary file in the first software set and the second software set, a sliding window manner is adopted to extract binary contents with fixed length (for example, 4 bytes).
Specifically, a window with a size of k bytes may be set, the sliding direction is from left to right along the binary file (i.e. sliding from a low address to a high address in the storage space occupied by the binary file), and the displacement of each sliding is one byte, where k is a natural number greater than 1. And for each binary file in the first software set and the second software set, sliding the window from the left to the right, sliding one unit at a time, and extracting binary contents with the size of k bytes.
The binary content of each binary file in the first software set and the second software set is extracted quickly through the sliding window, and the method is simple and convenient and meets the application requirements.
In addition, after the binary contents are extracted, the position and the frequency of the first appearance of each extracted binary content in the binary file can be recorded, so that the binary contents meeting the requirements can be obtained from the extracted binary contents in the subsequent processing according to the position and the frequency.
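The sliding-window extraction and the recording of first position and frequency described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the embodiment; the function name `extract_windows` and its return layout are assumptions, and a fixed window of k bytes with a one-byte step is used.

```python
from collections import Counter
from typing import Dict, Tuple


def extract_windows(data: bytes, k: int = 4) -> Tuple[Counter, Dict[bytes, int]]:
    """Slide a k-byte window over data from low to high address, one byte per step.

    Returns (counts, first_pos): the number of occurrences of each k-byte
    content and the offset at which each content first appears.
    """
    if k < 2:
        raise ValueError("k must be a natural number greater than 1")
    counts: Counter = Counter()
    first_pos: Dict[bytes, int] = {}
    for offset in range(len(data) - k + 1):
        chunk = data[offset:offset + k]
        counts[chunk] += 1
        first_pos.setdefault(chunk, offset)  # only the first occurrence is recorded
    return counts, first_pos


counts, first_pos = extract_windows(b"abcdabcd", k=4)
```

For the 8-byte input above, the window yields the five contents `abcd`, `bcda`, `cdab`, `dabc` and `abcd` again, so `abcd` has frequency 2 and first position 0. Running the function over every file in both software sets and merging the counters per set gives the per-set occurrence frequencies used in S402.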
S402: and obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content according to the first class statistical indexes of each section of binary content in the at least two sections of binary content in the first software set and the first class statistical indexes in the second software set, wherein the first class statistical indexes comprise at least one of the occurrence frequency and the occurrence probability.
Here, the frequency of occurrence may be the number of occurrences of the binary content.
The occurrence probability may be a conditional probability, that is, the probability that an event A occurs given that an event B has occurred, written $P(A \mid B)$. If there are only two events A and B, then

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$

Taking the binary content abcd as an example, its occurrence probability in a given piece of malicious software is written $P(abcd)$:

$$P(abcd) = p(a) \times p(b \mid a) \times p(c \mid b) \times p(d \mid c)$$

That is, the occurrence probability of the segment abcd is the product of four probabilities, where $p(a)$ is the ratio of the number of occurrences of the character a to the number of occurrences of all characters in the malicious software.
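As a rough illustration of this product of four probabilities, the Python sketch below estimates the occurrence probability of a segment within one sample. The leading factor p(a) follows the description above. The conditional factors are estimated as the count of the byte pair divided by the count of the first byte when it is followed by any byte; the text does not specify this estimator, so it is an assumption, as is the function name `occurrence_probability`.

```python
from collections import Counter


def occurrence_probability(segment: bytes, sample: bytes) -> float:
    """Estimate P(segment) in sample as p(x1) * p(x2|x1) * ... * p(xn|x(n-1))."""
    if not segment or not sample:
        return 0.0
    unigrams = Counter(sample)                   # byte -> number of occurrences
    starts = Counter(sample[:-1])                # byte -> times it is followed by another byte
    bigrams = Counter(zip(sample, sample[1:]))   # (byte, next byte) -> number of occurrences
    prob = unigrams[segment[0]] / len(sample)    # p(a): share of a among all bytes
    for prev, cur in zip(segment, segment[1:]):
        if starts[prev] == 0:
            return 0.0
        # Assumed estimator: p(cur | prev) = count(prev, cur) / count(prev followed by anything)
        prob *= bigrams[(prev, cur)] / starts[prev]
    return prob


p = occurrence_probability(b"abcd", b"abcdabce")
```

In the sample `abcdabce`, p(a) = 2/8, p(b|a) = 1, p(c|b) = 1 and p(d|c) = 1/2, so the estimate is 0.125.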
In some possible embodiments, when the first type of statistical indicator includes the occurrence frequency, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each of the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set, respectively, includes:
obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the occurrence frequency, in the second software set, of each binary content in the first binary content from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
Here, if the first type of statistical indicator includes the occurrence frequency, then, according to the occurrence frequency of the at least two pieces of binary content in the non-malicious binary software of the first software set and their occurrence frequency in the malicious binary software of the second software set, binary content with a low occurrence frequency in the non-malicious binary software and a high occurrence frequency in the malicious binary software is selected from the at least two pieces of binary content as candidate binary content.
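The frequency-based selection just described can be sketched in Python as follows. The function name `select_by_frequency`, the dictionary inputs mapping each binary content to its occurrence frequency in a set, and the example thresholds are assumptions made for illustration.

```python
from typing import Dict, Set


def select_by_frequency(
    benign_freq: Dict[bytes, int],
    malicious_freq: Dict[bytes, int],
    first_threshold: int,
    second_threshold: int,
) -> Set[bytes]:
    """Select contents that are rare in the first set and frequent in the second set."""
    if not first_threshold < second_threshold:
        raise ValueError("the first threshold must be smaller than the second")
    all_contents = benign_freq.keys() | malicious_freq.keys()
    # Step 1: contents whose frequency in the non-malicious set is below the first threshold.
    first_binary = {c for c in all_contents if benign_freq.get(c, 0) < first_threshold}
    # Steps 2 and 3: look up their frequency in the malicious set and keep those above the second threshold.
    return {c for c in first_binary if malicious_freq.get(c, 0) > second_threshold}


benign_freq = {b"MZ\x90\x00": 50, b"evil": 1}
malicious_freq = {b"MZ\x90\x00": 60, b"evil": 40, b"rare": 2}
candidates = select_by_frequency(benign_freq, malicious_freq, 3, 10)
```

The common header bytes are rejected because they are frequent in non-malicious software, and `b"rare"` is rejected because it is infrequent in malicious software, leaving only `b"evil"`.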
In some possible embodiments, when the first type of statistical indicator includes the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each of the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set, respectively, includes:
obtaining the binary content with the occurrence probability lower than a first preset probability threshold in a first software set from the at least two sections of binary content as a second binary content;
obtaining the occurrence probability, in the second software set, of each binary content in the second binary content from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary contents as the candidate binary contents, wherein the first preset probability threshold is smaller than the second preset probability threshold.
Illustratively, if the first type of statistical indicator includes the occurrence probability, the binary content with low occurrence probability in non-malware and high occurrence probability in malware is selected from the at least two pieces of binary content as candidate binary content according to the occurrence probability of the at least two pieces of binary content in non-malware of the first software set and the occurrence probability of the at least two pieces of binary content in malware of the second software set.
In some possible embodiments, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each of the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set respectively includes:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in a first software set and the occurrence frequency of each section of binary content in a second software set, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each binary content in the first binary content to be processed in the first software set from the occurrence probability of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence probability of each binary content in the first binary content to be processed in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary content in the first binary contents to be processed in the first software set and the occurrence probability of each binary content in the second software set.
Here, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content as a first binary content to be processed according to the occurrence frequency, and further, candidate binary content may be obtained from the first binary content to be processed according to the occurrence probability.
Similarly, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one section of binary content may be obtained from the at least two sections of binary content according to the occurrence probability to serve as a second binary content to be processed, and further, a candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Specifically, when the first-class statistical indicator includes the occurrence frequency and the occurrence probability, obtaining at least one segment of binary content from the at least two segments of binary content according to the first-class statistical indicator of each segment of binary content in the at least two segments of binary content in the first software set and the first-class statistical indicator in the second software set, and using the at least one segment of binary content as a candidate binary content includes:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability in the second software set, obtaining at least one binary content from the at least two binary contents as a second binary content to be processed;
obtaining the occurrence frequency of each binary content in the second binary content to be processed in the first software set from the occurrence frequency of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence frequency of each binary content in the second binary content to be processed in the second software set from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set.
Here, the specific technical means for obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content are refined for the specific parameters included in the first-class statistical indexes, so that different application requirements in different application scenarios can be met and the method is suitable for practical application.
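The two-stage screening described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes that the occurrence frequency and occurrence probability of every extracted piece of content have already been computed for both software sets, and that each stage keeps content whose indicator is low in the first (non-malicious) set and high in the second (malicious) set, as in the single-indicator cases. The function names, the dictionary layout, and all values and thresholds are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Metric = Dict[bytes, float]                   # content -> indicator value in one software set
Stage = Tuple[Metric, Metric, float, float]   # (first-set values, second-set values, low, high)


def two_threshold_filter(contents: Sequence[bytes], in_first: Metric,
                         in_second: Metric, low: float, high: float) -> List[bytes]:
    """Keep content whose indicator is below `low` in the first software set
    and above `high` in the second software set."""
    if not low < high:
        raise ValueError("the first threshold must be smaller than the second")
    return [c for c in contents if in_first[c] < low and in_second[c] > high]


def screen(contents: Sequence[bytes], stages: Sequence[Stage]) -> List[bytes]:
    """Apply the stages in order; the output of one stage is the
    'binary content to be processed' of the next."""
    remaining = list(contents)
    for in_first, in_second, low, high in stages:
        remaining = two_threshold_filter(remaining, in_first, in_second, low, high)
    return remaining


# Hypothetical indicator values for three extracted pieces of content.
contents = [b"\x4d\x5a", b"evil", b"cfg="]
freq_first = {b"\x4d\x5a": 90, b"evil": 1, b"cfg=": 2}
freq_second = {b"\x4d\x5a": 95, b"evil": 40, b"cfg=": 30}
prob_first = {b"\x4d\x5a": 0.90, b"evil": 0.01, b"cfg=": 0.20}
prob_second = {b"\x4d\x5a": 0.95, b"evil": 0.80, b"cfg=": 0.60}

frequency_stage = (freq_first, freq_second, 5, 20)
probability_stage = (prob_first, prob_second, 0.05, 0.50)

# Frequency first, then probability; swapping the stages gives the other order.
candidates = screen(contents, [frequency_stage, probability_stage])
```

Passing the stages in the opposite order corresponds to screening by occurrence probability first and by occurrence frequency second.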
S403: and obtaining at least one section of binary content from the candidate binary content as target binary content according to the second type statistical indexes of each section of binary content in the candidate binary content in the first software set and the second type statistical indexes in the second software set, wherein the second type statistical indexes comprise at least one of software coverage ratio and set similarity.
Here, the software coverage ratio of the candidate binary content in the first software set may be understood as the proportion of binary software in the first software set in which the candidate binary content appears. For example, if the first software set contains 10 pieces of binary software and the candidate binary content appears in 5 of them, the software coverage ratio of the candidate binary content in the first software set is 50%.
Similarly, the software coverage ratio of the candidate binary content in the second software set may be understood as the coverage ratio of the candidate binary content in the second software set.
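As a minimal sketch of the software coverage ratio, assuming each piece of binary software is available as a byte string and that "covered" means the candidate content occurs at least once in that software, the ratio can be computed as below. The function name and sample data are hypothetical.

```python
from typing import Sequence


def coverage_ratio(content: bytes, software_set: Sequence[bytes]) -> float:
    """Fraction of binaries in `software_set` that contain `content` at least once."""
    if not software_set:
        raise ValueError("software set must not be empty")
    hits = sum(1 for binary in software_set if content in binary)
    return hits / len(software_set)


# Ten hypothetical binaries, five of which contain the candidate content.
first_set = [b"\x00\x01PAYLOAD\x02"] * 5 + [b"\x00\x01\x02\x03"] * 5
ratio = coverage_ratio(b"PAYLOAD", first_set)  # 5 of 10 binaries, i.e. 50%
```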
The set similarity can be determined by the Jaccard coefficient, also called the Jaccard similarity coefficient, which is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the similarity of the samples. Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B, as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Here, the set similarity of the candidate binary content in the first software set may be understood as the similarity between the sets of binary software in the first software set in which the respective pieces of candidate binary content appear. For example, suppose the first software set contains 100 pieces of binary software and the candidate binary content includes a first binary content and a second binary content. The first binary content appears in 20 pieces of binary software in the first software set, which are taken as a first set; the second binary content appears in 10 pieces of binary software in the first software set, which are taken as a second set. The similarity between the first set and the second set, determined according to the Jaccard coefficient, is taken as the set similarity of the candidate binary content in the first software set.
Similarly, the set similarity of the candidate binary content in the second software set may be understood as the similarity between the sets of binary software in the second software set in which the respective pieces of candidate binary content appear.
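The set similarity example can be illustrated with a short sketch. It assumes that each piece of binary software is a byte string identified by its index in the software set; the overlap of 5 binaries between the two appearance sets is a hypothetical choice, since the example above only gives the set sizes. Treating two empty sets as having similarity 0 is also an assumed convention.

```python
from typing import Sequence, Set


def appearance_set(content: bytes, software_set: Sequence[bytes]) -> Set[int]:
    """Indices of the binaries in `software_set` that contain `content`."""
    return {i for i, binary in enumerate(software_set) if content in binary}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets give 0 (an assumed convention)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical layout matching the example sizes: 100 binaries, the first
# content appears in binaries 0..19 and the second in binaries 15..24.
software_set = []
for i in range(100):
    blob = b"\x00"
    if i < 20:
        blob += b"FIRST"
    if 15 <= i < 25:
        blob += b"SECOND"
    software_set.append(blob)

first = appearance_set(b"FIRST", software_set)     # 20 binaries
second = appearance_set(b"SECOND", software_set)   # 10 binaries
similarity = jaccard(first, second)                # 5 shared out of 25 in the union
```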
In some possible embodiments, when the second type of statistical indicator includes the software coverage ratio, obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in a first software set from the candidate binary contents as seventh binary contents;
obtaining the software coverage ratio of each binary content in the seventh binary content in the second software set from the software coverage ratio of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
Here, if the second type of statistical indicator includes the software coverage ratio, binary content with a low software coverage ratio in non-malicious binary software and a high software coverage ratio in malicious binary software is selected from the candidate binary content as the target binary content, according to the software coverage ratio of the candidate binary content in the non-malicious binary software of the first software set and its software coverage ratio in the malicious binary software of the second software set.
In other possible embodiments, when the second type of statistical indicator includes the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining binary contents with the set similarity lower than a first preset similarity threshold in the first software set from the candidate binary contents as eighth binary contents;
acquiring the set similarity of each binary content in the eighth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the set similarity higher than a second preset similarity threshold in the second software set from the eighth binary content as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
Illustratively, if the second type of statistical indicator includes the set similarity, binary content with a low set similarity in non-malicious binary software and a high set similarity in malicious binary software is selected from the candidate binary content as the target binary content, according to the set similarity of the candidate binary content in the non-malicious binary software of the first software set and its set similarity in the malicious binary software of the second software set.
In some other possible embodiments, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining at least one section of binary content from the candidate binary content as first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion of the candidate binary content in the second software set;
acquiring set similarity of each segment of binary content in the first characteristic binary content in the first software set from the set similarity of each segment of binary content in the candidate binary content in the first software set, and acquiring set similarity of each segment of binary content in the first characteristic binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and acquiring target binary contents from the first characteristic binary contents according to the set similarity of each section of binary contents in the first characteristic binary contents in the first software set and the set similarity of each section of binary contents in the second software set.
Here, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content as the first characteristic binary content according to the software coverage ratio, and further, the target binary content may be obtained from the first characteristic binary content according to the set similarity.
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content as a second characteristic binary content according to the set similarity, and further, a target binary content may be obtained from the second characteristic binary content according to the software coverage ratio.
Specifically, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one section of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each section of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set includes:
acquiring at least one section of binary content from the candidate binary content as second characteristic binary content according to the set similarity of each section of binary content in the candidate binary content in the first software set and the set similarity of each section of binary content in the second software set;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each piece of binary content in the second characteristic binary content in the first software set and the software coverage proportion of each piece of binary content in the second software set.
Here, the specific technical means for obtaining the target binary content from the candidate binary content are refined for the specific parameters included in the second type of statistical indexes, so that different application requirements in different application scenarios can be met and the method is suitable for practical application.
In addition, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining information entropy values respectively corresponding to each section of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each section of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
The higher the information entropy value, the more random the context of the content; the lower the information entropy value, the more consistent the context. If the information entropy value of a candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the contexts in which the candidate binary content appears in that malicious binary software differ from one another, and the candidate binary content is deleted (a candidate binary content that is actually malicious is expected to appear with the same context in the malicious binary software of the second software set). Here, the preset entropy threshold may be set according to actual conditions, and the application is not particularly limited in this regard.
The information entropy is calculated as follows:
$$H = -\sum_{i} p_i \log p_i$$
where $p_i$ denotes the ratio of the number of occurrences of the i-th character to the number of occurrences of all characters.
After the information entropy value of the candidate binary content in the malicious binary software of the second software set is determined, the candidate binary content is screened according to the information entropy value, and the binary content with the information entropy value higher than a preset entropy value threshold value is deleted from the candidate binary content, so that the subsequent processing result is more accurate, the subsequent processing is simpler, and the application requirement is met.
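The entropy-based screening can be sketched as follows. Several details are assumptions made only for illustration: the context of a candidate is taken to be a fixed window of bytes immediately before and after each of its occurrences (the window size of 4 is hypothetical), the logarithm is taken to base 2, and the threshold value is arbitrary.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(data: bytes) -> float:
    """-sum(p_i * log2(p_i)), where p_i is the share of byte value i in `data`."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def context_bytes(content: bytes, software_set: Sequence[bytes], window: int = 4) -> bytes:
    """Concatenate the bytes just before and just after every occurrence of `content`."""
    out = bytearray()
    for binary in software_set:
        start = binary.find(content)
        while start != -1:
            end = start + len(content)
            out += binary[max(0, start - window):start]
            out += binary[end:end + window]
            start = binary.find(content, start + 1)
    return bytes(out)


def drop_high_entropy(candidates: Sequence[bytes], malicious_set: Sequence[bytes],
                      threshold: float, window: int = 4) -> List[bytes]:
    """Delete candidates whose context entropy in the malicious set exceeds `threshold`."""
    return [c for c in candidates
            if shannon_entropy(context_bytes(c, malicious_set, window)) <= threshold]


# Hypothetical malicious set: "sig" always has the same context, "odd" does not.
malicious_set = [b"XXXXsigYYYY", b"XXXXsigYYYY", b"abcdoddefgh"]
kept = drop_high_entropy([b"sig", b"odd"], malicious_set, threshold=2.0)
```

In this example the context of `b"sig"` consists of only two distinct byte values, so its entropy is 1 bit, whereas the context of `b"odd"` consists of eight distinct byte values, giving 3 bits; only `b"sig"` survives a threshold of 2.0.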
S404: and obtaining a signature of corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
Optionally, first, the target binary content corresponding to each piece of malicious binary software is determined according to the corresponding relationship between the target binary content and the malicious binary software in the second software set, then, the target binary content corresponding to each piece of malicious binary software is combined, and the combined result is used as a signature of the corresponding malicious binary software for identifying the malicious binary software.
Optionally, the target binary content corresponding to each piece of malicious binary software is combined according to a preset combination requirement, where the preset combination requirement may be set according to the actual situation and is not particularly limited in the embodiments of the present application.
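One possible way to combine target binary content into signatures is sketched below. The combination requirement is not specified in the text, so this sketch makes an assumption: the signature of a piece of malicious software is the tuple of all target contents found in it, and a file matches the signature only if it contains every part. The sample names and data are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Signature = Tuple[bytes, ...]


def build_signatures(targets: Sequence[bytes],
                     malicious_set: Dict[str, bytes]) -> Dict[str, Signature]:
    """Map each malicious sample to the target contents that occur in it."""
    signatures = {}
    for name, binary in malicious_set.items():
        matched = tuple(t for t in targets if t in binary)
        if matched:  # samples containing no target content get no signature
            signatures[name] = matched
    return signatures


def matches(signature: Signature, binary: bytes) -> bool:
    """Assumed matching rule: every part of the signature must be present."""
    return all(part in binary for part in signature)


# Hypothetical second software set and target binary content.
malicious_set = {
    "sample_a": b"\x00evil\x01cfg=\x02",
    "sample_b": b"\x7fevil\x7f",
}
signatures = build_signatures([b"evil", b"cfg="], malicious_set)
```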
In some possible embodiments, if the candidate binary content includes at least two pieces of binary content, the method further includes, after obtaining the at least two pieces of binary content from the binary content as the candidate binary content:
determining an importance ranking rule according to attribute information corresponding to each section of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: a number of the candidate binary content in the first software set, a number of the candidate binary content in the second software set, a probability of occurrence of the candidate binary content in the first software set, a probability of occurrence of the candidate binary content in the second software set, a location of the candidate binary content at a first occurrence in each binary software of the first software set and a mean, variance and entropy of the location, a location of the candidate binary content at a first occurrence in each binary software of the second software set and a mean, variance and entropy of the location, and printable characters of the candidate binary content;
according to the importance ranking rule, ranking at least two sections of binary contents in the candidate binary contents according to the importance degree from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining the sequencing result of the target binary content according to the sequencing result of the candidate binary content;
and deleting the binary contents with the sequencing sequence numbers after the preset sequence numbers from the target binary contents.
In this way, an importance ranking rule is determined according to the attribute information of the candidate binary content, the candidate binary content is ranked from high to low importance according to that rule, the ranking result of the target binary content is determined from the ranking result of the candidate binary content, and binary content whose ranking sequence number falls after the preset sequence number is deleted from the target binary content. Signatures of the corresponding malicious binary software can therefore be generated from the binary content of higher importance, so that the generated signatures identify the malicious binary software more accurately.
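The ranking and truncation steps can be sketched as follows. The importance ranking rule itself is left open by the text; purely as an assumption, this sketch scores each candidate by the difference between its occurrence probability in the second software set and in the first. The function names, the probability values, and the preset sequence number are hypothetical.

```python
from typing import Callable, List, Sequence


def rank_by_importance(candidates: Sequence[bytes],
                       score: Callable[[bytes], float]) -> List[bytes]:
    """Order candidates from most to least important under `score`."""
    return sorted(candidates, key=score, reverse=True)


def keep_top_targets(targets: Sequence[bytes], ranked_candidates: Sequence[bytes],
                     preset_rank: int) -> List[bytes]:
    """Order targets by their position in the candidate ranking and
    delete those ranked after `preset_rank`."""
    position = {c: i for i, c in enumerate(ranked_candidates)}
    ranked_targets = sorted(targets, key=lambda t: position[t])
    return ranked_targets[:preset_rank]


# Hypothetical occurrence probabilities of four candidates in the two sets.
prob_first = {b"aa": 0.10, b"bb": 0.01, b"cc": 0.05, b"dd": 0.02}
prob_second = {b"aa": 0.60, b"bb": 0.90, b"cc": 0.70, b"dd": 0.40}

ranked = rank_by_importance(list(prob_first), lambda c: prob_second[c] - prob_first[c])
targets = keep_top_targets([b"aa", b"bb", b"dd"], ranked, preset_rank=2)
```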
According to the embodiment of the application, at least two sections of binary content are extracted from a first software set and a second software set. At least one section of binary content is then obtained from the at least two sections of binary content as candidate binary content, according to the first-class statistical indexes of the at least two sections of binary content in the first software set and in the second software set. Target binary content is further obtained from the candidate binary content according to the second-class statistical indexes of the candidate binary content in the first software set and in the second software set, and a signature of the corresponding malicious binary software is obtained according to the target binary content. Part of the binary content of the malicious software is thus extracted by an automated technique, without relying on manual analysis, as a signature for identifying the malicious software. This addresses the excessive workload of malicious software analysts and the low efficiency of signature extraction, improves the efficiency of extracting signatures of malicious software, and meets application requirements. In addition, the signature extraction method provided by the embodiment of the application is not influenced by the personal experience and subjective factors of an analyst, which improves the extraction accuracy of malicious software signatures to a certain extent.
Fig. 5 is a schematic flowchart of another file signature extraction method provided in an embodiment of the present application, where an execution subject in this embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 5, the method may include:
S501: At least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software.
Step S501 is the same as the implementation of step S401, and is not described herein again.
Optionally, the analysis device chooses to execute one or more of the following sub-flows: the sub-flow composed of steps S502 to S504, the sub-flow composed of steps S505 to S507, or the sub-flow composed of steps S508 to S510.
S502: when the first type of statistical index comprises the occurrence frequency, obtaining the binary content with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary content as the first binary content.
S503: and obtaining the appearance frequency of each binary content in the first binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set.
S504: and obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
Here, if the first type of statistical indicator includes the occurrence frequency, according to the occurrence frequency of the at least two pieces of binary content in non-malicious binary software in the first software set and the occurrence frequency of the at least two pieces of binary content in malicious binary software in the second software set, binary content with low occurrence frequency in non-malicious binary software and high occurrence frequency in malicious binary software is selected from the at least two pieces of binary content and is used as candidate binary content.
S505: and when the first-class statistical indexes comprise the occurrence probability, obtaining the binary content with the occurrence probability lower than a first preset probability threshold value in the first software set from the at least two sections of binary content as a second binary content.
S506: and obtaining the occurrence probability of each binary content in the second binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set.
S507: and obtaining binary contents with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold is smaller than the second preset probability threshold.
Illustratively, if the first type of statistical indicator includes the occurrence probability, the binary content with low occurrence probability in non-malware and high occurrence probability in malware is selected from the at least two pieces of binary content as candidate binary content according to the occurrence probability of the at least two pieces of binary content in non-malware of the first software set and the occurrence probability of the at least two pieces of binary content in malware of the second software set.
S508: when the first type of statistical index includes the occurrence frequency and the occurrence probability, at least one section of binary content is obtained from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in the first software set and the occurrence frequency in the second software set, and the at least one section of binary content is used as the first binary content to be processed.
In some possible embodiments, the obtaining at least one piece of binary content from the at least two pieces of binary content as the first binary content to be processed according to the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the first software set and the occurrence frequency of each piece of binary content in the second software set respectively includes:
obtaining the binary content with the occurrence frequency lower than a third preset frequency threshold value in the first software set from the at least two sections of binary content as a third binary content;
obtaining the appearance frequency of each binary content in the third binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set respectively;
and obtaining the binary content with the occurrence frequency higher than a fourth preset frequency threshold value in the second software set from the third binary content as the first binary content to be processed, wherein the third preset frequency threshold value is smaller than the fourth preset frequency threshold value.
Here, binary content with a low occurrence frequency in non-malicious binary software and a high occurrence frequency in malicious binary software is selected from the at least two sections of binary content as the first binary content to be processed, according to the occurrence frequency of the at least two sections of binary content in the non-malicious binary software of the first software set and their occurrence frequency in the malicious binary software of the second software set.
S509: obtaining the occurrence probability of each binary content in the first binary content to be processed in the first software set from the occurrence probability of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence probability of each binary content in the first binary content to be processed in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set.
S510: and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary contents in the first binary contents to be processed in the first software set and the occurrence probability of each binary contents in the second software set.
In some possible embodiments, the obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary contents to be processed in the first set of software and the occurrence probability of each binary contents to be processed in the second set of software includes:
obtaining the binary content with the occurrence probability lower than a third preset probability threshold in the first software set from the first binary content to be processed as a fourth binary content;
obtaining the occurrence probability of each binary content in the fourth binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence probability higher than a fourth preset probability threshold value in the second software set from the fourth binary contents as candidate binary contents, wherein the third preset probability threshold value is smaller than the fourth preset probability threshold value.
Here, binary content with a low occurrence probability in non-malicious binary software and a high occurrence probability in malicious binary software is selected from the first binary content to be processed as candidate binary content, according to the occurrence probability of the first binary content to be processed in the non-malicious binary software of the first software set and its occurrence probability in the malicious binary software of the second software set.
In addition, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the at least one section of binary content is obtained from the at least two sections of binary content according to the occurrence frequency, and is used as a first binary content to be processed, and further, a candidate binary content is obtained from the first binary content to be processed according to the occurrence probability.
Similarly, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one section of binary content may be obtained from the at least two sections of binary content according to the occurrence probability to serve as a second binary content to be processed, and further, a candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Illustratively, obtaining at least one piece of binary content from the at least two pieces of binary content as a second binary content to be processed according to the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set and the occurrence probability in the second software set respectively comprises:
obtaining the binary content with the occurrence probability lower than a fifth preset probability threshold in the first software set from the at least two sections of binary content as a fifth binary content;
obtaining the occurrence probability of each binary content in the fifth binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining the binary content with the occurrence probability higher than a sixth preset probability threshold in the second software set from the fifth binary content as the second binary content to be processed, wherein the fifth preset probability threshold is smaller than the sixth preset probability threshold.
Here, binary content with a low occurrence probability in non-malicious binary software and a high occurrence probability in malicious binary software is selected from the at least two sections of binary content as the second binary content to be processed, according to the occurrence probability of the at least two sections of binary content in the non-malicious binary software of the first software set and their occurrence probability in the malicious binary software of the second software set.
In some possible embodiments, obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set respectively includes:
obtaining the binary content with the occurrence frequency lower than a fifth preset frequency threshold value in the first software set from the second binary content to be processed as a sixth binary content;
obtaining the appearance frequency of each binary content in the sixth binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence frequency higher than a sixth preset frequency threshold value in the second software set from the sixth binary contents as candidate binary contents, wherein the fifth preset frequency threshold value is smaller than the sixth preset frequency threshold value.
Here, according to the frequency of occurrence of the second binary content to be processed in the non-malicious binary software of the first software set and the frequency of occurrence of the second binary content to be processed in the malicious binary software of the second software set, the binary content with low frequency of occurrence in the non-malicious binary software and high frequency of occurrence in the malicious binary software is selected from the second binary content to be processed as the candidate binary content.
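The two-stage selection described above (occurrence probability first, then occurrence frequency) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary inputs holding precomputed statistics, and the tuple form of the thresholds are hypothetical and do not come from the application.

```python
from typing import Dict, Iterable, List, Tuple


def low_in_benign_high_in_malicious(
    contents: Iterable[bytes],
    stat_benign: Dict[bytes, float],
    stat_malicious: Dict[bytes, float],
    thresholds: Tuple[float, float],
) -> List[bytes]:
    """Keep contents whose statistic is below the lower threshold in the
    first (non-malicious) set and above the upper threshold in the second
    (malicious) set. A missing statistic is treated as 0 (an assumption)."""
    lower, upper = thresholds
    if not lower < upper:
        raise ValueError("the lower threshold must be smaller than the upper one")
    rare_in_benign = [c for c in contents if stat_benign.get(c, 0.0) < lower]
    return [c for c in rare_in_benign if stat_malicious.get(c, 0.0) > upper]


def select_candidates(
    contents: Iterable[bytes],
    prob_benign: Dict[bytes, float],
    prob_malicious: Dict[bytes, float],
    freq_benign: Dict[bytes, float],
    freq_malicious: Dict[bytes, float],
    prob_thresholds: Tuple[float, float],
    freq_thresholds: Tuple[float, float],
) -> List[bytes]:
    """Stage 1 filters by occurrence probability (fifth/sixth probability
    thresholds); stage 2 filters the survivors by occurrence frequency
    (fifth/sixth frequency thresholds)."""
    pending = low_in_benign_high_in_malicious(
        contents, prob_benign, prob_malicious, prob_thresholds
    )
    return low_in_benign_high_in_malicious(
        pending, freq_benign, freq_malicious, freq_thresholds
    )
```

Both stages reuse one helper because the application applies the same "low in the first set, high in the second set" test to each statistic.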
S511: and obtaining at least one section of binary content from the candidate binary content as target binary content according to the second class statistical indexes of each section of binary content in the candidate binary content in the first software set and the second class statistical indexes in the second software set, wherein the second class statistical indexes comprise at least one of software coverage ratio and set similarity.
S512: and obtaining a signature of corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
The steps S511-S512 are the same as the steps S403-S404, and are not described herein again.
In this embodiment of the application, after at least two pieces of binary content are extracted from the first software set and the second software set, binary content with a low occurrence frequency and/or occurrence probability in non-malware and a high occurrence frequency and/or occurrence probability in malware is selected from the at least two pieces of binary content as candidate binary content. The subsequently extracted signatures are therefore general enough to cover most malware while overlapping little with non-malware content, which reduces false alarms. The target binary content is then obtained from the candidate binary content, and the signature of the corresponding malware is obtained according to the target binary content. In this way, partial binary content of the malware is extracted automatically, without relying on manual analysis, to serve as the signature for identifying the malware. This addresses the excessive workload of malware analysts and the low efficiency of signature extraction, improves the efficiency of extracting malware signatures, and meets application requirements. In addition, the signature extraction method provided by this embodiment is not influenced by the personal experience or subjective judgment of an analyst, which improves the accuracy of malware signature extraction to a certain extent.
Fig. 6 is a schematic flowchart of another file signature extraction method provided in an embodiment of the present application, where an execution subject in this embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 6, the method may include:
S601: at least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software.
S602: and obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content according to the first class statistical indexes of each section of binary content in the at least two sections of binary content in the first software set and the first class statistical indexes in the second software set, wherein the first class statistical indexes comprise at least one of the occurrence frequency and the occurrence probability.
The implementation of steps S601 to S602 is the same as that of steps S401 to S402, and is not described herein again.
Optionally, the analysis device selects and executes one or more of the following sub-flows: the sub-flow composed of steps S603 to S605, the sub-flow composed of steps S606 to S608, or the sub-flow composed of steps S609 to S611.
S603: and when the second type of statistical index comprises the software coverage proportion, obtaining the binary content of which the software coverage proportion in the first software set is lower than a first preset proportion threshold value from the candidate binary content as seventh binary content.
S604: and obtaining the software coverage ratio of each binary content in the seventh binary content in the second software set from the software coverage ratio of each binary content in the candidate binary content in the second software set.
S605: and obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
Here, if the second type of statistical indicator includes the software coverage ratio, selecting binary contents with low software coverage ratio in non-malware and high software coverage ratio in malware from the candidate binary contents as target binary contents according to the software coverage ratio of the candidate binary contents in non-malware binary software of the first software set and the software coverage ratio in malware of the second software set.
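Steps S603 to S605 can be sketched as below. The sketch assumes that the software coverage ratio of a piece of binary content is the fraction of binary software in a set that contains it at least once; that definition, the function names, and the parameter names are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def coverage_ratio(content: bytes, software_set: Sequence[bytes]) -> float:
    """Fraction of files in software_set that contain content at least once
    (assumed definition of the software coverage ratio)."""
    if not software_set:
        return 0.0
    return sum(content in software for software in software_set) / len(software_set)


def select_targets_by_coverage(
    candidates: Iterable[bytes],
    first_set: Sequence[bytes],
    second_set: Sequence[bytes],
    first_ratio_threshold: float,
    second_ratio_threshold: float,
) -> List[bytes]:
    if not first_ratio_threshold < second_ratio_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    # S603: keep content that covers few non-malicious files.
    seventh = [
        c for c in candidates
        if coverage_ratio(c, first_set) < first_ratio_threshold
    ]
    # S604: look up the coverage of the survivors in the malicious set.
    malicious_coverage = {c: coverage_ratio(c, second_set) for c in seventh}
    # S605: keep content that covers many malicious files.
    return [c for c in seventh if malicious_coverage[c] > second_ratio_threshold]
```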
S606: and when the second type of statistical indexes comprise the set similarity, obtaining the binary content with the set similarity lower than a first preset similarity threshold value in the first software set from the candidate binary content as eighth binary content.
S607: and obtaining the set similarity of each binary content in the eighth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set.
S608: and obtaining the binary content with the set similarity higher than a second preset similarity threshold in the second software set from the eighth binary content as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
Illustratively, if the second type of statistical indicator includes the set similarity, then according to the set similarity of the candidate binary content in the non-malicious binary software of the first software set and the set similarity of the candidate binary content in the malicious binary software of the second software set, binary content with a low set similarity in the non-malicious binary software and a high set similarity in the malicious binary software is selected from the candidate binary content as the target binary content.
S609: and when the second type of statistical indexes comprise the software coverage proportion and the set similarity, acquiring at least one section of binary content from the candidate binary content as first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion of the second software set.
In some possible embodiments, the obtaining at least one piece of binary content from the candidate binary content according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio of the second software set respectively as the first feature binary content includes:
obtaining binary contents with the software coverage proportion lower than a third preset proportion threshold value in the first software set from the candidate binary contents as ninth binary contents;
obtaining the software coverage ratio of each segment of binary content in the ninth binary content in the second software set from the software coverage ratio of each segment of binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a fourth preset ratio threshold value in the second software set from the ninth binary content as the first characteristic binary content, wherein the third preset ratio threshold value is smaller than the fourth preset ratio threshold value.
Here, binary contents with a low software coverage ratio in non-malware and a high software coverage ratio in malware are selected from the candidate binary contents as the first feature binary contents according to the software coverage ratio of the candidate binary contents in non-malware binary software of the first software set and the software coverage ratio of the candidate binary contents in malware binary software of the second software set.
S610: and obtaining the set similarity of each segment of binary content in the first characteristic binary content in the first software set from the set similarity of each segment of binary content in the candidate binary content in the first software set, and obtaining the set similarity of each segment of binary content in the first characteristic binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set.
S611: and acquiring target binary contents from the first characteristic binary contents according to the set similarity of each section of binary contents in the first characteristic binary contents in the first software set and the set similarity of each section of binary contents in the second software set.
In some possible embodiments, the obtaining the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and the set similarity of the second software set respectively includes:
obtaining binary contents with the set similarity lower than a third preset similarity threshold in the first software set from the first characteristic binary contents as tenth binary contents;
acquiring the set similarity of each binary content in the tenth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the set similarity higher than a fourth preset similarity threshold in the second software set from the tenth binary content as the target binary content, wherein the third preset similarity threshold is smaller than the fourth preset similarity threshold.
Here, according to the set similarity of the first characteristic binary content in the non-malicious binary software of the first software set and the set similarity of the first characteristic binary content in the malicious binary software of the second software set, binary content with a low set similarity in the non-malicious binary software and a high set similarity in the malicious binary software is selected from the first characteristic binary content as the target binary content.
Here, when the second type of statistical indicator includes both the software coverage ratio and the set similarity, at least one piece of binary content is first obtained from the candidate binary content as the first characteristic binary content according to the software coverage ratio, and the target binary content is then obtained from the first characteristic binary content according to the set similarity.
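The cascade of steps S609 to S611 can be illustrated with the sketch below. It takes the software coverage ratio and the set similarity as precomputed inputs, since how they are calculated is not restated here; the `Stats` record, its field names, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stats:
    """Precomputed second-type indicators for one piece of candidate content."""
    coverage_first: float      # software coverage ratio in the first set
    coverage_second: float     # software coverage ratio in the second set
    similarity_first: float    # set similarity in the first set
    similarity_second: float   # set similarity in the second set


def coverage_then_similarity(
    stats: Dict[bytes, Stats],
    ratio_thresholds: Tuple[float, float],
    similarity_thresholds: Tuple[float, float],
) -> List[bytes]:
    low_ratio, high_ratio = ratio_thresholds            # third / fourth ratio thresholds
    low_sim, high_sim = similarity_thresholds           # third / fourth similarity thresholds
    if not (low_ratio < high_ratio and low_sim < high_sim):
        raise ValueError("each lower threshold must be smaller than its upper threshold")
    # S609: first characteristic binary content, selected by coverage ratio.
    first_feature = [
        c for c, s in stats.items()
        if s.coverage_first < low_ratio and s.coverage_second > high_ratio
    ]
    # S610-S611: target binary content, selected from the survivors by set similarity.
    return [
        c for c in first_feature
        if stats[c].similarity_first < low_sim and stats[c].similarity_second > high_sim
    ]
```

Because the second stage only inspects the survivors of the first stage, placing the cheaper statistic first reduces the number of lookups needed for the more expensive one.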
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content as a second characteristic binary content according to the set similarity, and further, a target binary content may be obtained from the second characteristic binary content according to the software coverage ratio.
Illustratively, obtaining at least one piece of binary content from the candidate binary content as the second characteristic binary content according to the set similarity of each piece of binary content in the candidate binary content in the first software set and the set similarity of each piece of binary content in the candidate binary content in the second software set respectively comprises:
obtaining binary contents with the set similarity lower than a fifth preset similarity threshold in the first software set from the candidate binary contents as eleventh binary contents;
acquiring the set similarity of each binary content in the eleventh binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining binary contents with the set similarity higher than a sixth preset similarity threshold in the second software set from the eleventh binary contents as second characteristic binary contents, wherein the fifth preset similarity threshold is smaller than the sixth preset similarity threshold.
Here, according to the set similarity of the candidate binary content in the non-malicious binary software of the first software set and its set similarity in the malicious binary software of the second software set, binary content with a low set similarity in non-malware and a high set similarity in malware is selected from the candidate binary content as the second characteristic binary content.
In some possible embodiments, obtaining the target binary content from the second feature binary content according to the software coverage ratio of each piece of binary content in the second feature binary content in the first software set and the software coverage ratio in the second software set respectively includes:
obtaining binary contents with the software coverage proportion lower than a fifth preset proportion threshold value in the first software set from the second characteristic binary contents as twelfth binary contents;
obtaining the software coverage proportion of each segment of binary content in the twelfth binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a sixth preset ratio threshold value in the second software set from the twelfth binary content as the target binary content, wherein the fifth preset ratio threshold value is smaller than the sixth preset ratio threshold value.
Here, binary content with a low software coverage ratio in non-malware and a high software coverage ratio in malware is selected from the second characteristic binary content as the target binary content, according to the software coverage ratio of the second characteristic binary content in the non-malicious binary software of the first software set and the software coverage ratio of the second characteristic binary content in the malicious binary software of the second software set.
S612: and obtaining a signature of corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
Step S612 is implemented in the same manner as step S404, and is not described herein again.
In this embodiment of the application, after at least two pieces of binary content are extracted from the first software set and the second software set, at least one piece of binary content is obtained from them as candidate binary content according to the first type of statistical indicator of each piece of binary content in the first software set and in the second software set, where the first type of statistical indicator includes at least one of the occurrence frequency and the occurrence probability. Binary content with a low software coverage ratio and/or set similarity in non-malware and a high software coverage ratio and/or set similarity in malware is then selected from the candidate binary content as the target binary content, and the signature of the corresponding malware is obtained according to the target binary content. The extracted signature is therefore general enough to cover most malware while overlapping little with non-malware content, which reduces false alarms. At the same time, partial binary content of the malware is extracted automatically, without relying on manual analysis, to serve as the signature for identifying the malware. This addresses the excessive workload of malware analysts and the low efficiency of signature extraction, and meets application requirements. In addition, the signature extraction method provided by this embodiment is not influenced by the personal experience or subjective judgment of an analyst, which improves the accuracy of malware signature extraction to a certain extent.
Fig. 7 is a schematic flowchart of another file signature extraction method provided in an embodiment of the present application, where an execution subject in this embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 7, the method may include:
S701: at least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software.
S702: and obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content according to the first class statistical indexes of each section of binary content in the at least two sections of binary content in the first software set and the first class statistical indexes in the second software set, wherein the first class statistical indexes comprise at least one of the occurrence frequency and the occurrence probability.
The steps S701 to S702 are the same as the steps S401 to S402, and are not described herein again.
S703: and determining information entropy values respectively corresponding to each section of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each section of binary content in the binary software of the second software set.
Here, the contextual characters include preceding (upper) characters and following (lower) characters, and the information entropy value includes a preceding-context (upper) entropy value and a following-context (lower) entropy value. The preceding characters are a preset number of characters immediately before the position at which each piece of binary content in the candidate binary content appears in the binary software of the second software set, and the following characters are a preset number of characters immediately after that position. The preset number of preceding characters and the preset number of following characters may be determined according to the actual situation, which is not particularly limited in this embodiment of the application.
The determining, according to the frequency of occurrence of the contextual characters of each segment of the binary content in the candidate binary content in the binary software of the second software set, information entropy values corresponding to each segment of the binary content in the candidate binary content respectively includes:
determining the preceding-context (upper) entropy value corresponding to each piece of binary content in the candidate binary content according to the proportion that the occurrence frequency of each preceding character in the binary software of the second software set accounts for in the occurrence frequency of all of the corresponding preceding characters;
and determining the following-context (lower) entropy value corresponding to each piece of binary content in the candidate binary content according to the proportion that the occurrence frequency of each following character in the binary software of the second software set accounts for in the occurrence frequency of all of the corresponding following characters.
S704: and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
The higher the information entropy value, the more random the context of the binary content; the lower the information entropy value, the more consistent the context. If the information entropy value of a piece of candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the contexts in which that content appears in the malicious binary software differ from one another, and the candidate binary content is deleted. (If a piece of candidate binary content is genuinely malicious, it is expected to appear with the same context across the malicious binary software of the second software set.)
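Steps S703 and S704 can be sketched as follows, assuming that the information entropy is the Shannon entropy of the distribution of preceding (or following) characters, that one "character" is a byte string of the preset length, and that a candidate is deleted when either of its two entropy values exceeds the threshold. These choices, and all function and parameter names, are assumptions rather than details confirmed by the application.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def context_entropies(
    content: bytes,
    second_set: Sequence[bytes],
    before: int = 1,
    after: int = 1,
) -> Tuple[float, float]:
    """Return (preceding-context entropy, following-context entropy) of
    content over every occurrence in the malicious software set."""
    preceding: Counter = Counter()
    following: Counter = Counter()
    for software in second_set:
        start = software.find(content)
        while start != -1:
            end = start + len(content)
            if start >= before:                      # enough bytes before the match
                preceding[software[start - before:start]] += 1
            if end + after <= len(software):         # enough bytes after the match
                following[software[end:end + after]] += 1
            start = software.find(content, start + 1)
    return shannon_entropy(preceding), shannon_entropy(following)


def drop_high_entropy(
    candidates: Iterable[bytes],
    second_set: Sequence[bytes],
    entropy_threshold: float,
) -> List[bytes]:
    """S704: delete candidates whose context entropy exceeds the threshold."""
    kept = []
    for content in candidates:
        upper, lower = context_entropies(content, second_set)
        if max(upper, lower) <= entropy_threshold:
            kept.append(content)
    return kept
```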
S705: and determining an importance ranking rule according to the attribute information corresponding to each section of binary content in the candidate binary content.
The attribute information includes at least one of the following: the number of occurrences of the candidate binary content in the first software set; the number of occurrences of the candidate binary content in the second software set; the occurrence probability of the candidate binary content in the first software set; the occurrence probability of the candidate binary content in the second software set; the position at which the candidate binary content first appears in each binary software of the first software set, together with the mean, variance and entropy of that position; the position at which the candidate binary content first appears in each binary software of the second software set, together with the mean, variance and entropy of that position; and the printable characters of the candidate binary content.
Here, the attribute information may be adjusted according to actual situations, and this is not limited in this embodiment of the present application.
S706: and ranking at least two sections of binary content in the candidate binary content from high importance to low importance according to the importance ranking rule.
Illustratively, the importance ranking rule includes the following in order from high importance to low importance:
the number of the binary contents at the same position of the second software set is higher than a first preset number threshold;
the variance of the positions of the binary contents appearing in each binary software of the first software set is smaller than a first preset variance threshold, the variance of the positions of the corresponding binary contents appearing in each binary software of the second software set is smaller than a second preset variance threshold, the entropy of the positions of the corresponding binary contents appearing in each binary software of the first software set is smaller than a first preset entropy threshold, the entropy of the positions of the corresponding binary contents appearing in each binary software of the second software set is smaller than a second preset entropy threshold, and the ratio of the occurrence probability of the corresponding binary contents in the first software set to the occurrence probability in the second software set is larger than a preset ratio threshold;
the distance between the position where the binary content appears for the first time in each binary software of the first software set and the file head is lower than a first distance threshold value, and the distance between the position where the corresponding binary content appears for the first time in each binary software of the second software set and the file head is lower than a second distance threshold value;
the number of binary contents in the second software set is higher than a second preset number threshold, the variance of the positions of the corresponding binary contents in each binary software of the second software set is lower than a third preset variance threshold, the variance of the positions of the corresponding binary contents appearing in each binary software of the second software set is smaller than a fourth preset variance threshold, and the entropy of the positions of the corresponding binary contents appearing in each binary software of the second software set is smaller than a third preset entropy threshold;
the number of binary contents in the first software set is lower than a third preset number threshold, the variance of the positions of the corresponding binary contents appearing in each binary software of the first software set is greater than a fifth preset variance threshold, and the entropy of the positions of the corresponding binary contents appearing in each binary software of the first software set is greater than a fourth preset entropy threshold;
the number of printable characters of the binary content is greater than a fourth preset number threshold.
Here, the importance ranking rule (including content, ranking, etc.) may be adjusted according to actual situations, and this is not limited in the embodiment of the present application.
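One way to apply an ordered list of rules such as the one above is a lexicographic sort: a candidate that satisfies a higher-priority rule always ranks ahead of one that does not, and lower-priority rules only break ties. The application does not state how the rules are combined, so this combination, the attribute keys, and the two example rules in the usage comment are assumptions.

```python
from typing import Callable, Dict, List, Sequence

Attributes = Dict[str, float]
Rule = Callable[[Attributes], bool]


def rank_by_importance(
    attributes: Dict[bytes, Attributes],
    rules: Sequence[Rule],
) -> List[bytes]:
    """Sort candidate contents from most to least important.

    rules is ordered from highest to lowest priority. Each candidate is
    mapped to a tuple of booleans (one per rule); tuples compare
    lexicographically, so sorting them in descending order realizes the
    priority order. Ties keep their original relative order because
    Python's sort is stable.
    """
    def key(content: bytes):
        return tuple(rule(attributes[content]) for rule in rules)

    return sorted(attributes, key=key, reverse=True)


# Hypothetical usage with two simplified rules:
#   rules = [
#       lambda a: a["count_second_set"] > 5,   # frequent in the malicious set
#       lambda a: a["printable_chars"] > 4,    # many printable characters
#   ]
#   ranked = rank_by_importance(attributes, rules)
```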
S707: and obtaining at least one section of binary content from the candidate binary content as target binary content according to the second class statistical indexes of each section of binary content in the candidate binary content in the first software set and the second class statistical indexes in the second software set, wherein the second class statistical indexes comprise at least one of software coverage ratio and set similarity.
S708: and determining the ranking result of the target binary content according to the ranking result of the candidate binary content.
S709: and deleting, from the target binary content, the binary content whose rank falls after a preset rank.
S710: and obtaining a signature of corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
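Steps S708 and S709 can be sketched as below. The sketch assumes that the target binary content inherits its relative order from the ranked candidate list and that ranks are renumbered within the target list before the cut-off is applied; the application does not say which numbering is used, and the function and parameter names are hypothetical.

```python
from typing import Iterable, List


def keep_top_ranked_targets(
    ranked_candidates: List[bytes],
    targets: Iterable[bytes],
    preset_rank: int,
) -> List[bytes]:
    """Return at most preset_rank target contents, most important first."""
    if preset_rank < 0:
        raise ValueError("preset_rank must be non-negative")
    target_set = set(targets)
    # S708: the ranking of the targets follows the ranking of the candidates.
    ranked_targets = [c for c in ranked_candidates if c in target_set]
    # S709: delete everything ranked after the preset rank.
    return ranked_targets[:preset_rank]
```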
In this embodiment of the application, at least two pieces of binary content are extracted from the first software set and the second software set, and at least one piece of binary content is obtained from them as candidate binary content. Binary content whose information entropy value is higher than the preset entropy threshold is then deleted from the candidate binary content, which removes erroneous candidates and improves the accuracy of subsequent processing. Next, an importance ranking rule is determined according to the attribute information corresponding to each piece of binary content in the candidate binary content, and the candidate binary content is ranked from high importance to low importance according to that rule. The target binary content is obtained from the candidate binary content, its ranking result is determined according to the ranking result of the candidate binary content, and the binary content whose rank falls after the preset rank is deleted from the target binary content. The signature of the corresponding malicious binary software can then be generated from binary content of higher importance, so that the generated signature identifies the malicious binary software more accurately. At the same time, partial binary content of the malware is extracted automatically to serve as the signature for identifying the malware, which addresses the excessive workload of malware analysts and the low efficiency of signature extraction, and meets application requirements.
In addition, the signature extraction method provided by the embodiment of the application is not influenced by personal experience and subjective factors of an analyst, and the extraction accuracy of the signature of the malicious software is improved to a certain extent.
Fig. 8 is a schematic structural diagram of a file signature extraction apparatus provided in the present application, where the apparatus includes: an extraction module 801, a first obtaining module 802, a second obtaining module 803, and a third obtaining module 804.
The extracting module 801 is configured to extract at least two pieces of binary content from a first software set and a second software set, where the first software set includes a first number of non-malicious binary software, and the second software set includes a second number of malicious binary software.
A first obtaining module 802, configured to obtain at least one segment of binary content from the at least two segments of binary content as candidate binary content according to a first class statistical indicator of each of the at least two segments of binary content in a first software set and a first class statistical indicator of each of the at least two segments of binary content in a second software set, where the first class statistical indicator includes at least one of an occurrence frequency and an occurrence probability.
A second obtaining module 803, configured to obtain at least one piece of binary content from the candidate binary content as a target binary content according to a second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, respectively, where the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity.
A third obtaining module 804, configured to obtain a signature of the corresponding malicious binary software according to the target binary content, where the signature is used to identify the malicious binary software.
In one possible design, the extracting module 801 is specifically configured to:
and respectively extracting binary contents with preset lengths for each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
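The sliding-window extraction performed by the extraction module 801 can be sketched as follows. The step of one byte between successive windows and the use of a set to collect the extracted contents are assumptions; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Set


def sliding_windows(software: bytes, window_size: int, step: int = 1) -> Iterator[bytes]:
    """Yield every window_size-byte slice of software, advancing by step bytes."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    for start in range(0, len(software) - window_size + 1, step):
        yield software[start:start + window_size]


def extract_binary_contents(software_set: Iterable[bytes], window_size: int) -> Set[bytes]:
    """Collect the distinct fixed-length contents found in a software set."""
    contents: Set[bytes] = set()
    for software in software_set:
        contents.update(sliding_windows(software, window_size))
    return contents
```

A file shorter than the window yields no content, so very small binaries simply contribute nothing to the extracted set.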
In one possible design, when the first type of statistical indicator includes the occurrence frequency, the first obtaining module 802 is specifically configured to:
obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the appearance frequency of each binary content in the first binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set respectively;
and obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
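The three steps above amount to a two-sided threshold test on occurrence counts. The following sketch assumes the counts have already been collected into dictionaries keyed by binary content; the function name, the sample contents, and the threshold values are all illustrative.

```python
from collections import Counter


def select_by_frequency(benign_freq, malicious_freq, low_threshold, high_threshold):
    """Keep contents that are rare in the benign set and frequent in the malicious set."""
    if not low_threshold < high_threshold:
        raise ValueError("the first threshold must be smaller than the second")
    all_contents = benign_freq.keys() | malicious_freq.keys()
    # Step 1: contents whose frequency in the first (benign) set is below the first threshold.
    first = [c for c in all_contents if benign_freq.get(c, 0) < low_threshold]
    # Steps 2 and 3: look up their frequency in the second (malicious) set and keep the frequent ones.
    return sorted(c for c in first if malicious_freq.get(c, 0) > high_threshold)


benign_freq = Counter({b"\x00\x00": 50, b"MZ": 9})
malicious_freq = Counter({b"\x00\x00": 60, b"MZ": 8, b"\xde\xad": 12, b"\xbe\xef": 1})
candidates = select_by_frequency(benign_freq, malicious_freq, low_threshold=2, high_threshold=10)
print(candidates)
```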
In one possible design, when the first type of statistical indicator includes the occurrence probability, the first obtaining module 802 is specifically configured to:
obtaining the binary content with the occurrence probability lower than a first preset probability threshold in the first software set from the at least two sections of binary content as a second binary content;
obtaining the occurrence probability of each binary content in the second binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold is smaller than the second preset probability threshold.
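This passage does not define how the occurrence probability is computed. The sketch below assumes it is the share of a content's occurrences among all occurrences counted in the same software set; that definition, the function names, and the sample numbers are assumptions made for illustration.

```python
def occurrence_probability(freq):
    """Convert occurrence counts into shares of the total count (assumed definition)."""
    total = sum(freq.values())
    return {content: count / total for content, count in freq.items()} if total else {}


def select_by_probability(benign_prob, malicious_prob, low_threshold, high_threshold):
    """Keep contents unlikely in the benign set and likely in the malicious set."""
    all_contents = benign_prob.keys() | malicious_prob.keys()
    second = [c for c in all_contents if benign_prob.get(c, 0.0) < low_threshold]
    return sorted(c for c in second if malicious_prob.get(c, 0.0) > high_threshold)


benign_prob = occurrence_probability({b"a": 90, b"b": 10})
malicious_prob = occurrence_probability({b"a": 50, b"b": 10, b"c": 40})
candidates = select_by_probability(benign_prob, malicious_prob, 0.05, 0.3)
print(candidates)
```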
In one possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module 802 is specifically configured to:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in a first software set and the occurrence frequency of each section of binary content in a second software set, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each binary content in the first binary content to be processed in the first software set from the occurrence probability of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence probability of each binary content in the first binary content to be processed in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary contents in the first binary contents to be processed in the first software set and the occurrence probability of each binary contents in the second software set.
In one possible design, the first obtaining module 802 obtains at least one piece of binary content from the at least two pieces of binary content as the first binary content to be processed according to the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the first software set and the occurrence frequency of each piece of binary content in the second software set, and includes:
obtaining the binary content with the occurrence frequency lower than a third preset frequency threshold value in the first software set from the at least two sections of binary content, and using the binary content as a third binary content;
obtaining the appearance frequency of each binary content in the third binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set respectively;
and obtaining the binary content with the occurrence frequency higher than a fourth preset frequency threshold value in the second software set from the third binary content as the first binary content to be processed, wherein the third preset frequency threshold value is smaller than the fourth preset frequency threshold value.
In one possible design, the first obtaining module 802 obtains candidate binary contents from the first binary contents to be processed according to the occurrence probability of each binary content in the first binary contents to be processed in the first software set and the occurrence probability in the second software set, respectively, and includes:
obtaining the binary content with the occurrence probability lower than a third preset probability threshold in the first software set from the first binary content to be processed as a fourth binary content;
obtaining the occurrence probability of each binary content in the fourth binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence probability higher than a fourth preset probability threshold value in the second software set from the fourth binary contents as candidate binary contents, wherein the third preset probability threshold value is smaller than the fourth preset probability threshold value.
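When both indicators are used, the design above applies the same two-sided test twice: first on occurrence frequency, then on occurrence probability. A minimal sketch under that reading follows; `threshold_filter` is a hypothetical helper, and the statistics and thresholds are made-up values, not data from the application.

```python
def threshold_filter(contents, benign_stat, malicious_stat, low, high):
    """Keep contents whose benign statistic is below low and whose malicious statistic is above high."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    rare_in_benign = [c for c in contents if benign_stat.get(c, 0) < low]
    return [c for c in rare_in_benign if malicious_stat.get(c, 0) > high]


contents = [b"A", b"B", b"C"]
benign_freq = {b"A": 40, b"B": 1, b"C": 0}
malicious_freq = {b"A": 45, b"B": 30, b"C": 20}
benign_prob = {b"A": 0.40, b"B": 0.01, b"C": 0.0}
malicious_prob = {b"A": 0.45, b"B": 0.30, b"C": 0.20}

# Stage 1: frequency test (third and fourth preset frequency thresholds).
to_process = threshold_filter(contents, benign_freq, malicious_freq, low=2, high=10)
# Stage 2: probability test (third and fourth preset probability thresholds).
candidates = threshold_filter(to_process, benign_prob, malicious_prob, low=0.05, high=0.25)
print(to_process, candidates)
```

Swapping the two calls gives the probability-first ordering; only the thresholds and the statistic passed to each stage change.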
In one possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module 802 is specifically configured to:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability in the second software set, obtaining at least one binary content from the at least two binary contents as a second binary content to be processed;
obtaining the occurrence frequency of each binary content in the second binary content to be processed in the first software set from the occurrence frequency of each binary content in the at least two binary contents in the first software set, and obtaining the occurrence frequency of each binary content in the second binary content to be processed in the second software set from the occurrence frequency of each binary content in the at least two binary contents in the second software set;
and obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set.
In one possible design, the first obtaining module 802 obtains at least one piece of binary content from the at least two pieces of binary content as the second binary content to be processed according to the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set and the occurrence probability in the second software set, respectively, and includes:
obtaining the binary content with the occurrence probability lower than a fifth preset probability threshold in the first software set from the at least two sections of binary content as a fifth binary content;
obtaining the occurrence probability of each binary content in the fifth binary content in the second software set from the occurrence probability of each binary content in the at least two binary contents in the second software set;
and obtaining the binary content with the occurrence probability higher than a sixth preset probability threshold value in the second software set from the fifth binary content as the second binary content to be processed, wherein the fifth preset probability threshold value is smaller than the sixth preset probability threshold value.
In one possible design, the first obtaining module 802 obtains candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each binary content in the second binary contents to be processed in the first software set and the occurrence frequency of each binary content in the second software set, respectively, and includes:
obtaining the binary content with the occurrence frequency lower than a fifth preset frequency threshold value in the first software set from the second binary content to be processed as a sixth binary content;
obtaining the appearance frequency of each binary content in the sixth binary content in the second software set from the appearance frequency of each binary content in the at least two binary contents in the second software set;
and obtaining binary contents with the occurrence frequency higher than a sixth preset frequency threshold value in the second software set from the sixth binary contents as candidate binary contents, wherein the fifth preset frequency threshold value is smaller than the sixth preset frequency threshold value.
In one possible design, when the second type of statistical indicator includes the software coverage ratio, the second obtaining module 803 is specifically configured to:
obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents;
obtaining the software coverage ratio of each binary content in the seventh binary content in the second software set from the software coverage ratio of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
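The software coverage ratio is not defined in this passage. The sketch below assumes it is the fraction of binary software in a set that contains the content at least once; that definition, the function names, and the sample files are illustrative assumptions.

```python
def coverage_ratio(content: bytes, files: list) -> float:
    """Fraction of files that contain the content at least once (assumed definition)."""
    if not files:
        return 0.0
    return sum(content in data for data in files) / len(files)


def select_by_coverage(candidates, benign_files, malicious_files, low, high):
    """Keep candidates covering few benign files and many malicious files."""
    seventh = [c for c in candidates if coverage_ratio(c, benign_files) < low]
    return [c for c in seventh if coverage_ratio(c, malicious_files) > high]


benign_files = [b"hello world", b"goodbye"]
malicious_files = [b"xxEVILxx", b"EVIL!", b"clean"]
targets = select_by_coverage([b"EVIL", b"o"], benign_files, malicious_files, low=0.1, high=0.5)
print(targets)
```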
In one possible design, when the second type of statistical indicator includes the set similarity, the second obtaining module 803 is specifically configured to:
obtaining binary contents with the set similarity lower than a first preset similarity threshold in the first software set from the candidate binary contents as eighth binary contents;
acquiring the set similarity of each binary content in the eighth binary content in the second software set from the set similarity of each binary content in the candidate binary content in the second software set;
and obtaining the binary content with the set similarity higher than a second preset similarity threshold in the second software set from the eighth binary content as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
In one possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module 803 is specifically configured to:
obtaining at least one section of binary content from the candidate binary content as first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion of the candidate binary content in the second software set;
acquiring set similarity of each segment of binary content in the first characteristic binary content in the first software set from the set similarity of each segment of binary content in the candidate binary content in the first software set, and acquiring set similarity of each segment of binary content in the first characteristic binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the first characteristic binary content according to the set similarity of each section of binary content in the first characteristic binary content in the first software set and the set similarity of each section of binary content in the second software set.
In one possible design, the second obtaining module 803 obtains at least one piece of binary content from the candidate binary content as the first feature binary content according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio in the second software set, respectively, and includes:
obtaining binary contents with the software coverage proportion lower than a third preset proportion threshold value in the first software set from the candidate binary contents as ninth binary contents;
obtaining the software coverage proportion of each piece of binary content in the ninth binary content in the second software set from the software coverage proportion of each piece of binary content in the candidate binary content in the second software set;
obtaining, from the ninth binary content, a binary content in the second software set whose software coverage ratio is higher than a fourth preset ratio threshold as the first feature binary content, where the third preset ratio threshold is smaller than the fourth preset ratio threshold.
In one possible design, the second obtaining module 803 obtains the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and the set similarity of the second software set, respectively, including:
obtaining binary contents with the set similarity lower than a third preset similarity threshold in the first software set from the first characteristic binary contents as tenth binary contents;
obtaining the set similarity of each piece of binary content in the tenth binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
obtaining, from the tenth binary content, binary content whose set similarity in the second software set is higher than a fourth preset similarity threshold as the target binary content, where the third preset similarity threshold is smaller than the fourth preset similarity threshold.
In one possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module 803 is specifically configured to:
acquiring at least one section of binary content from the candidate binary content as second characteristic binary content according to the set similarity of each section of binary content in the candidate binary content in the first software set and the set similarity of each section of binary content in the second software set;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each piece of binary content in the second characteristic binary content in the first software set and the software coverage proportion of the second software set.
In one possible design, the second obtaining module 803 obtains at least one piece of binary content from the candidate binary content as the second feature binary content according to the set similarity of each piece of binary content in the candidate binary content in the first software set and the set similarity of each piece of binary content in the second software set, respectively, and includes:
obtaining binary contents with the set similarity lower than a fifth preset similarity threshold in the first software set from the candidate binary contents as eleventh binary contents;
obtaining the set similarity of each piece of binary content in the eleventh binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
obtaining, from the eleventh binary content, a binary content in the second software set whose set similarity is higher than a sixth preset similarity threshold as the second feature binary content, where the fifth preset similarity threshold is smaller than the sixth preset similarity threshold.
In one possible design, the second obtaining module 803 obtains the target binary content from the second characteristic binary content according to the software coverage ratio of each piece of binary content in the second characteristic binary content in the first software set and the software coverage ratio in the second software set, respectively, including:
obtaining binary contents with the software coverage proportion lower than a fifth preset proportion threshold value in the first software set from the second characteristic binary contents as twelfth binary contents;
obtaining the software coverage proportion of each piece of binary content in the twelfth binary content in the second software set from the software coverage proportion of each piece of binary content in the candidate binary content in the second software set;
obtaining, from the twelfth binary content, a binary content in the second software set, where the software coverage ratio is higher than a sixth preset ratio threshold, as the target binary content, where the fifth preset ratio threshold is smaller than the sixth preset ratio threshold.
In one possible design, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the second obtaining module 803 is further configured to:
determining information entropy values respectively corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each segment of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
In one possible design, if the candidate binary content includes at least two pieces of binary content, the first obtaining module 802 is further configured to perform the following after obtaining at least one piece of binary content from the at least two pieces of binary content as the candidate binary content:
determining an importance ordering rule according to the attribute information corresponding to each section of binary content in the candidate binary content;
according to the importance ranking rule, ranking at least two sections of binary contents in the candidate binary contents according to the importance degree from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the second obtaining module 803 is further configured to perform:
determining the sequencing result of the target binary content according to the sequencing result of the candidate binary content;
and deleting the binary contents with the sequencing sequence numbers after the preset sequence numbers from the target binary contents.
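Once the candidates are ranked, the target binary content inherits their order and is cut off after a preset rank. A minimal sketch, assuming the ranking is available as an ordered list and the preset sequence number is expressed as a count `keep` (both names are hypothetical):

```python
def truncate_targets(ranked_candidates, targets, keep):
    """Order targets by their candidate ranking and drop those ranked after position keep."""
    target_set = set(targets)
    # The targets are a subset of the candidates, so their order is inherited directly.
    ranked_targets = [c for c in ranked_candidates if c in target_set]
    return ranked_targets[:keep]


ranked_candidates = [b"C", b"A", b"D", b"B"]  # most important first
targets = {b"A", b"B", b"C"}
kept = truncate_targets(ranked_candidates, targets, keep=2)
print(kept)
```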
In one possible design, the attribute information includes at least one of the following: the number of the candidate binary content in the first software set; the number of the candidate binary content in the second software set; the occurrence probability of the candidate binary content in the first software set; the occurrence probability of the candidate binary content in the second software set; the position at which the candidate binary content first appears in each binary software of the first software set, and the mean, variance, and entropy of that position; the mean, variance, and entropy of the position at which the candidate binary content first appears in each binary software of the second software set; and the printable characters of the candidate binary content.
In one possible design, the importance ranking rule includes, in descending order of importance:
the number of the binary contents at the same position of the second software set is higher than a first preset number threshold;
the variance of the positions of binary contents appearing in each binary software of the first software set is smaller than a first preset variance threshold, the variance of the positions of corresponding binary contents appearing in each binary software of the second software set is smaller than a second preset variance threshold, the entropy of the positions of corresponding binary contents appearing in each binary software of the first software set is smaller than a first preset entropy threshold, the entropy of the positions of corresponding binary contents appearing in each binary software of the second software set is smaller than a second preset entropy threshold, and the ratio of the occurrence probability of corresponding binary contents in the first software set to the occurrence probability in the second software set is larger than a preset ratio threshold;
the distance between the position where the binary content appears for the first time in each binary software of the first software set and the file head is lower than a first distance threshold value, and the distance between the position where the corresponding binary content appears for the first time in each binary software of the second software set and the file head is lower than a second distance threshold value;
the number of binary contents in the second software set is higher than a second preset number threshold, the variance of the position of the corresponding binary content in each binary software of the second software set is lower than a third preset variance threshold, the variance of the position of the corresponding binary content appearing in each binary software of the second software set is smaller than a fourth preset variance threshold, and the entropy of the position of the corresponding binary content appearing in each binary software of the second software set is smaller than a third preset entropy threshold;
the number of binary contents in the first software set is lower than a third preset number threshold, the variance of the positions of the corresponding binary contents appearing in each binary software of the first software set is greater than a fifth preset variance threshold, and the entropy of the positions of the corresponding binary contents appearing in each binary software of the first software set is greater than a fourth preset entropy threshold;
the number of printable characters of the binary content is greater than a fourth preset number threshold.
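One way to read the ordered rules above is that a piece of binary content is ranked by the most important rule it satisfies. The sketch below follows that reading, which is an assumption rather than something the text states, and shows only two of the six rules with hypothetical attribute names and thresholds.

```python
def rank_by_rules(candidates, rules):
    """Sort candidates by the index of the first rule each satisfies (lower index ranks higher)."""
    def priority(attrs):
        for index, rule in enumerate(rules):
            if rule(attrs):
                return index
        return len(rules)  # satisfies no rule: ranked last
    return sorted(candidates, key=priority)  # sorted() is stable, so ties keep input order


# Hypothetical stand-ins for the first and the last rule in the list above.
rules = [
    lambda a: a["same_position_count_malicious"] > 5,
    lambda a: a["printable_chars"] > 8,
]
candidates = [
    {"name": "x", "same_position_count_malicious": 0, "printable_chars": 10},
    {"name": "y", "same_position_count_malicious": 0, "printable_chars": 2},
    {"name": "z", "same_position_count_malicious": 9, "printable_chars": 0},
]
ranked = [c["name"] for c in rank_by_rules(candidates, rules)]
print(ranked)
```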
In one possible design, the context characters include a preceding character and a following character, and the information entropy value includes a preceding-character entropy value and a following-character entropy value;
the second obtaining module 803 determines, according to the occurrence frequency, in the binary software of the second software set, of the context characters of each piece of binary content in the candidate binary content, the information entropy value corresponding to each piece of binary content in the candidate binary content, including:
determining the preceding-character entropy value corresponding to each piece of binary content in the candidate binary content according to the proportion of the occurrence frequency of each preceding character of that binary content in the binary software of the second software set to the occurrence frequency of all of its preceding characters;
and determining the following-character entropy value corresponding to each piece of binary content in the candidate binary content according to the proportion of the occurrence frequency of each following character of that binary content in the binary software of the second software set to the occurrence frequency of all of its following characters.
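The entropy computation can be sketched as follows, assuming Shannon entropy in bits over the bytes immediately before and after each occurrence of a candidate in the malicious files. The base-2 logarithm, the function names, and the choice to drop a candidate when either entropy exceeds the threshold are assumptions; the text states only that content with entropy above a preset threshold is deleted.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by the counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def context_entropies(content, files):
    """Return (preceding, following) entropy of the bytes adjacent to each occurrence."""
    preceding, following = Counter(), Counter()
    for data in files:
        start = data.find(content)
        while start != -1:
            if start > 0:
                preceding[data[start - 1]] += 1
            end = start + len(content)
            if end < len(data):
                following[data[end]] += 1
            start = data.find(content, start + 1)
    return shannon_entropy(preceding), shannon_entropy(following)


malicious_files = [b"aXYb", b"aXYc", b"aXYb", b"aXYc"]
before, after = context_entropies(b"XY", malicious_files)
threshold = 0.5  # illustrative preset entropy threshold
kept = [c for c in [b"XY", b"aX"]
        if max(context_entropies(c, malicious_files)) <= threshold]
print(before, after, kept)
```

In the sample, `XY` is always preceded by the same byte but followed by two different bytes, so it is removed, while `aX` has a fixed following byte and no preceding byte and is kept.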
The apparatus of this embodiment may be correspondingly used to implement the technical solutions in the embodiments shown in the foregoing methods, and the implementation principles, implementation details, and technical effects thereof are similar and will not be described herein again.
Optionally, Fig. 9 schematically shows one possible basic hardware architecture for the computing device described herein.
Referring to fig. 9, computing device 900 includes a processor 901, memory 902, a communication interface 903, and a bus 904.
The computing device 900 may be a computer or a server, which is not particularly limited in this application. In the computing device 900, the number of processors 901 may be one or more, and Fig. 9 illustrates only one processor 901. Optionally, the processor 901 may be a central processing unit (CPU). If the computing device 900 has multiple processors 901, the processors 901 may be of the same type or of different types. Optionally, the processors 901 of the computing device 900 may also be integrated into a multi-core processor.
Memory 902 stores computer instructions and data; the memory 902 may store the computer instructions and data necessary to implement the above-described file signature extraction method provided herein, e.g., the memory 902 stores instructions for implementing the steps of the above-described file signature extraction method. Memory 902 may be any one or any combination of the following storage media: nonvolatile memory (e.g., read-only memory (ROM), solid-state disk (SSD), hard disk drive (HDD), optical disk) and volatile memory.
The communication interface 903 may be any one or any combination of the following devices: a network interface (e.g., an ethernet interface), a wireless network card, etc. having a network access function.
The communication interface 903 is used for the computing device 900 to communicate data with other computing devices or terminals.
Fig. 9 shows the bus 904 by a thick line. The bus 904 may connect the processor 901 with the memory 902 and the communication interface 903. Thus, via bus 904, processor 901 may access memory 902 and may also interact with other computing devices or terminals using communication interface 903.
In the present application, the computing device 900 executes computer instructions in the memory 902, so that the computing device 900 implements the above-mentioned file signature extraction method provided herein, or so that the computing device 900 deploys the above-mentioned file signature extraction apparatus.
The file signature extracting apparatus may be implemented by software as shown in fig. 9, or may be implemented by hardware as a hardware module or a circuit unit.
The present application provides a computer-readable storage medium comprising computer instructions that instruct a computing device to perform the above-mentioned file signature extraction method provided herein.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Claims (20)

1. A file signature extraction method is characterized by comprising the following steps:
extracting at least two pieces of binary content from a first software set and a second software set, wherein the first software set contains a first number of non-malicious binary software and the second software set contains a second number of malicious binary software;
obtaining at least one section of binary content from the at least two sections of binary content as candidate binary content according to a first type of statistical indicator of each section of binary content in the at least two sections of binary content in the first software set and the first type of statistical indicator in the second software set, wherein the first type of statistical indicator comprises at least one of occurrence frequency and occurrence probability;
obtaining at least one piece of binary content from the candidate binary content as target binary content according to a second class of statistical indexes of each piece of binary content in the candidate binary content in the first software set and the second class of statistical indexes in the second software set, wherein the second class of statistical indexes comprises at least one of software coverage ratio and set similarity;
and obtaining a signature of corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
2. The method of claim 1, wherein extracting at least two pieces of binary content from a first software set and a second software set comprises:
and respectively extracting binary contents with preset lengths for each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
3. The method according to claim 1, wherein when the first type of statistical indicator includes the frequency of occurrence, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each piece of binary content in the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set respectively comprises:
obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the occurrence frequency of each piece of binary content in the first binary content in the second software set from the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the second software set respectively;
obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold in the second software set from the first binary contents as the candidate binary contents, wherein the first preset frequency threshold is smaller than the second preset frequency threshold.
4. The method according to claim 1, wherein when the first type of statistical indicator includes the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each piece of binary content in the first software set and the first type of statistical indicator in the second software set, respectively, comprises:
obtaining the binary content with the occurrence probability lower than a first preset probability threshold in the first software set from the at least two sections of binary content as a second binary content;
obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
obtaining the binary content with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary content as the candidate binary content, wherein the first preset probability threshold is smaller than the second preset probability threshold.
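The probability-based variant of claim 4 might look like the sketch below. The claims in this section do not define how the occurrence probability is computed, so the code assumes it is the count of a piece of content divided by the total number of extracted pieces in the same software set; that definition and all names are assumptions.

```python
from typing import Dict, List, Mapping


def occurrence_probabilities(counts: Mapping[bytes, int]) -> Dict[bytes, float]:
    """Normalize counts to probabilities (assumed definition: count / total count)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {content: n / total for content, n in counts.items()}


def select_by_probability(
    benign_counts: Mapping[bytes, int],
    malicious_counts: Mapping[bytes, int],
    benign_max_p: float,
    malicious_min_p: float,
) -> List[bytes]:
    """Keep content that is improbable in the benign set and probable in the malicious set."""
    if not benign_max_p < malicious_min_p:
        raise ValueError("first threshold must be smaller than second threshold")
    p_benign = occurrence_probabilities(benign_counts)
    p_malicious = occurrence_probabilities(malicious_counts)
    # "Second binary content": probability in the first (benign) set below threshold.
    second = [
        c for c in set(p_benign) | set(p_malicious)
        if p_benign.get(c, 0.0) < benign_max_p
    ]
    # Candidate binary content: probability in the malicious set above threshold.
    return sorted(c for c in second if p_malicious.get(c, 0.0) > malicious_min_p)
```

Unlike raw counts, probabilities remain comparable when the two software sets differ greatly in size, which is one plausible reason for offering both indicators.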
5. The method according to claim 1, wherein when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each of the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set comprises:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in the first software set and the occurrence frequency in the second software set respectively, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
obtaining the candidate binary content from the first binary content to be processed according to the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set and its occurrence probability in the second software set.
6. The method according to claim 1, wherein when the second type of statistical indicator includes the software coverage ratio, the obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set respectively comprises:
obtaining, from the candidate binary content, binary content whose software coverage ratio in the first software set is lower than a first preset ratio threshold, as seventh binary content;
obtaining the software coverage ratio of each piece of binary content in the seventh binary content in the second software set from the software coverage ratio of each piece of binary content in the candidate binary content in the second software set;
obtaining the binary content with the software coverage ratio higher than a second preset ratio threshold in the second software set from the seventh binary content as the target binary content, wherein the first preset ratio threshold is smaller than the second preset ratio threshold.
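The coverage-based selection of claim 6 can be sketched as follows. The sketch assumes that the software coverage ratio of a piece of content in a set is the fraction of binaries in that set containing it at least once; this definition is not given in this section, and the function names are hypothetical.

```python
from typing import List, Sequence


def coverage_ratio(content: bytes, samples: Sequence[bytes]) -> float:
    """Fraction of samples that contain `content` at least once (assumed definition)."""
    if not samples:
        return 0.0
    return sum(1 for sample in samples if content in sample) / len(samples)


def select_by_coverage(
    candidates: Sequence[bytes],
    benign: Sequence[bytes],
    malicious: Sequence[bytes],
    benign_max: float,
    malicious_min: float,
) -> List[bytes]:
    """Keep candidates covering few benign binaries and many malicious binaries."""
    if not benign_max < malicious_min:
        raise ValueError("first threshold must be smaller than second threshold")
    # "Seventh binary content": coverage in the first (benign) set below threshold.
    seventh = [c for c in candidates if coverage_ratio(c, benign) < benign_max]
    # Target binary content: coverage in the malicious set above threshold.
    return [c for c in seventh if coverage_ratio(c, malicious) > malicious_min]
```

Coverage counts each binary once, so a pattern repeated many times inside a single file cannot dominate the result the way it can with raw occurrence counts.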
7. The method according to claim 1, wherein when the second type of statistical indicator includes the set similarity, the obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set respectively comprises:
obtaining binary contents with the set similarity lower than a first preset similarity threshold in the first software set from the candidate binary contents as eighth binary contents;
obtaining the set similarity of each piece of binary content in the eighth binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
obtaining, from the eighth binary content, binary content whose set similarity in the second software set is higher than a second preset similarity threshold, as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
8. The method according to claim 1, wherein when the second type of statistical indicator includes the software coverage ratio and the set similarity, the obtaining at least one piece of binary content from the candidate binary content as the target binary content according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set respectively comprises:
obtaining at least one piece of binary content from the candidate binary content as first feature binary content according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio in the second software set;
obtaining the set similarity of each piece of binary content in the first feature binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and obtaining the set similarity of each piece of binary content in the first feature binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and the set similarity in the second software set.
9. The method according to any one of claims 1 to 8, wherein before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
determining information entropy values respectively corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each segment of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
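The entropy filter of claim 9 can be illustrated with the sketch below. It makes two assumptions that this section does not confirm: the "contextual character" is taken to be the single byte immediately following each occurrence of the content, and the information entropy is the Shannon entropy (in bits) of those following bytes. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def context_entropy(content: bytes, samples: Sequence[bytes]) -> float:
    """Shannon entropy (bits) of the byte that follows each occurrence of `content`."""
    following: Counter = Counter()
    for sample in samples:
        start = sample.find(content)
        while start != -1:
            nxt = start + len(content)
            if nxt < len(sample):
                following[sample[nxt]] += 1
            # Continue one byte later so overlapping occurrences are also counted.
            start = sample.find(content, start + 1)
    total = sum(following.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in following.values())


def drop_high_entropy(
    candidates: Sequence[bytes], malicious: Sequence[bytes], max_entropy: float
) -> List[bytes]:
    """Delete candidates whose context entropy exceeds the preset threshold."""
    return [c for c in candidates if context_entropy(c, malicious) <= max_entropy]
```

Under these assumptions, a low entropy means the content tends to appear in the same surroundings across the malicious binaries, while a high entropy suggests it appears in unrelated contexts.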
10. The method according to any one of claims 1 to 9, wherein if the candidate binary content includes at least two pieces of binary content, after obtaining at least one piece of binary content from the at least two pieces of binary content as the candidate binary content, the method further comprises:
determining an importance ranking rule according to attribute information corresponding to each section of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: a number of the candidate binary content in the first software set, a number of the candidate binary content in the second software set, a probability of occurrence of the candidate binary content in the first software set, a probability of occurrence of the candidate binary content in the second software set, a location of the candidate binary content at a first occurrence in each binary software of the first software set and a mean, variance and entropy of the location, a location of the candidate binary content at a first occurrence in each binary software of the second software set and a mean, variance and entropy of the location, and printable characters of the candidate binary content;
ranking the at least two pieces of binary content in the candidate binary content from high to low according to the importance ranking rule;
after the obtaining at least one piece of binary content from the candidate binary content as the target binary content, further comprising:
determining the ranking of the target binary content according to the ranking of the candidate binary content;
and deleting, from the target binary content, the binary content whose rank falls after a preset rank.
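The ranking and truncation of claim 10 can be sketched as below. The claim leaves the importance ranking rule open, so the sketch accepts any scoring callable; the names `rank_candidates` and `truncate_targets` and the `keep` parameter are hypothetical, and the preset rank is interpreted here as a position within the ordered target content.

```python
from typing import Callable, Dict, Iterable, List


def rank_candidates(
    candidates: Iterable[bytes], score: Callable[[bytes], float]
) -> Dict[bytes, int]:
    """Rank candidates from high to low importance; rank 1 is the most important."""
    ordered = sorted(candidates, key=score, reverse=True)
    return {content: rank for rank, content in enumerate(ordered, start=1)}


def truncate_targets(
    targets: Iterable[bytes], ranks: Dict[bytes, int], keep: int
) -> List[bytes]:
    """Order targets by their candidate rank and delete everything after position `keep`."""
    ordered = sorted(targets, key=lambda content: ranks[content])
    return ordered[:keep]
```

In practice the `score` callable could combine several of the listed attributes, for example the occurrence probability in each set or the variance of the first-occurrence position, but the weighting is not specified in this section.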
11. A file signature extraction device, comprising:
an extraction module, configured to extract at least two pieces of binary content from a first software set and a second software set, where the first software set includes a first number of pieces of non-malicious binary software and the second software set includes a second number of pieces of malicious binary software;
a first obtaining module, configured to obtain at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical indicator of each of the at least two pieces of binary content in the first software set and the first type of statistical indicator in the second software set, where the first type of statistical indicator includes at least one of an occurrence frequency and an occurrence probability;
a second obtaining module, configured to obtain at least one piece of binary content from the candidate binary content as a target binary content according to a second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, where the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity;
and a third obtaining module, configured to obtain a signature of corresponding malicious binary software according to the target binary content, where the signature is used for identifying the malicious binary software.
12. The apparatus according to claim 11, wherein the extraction module is specifically configured to:
extracting, with a sliding window, binary content of a preset length from each binary software in the first software set and the second software set, wherein the preset length is the number of bytes covered by the sliding window.
13. The apparatus according to claim 11, wherein when the first type of statistical indicator includes the frequency of occurrence, the first obtaining module is specifically configured to:
obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the occurrence frequency of each piece of binary content in the first binary content in the second software set from the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the second software set respectively;
obtaining binary contents with the occurrence frequency higher than a second preset frequency threshold in the second software set from the first binary contents as the candidate binary contents, wherein the first preset frequency threshold is smaller than the second preset frequency threshold.
14. The apparatus according to claim 11, wherein when the first type of statistical indicator includes the occurrence probability, the first obtaining module is specifically configured to:
obtaining the binary content with the occurrence probability lower than a first preset probability threshold in the first software set from the at least two sections of binary content as a second binary content;
obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
obtaining the binary content with the occurrence probability higher than a second preset probability threshold in the second software set from the second binary content as the candidate binary content, wherein the first preset probability threshold is smaller than the second preset probability threshold.
15. The apparatus according to claim 11, wherein when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module is specifically configured to:
obtaining at least one section of binary content from the at least two sections of binary content according to the occurrence frequency of each section of binary content in the at least two sections of binary content in the first software set and the occurrence frequency in the second software set respectively, and using the at least one section of binary content as a first binary content to be processed;
obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
obtaining the candidate binary content from the first binary content to be processed according to the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set and its occurrence probability in the second software set.
16. The apparatus according to claim 11, wherein when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module is specifically configured to:
obtaining at least one piece of binary content from the candidate binary content as first feature binary content according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio in the second software set;
obtaining the set similarity of each piece of binary content in the first feature binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and obtaining the set similarity of each piece of binary content in the first feature binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and the set similarity in the second software set.
17. The apparatus according to any one of claims 11 to 16, wherein the second obtaining module is further configured to perform, before obtaining at least one piece of binary content from the candidate binary content as the target binary content:
determining information entropy values respectively corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency of the contextual characters of each segment of binary content in the binary software of the second software set;
and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
18. The apparatus according to any one of claims 11 to 17, wherein if the candidate binary content includes at least two pieces of binary content, the first obtaining module is further configured to perform, after obtaining at least one piece of binary content from the at least two pieces of binary content as the candidate binary content:
determining an importance ranking rule according to attribute information corresponding to each section of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: a number of the candidate binary content in the first software set, a number of the candidate binary content in the second software set, a probability of occurrence of the candidate binary content in the first software set, a probability of occurrence of the candidate binary content in the second software set, a location of the candidate binary content at a first occurrence in each binary software of the first software set and a mean, variance and entropy of the location, a location of the candidate binary content at a first occurrence in each binary software of the second software set and a mean, variance and entropy of the location, and printable characters of the candidate binary content;
ranking the at least two pieces of binary content in the candidate binary content from high to low according to the importance ranking rule;
after the obtaining at least one piece of binary content from the candidate binary content as the target binary content, further comprising:
determining the ranking of the target binary content according to the ranking of the candidate binary content;
and deleting, from the target binary content, the binary content whose rank falls after a preset rank.
19. A computing device, comprising:
a processor and a memory;
the memory being configured to store computer instructions;
the processor being configured to execute the computer instructions stored in the memory, to cause the computing device to perform the method of any one of claims 1 to 10.
20. A computer program product, characterized in that it comprises computer instructions for instructing a computing device to perform the method of any of claims 1 to 10.
CN201911295341.5A 2019-12-16 2019-12-16 File signature extraction method and device Pending CN112989432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911295341.5A CN112989432A (en) 2019-12-16 2019-12-16 File signature extraction method and device

Publications (1)

Publication Number Publication Date
CN112989432A true CN112989432A (en) 2021-06-18

Family

ID=76343383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911295341.5A Pending CN112989432A (en) 2019-12-16 2019-12-16 File signature extraction method and device

Country Status (1)

Country Link
CN (1) CN112989432A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376262A (en) * 2014-12-08 2015-02-25 中国科学院深圳先进技术研究院 Android malware detecting method based on Dalvik command and authority combination
CN107145780A (en) * 2017-03-31 2017-09-08 腾讯科技(深圳)有限公司 Malware detection method and device
CN107222511A (en) * 2017-07-25 2017-09-29 深信服科技股份有限公司 Detection method and device, computer installation and the readable storage medium storing program for executing of Malware

Similar Documents

Publication Publication Date Title
CN110099059B (en) Domain name identification method and device and storage medium
CN106919555B (en) System and method for field extraction of data contained within a log stream
US11470097B2 (en) Profile generation device, attack detection device, profile generation method, and profile generation computer program
US10547618B2 (en) Method and apparatus for setting access privilege, server and storage medium
CN110383278A (en) The system and method for calculating event for detecting malice
US10454967B1 (en) Clustering computer security attacks by threat actor based on attack features
CN107408115B (en) Web site filter, method and medium for controlling access to content
US11799863B2 (en) Creation device, creation system, creation method, and creation program
US11689547B2 (en) Information analysis system, information analysis method, and recording medium
US11270001B2 (en) Classification apparatus, classification method, and classification program
US20160321254A1 (en) Unsolicited bulk email detection using url tree hashes
CN106030527B (en) By the system and method for application notification user available for download
CN114969840A (en) Data leakage prevention method and device
US11423099B2 (en) Classification apparatus, classification method, and classification program
CN112926647B (en) Model training method, domain name detection method and domain name detection device
CN111368128B (en) Target picture identification method, device and computer readable storage medium
CN116738369A (en) Traffic data classification method, device, equipment and storage medium
JP7031438B2 (en) Information processing equipment, control methods, and programs
CN112989432A (en) File signature extraction method and device
CN115589339A (en) Network attack type identification method, device, equipment and storage medium
CN113452700B (en) Method, device, equipment and storage medium for processing safety information
CN115225328A (en) Page access data processing method and device, electronic equipment and storage medium
US11662927B2 (en) Redirecting access requests between access engines of respective disk management devices
CN110197066B (en) Virtual machine monitoring method and system in cloud computing environment
JP7140268B2 (en) WARNING DEVICE, CONTROL METHOD AND PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination