CN112989432B - File signature extraction method and device - Google Patents


Info

Publication number
CN112989432B
Authority
CN
China
Prior art keywords
binary
software
content
binary content
contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911295341.5A
Other languages
Chinese (zh)
Other versions
CN112989432A (en)
Inventor
鞠全永
朱晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201911295341.5A
Publication of CN112989432A
Application granted
Publication of CN112989432B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6209 Protecting access to data via a platform, e.g. using keys or access control rules to a single file or object, e.g. in a secure envelope, encrypted and accessed using a key, or with access control rules appended to the object itself

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The application provides a file signature extraction method and device. The method comprises the following steps: at least two pieces of binary content are extracted from a non-malicious binary software set and a malicious binary software set; at least one piece of binary content is then selected from them as candidate binary content, according to first-type statistical indexes of the at least two pieces of binary content in each of the two sets; target binary content is selected from the candidate binary content according to second-type statistical indexes of the candidate binary content in each of the two sets; and a signature of the malicious binary software is obtained from the target binary content. Part of the binary content of the malicious software is thus extracted automatically as a signature for identifying the malicious software. This addresses the excessive workload of malware analysts and the low efficiency of signature extraction, avoids the influence of analysts' personal experience and subjective factors, and improves the accuracy of malware signature extraction.

Description

File signature extraction method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for extracting a file signature.
Background
With the rapid development of the internet, network security problems have become increasingly prominent. Malicious software (malware), represented by Trojan horses, viruses, backdoor programs, adware and the like, has grown rapidly in quantity, update speed and technical sophistication, and the impact on and losses to internet users increase year by year.
To address this, malware is generally identified based on signatures. Typically, a malware analyst studies a piece of malware and extracts content such as character strings or assembly instructions from it as a signature for identifying that malware.
However, the amount of malware grows very rapidly. The amount of malware that analysts can analyze is several orders of magnitude smaller than the amount that requires manual reverse engineering and signature identification, so analysis is far slower than malware growth. The analysts' workload is excessive, signature extraction is inefficient, and demand cannot be met. In addition, manual analysis and signature extraction depend heavily on the analyst's personal experience and concentration, which leads to low accuracy.
Disclosure of Invention
The application provides a file signature extraction method and a file signature extraction device, which are used for solving the problem of lower efficiency when a manual analysis method is adopted to extract a malicious software signature.
In a first aspect, an embodiment of the present application provides a method for extracting a file signature, which may be performed by an analysis device. The method includes the following steps. First, at least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software. Here, the malicious software includes Trojan horses, viruses, backdoor programs, adware and the like, and the non-malicious binary software may also be referred to as normal binary software. The first number and the second number may be set according to the actual situation, and the embodiment of the present application does not specifically limit them. Second, at least one piece of binary content is obtained from the at least two pieces of binary content as candidate binary content, according to a first type of statistical index of each piece of binary content in the first software set and in the second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability. Third, at least one piece of binary content is obtained from the candidate binary content as target binary content, according to a second type of statistical index of each piece of candidate binary content in the first software set and in the second software set, wherein the second type of statistical index comprises at least one of software coverage proportion and set similarity. Finally, a signature of the corresponding malicious binary software is obtained according to the target binary content, wherein the signature is used for identifying the malicious binary software.
According to the embodiment of the application, at least two pieces of binary content are extracted from the first software set and the second software set; candidate binary content is obtained from them according to their first type of statistical indexes in the first software set and in the second software set; target binary content is obtained from the candidate binary content according to its second type of statistical indexes in the two sets; and the signature of the corresponding malicious software is obtained from the target binary content. Because no manual analysis is required, this addresses the excessive workload of malware analysts and the low efficiency of signature extraction, improves the efficiency of malware signature extraction, and meets application requirements. In addition, because the signature extraction method provided by the embodiment of the application is not affected by analysts' personal experience or subjective factors, the accuracy of malware signature extraction is improved to a certain extent.
In one possible design, extracting at least two pieces of binary content from the first software set and the second software set includes:
Extracting binary content of a preset length from each binary software in the first software set and the second software set by means of a sliding window, wherein the preset length is the number of bytes covered by the sliding window.
For example, a window of k bytes may be provided, sliding in a direction from left to right along the binary file (i.e., sliding from a low address to a high address in the memory space occupied by the binary file), with each sliding being shifted by one byte, where k is a natural number greater than 1. For each binary file in the first software set and the second software set, the window is slid from left to right, one unit is slid each time, and binary content with the size of k is extracted.
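The sliding-window extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `sliding_windows` and `count_windows` are hypothetical, and each binary file is assumed to be already loaded into memory as a `bytes` object.

```python
from collections import Counter
from typing import Iterator


def sliding_windows(data: bytes, k: int) -> Iterator[bytes]:
    """Yield every k-byte slice of data, moving one byte per step (low to high address)."""
    if k < 2:
        raise ValueError("k must be a natural number greater than 1")
    for offset in range(len(data) - k + 1):
        yield data[offset:offset + k]


def count_windows(files: list[bytes], k: int) -> Counter:
    """Count how often each k-byte content occurs across all files of one software set."""
    counts: Counter = Counter()
    for data in files:
        counts.update(sliding_windows(data, k))
    return counts
```

Running `count_windows` once on the first software set and once on the second gives per-set counts that later filtering steps can compare. A file shorter than k bytes simply contributes no windows.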
In one possible design, when the first type of statistical indicator includes the occurrence frequency, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to the first type of statistical indicator of each piece of binary content in the first software set and in the second software set, includes:
Obtaining, from the at least two pieces of binary content, the binary content whose occurrence frequency in the first software set is lower than a first preset frequency threshold, as first binary content;
Obtaining the occurrence frequency of each piece of the first binary content in the second software set from the occurrence frequencies of the at least two pieces of binary content in the second software set;
And obtaining, from the first binary content, the binary content whose occurrence frequency in the second software set is higher than a second preset frequency threshold, as the candidate binary content, wherein the first preset frequency threshold is smaller than the second preset frequency threshold.
Here, if the first type of statistical indicator includes the occurrence frequency, then according to the occurrence frequency of the at least two pieces of binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set, the binary content that occurs with low frequency in the non-malicious software and with high frequency in the malicious software is selected from the at least two pieces of binary content as the candidate binary content.
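The two-threshold frequency filter can be sketched as below. This is a hedged illustration: the text does not define how occurrence frequency is measured, so it is assumed here to be a raw occurrence count per software set, supplied as a dictionary. The name `select_by_frequency` and the threshold values used in the example are hypothetical.

```python
def select_by_frequency(
    benign_freq: dict[bytes, int],
    malicious_freq: dict[bytes, int],
    first_threshold: int,
    second_threshold: int,
) -> set[bytes]:
    """Keep content that is rare in the benign set and frequent in the malicious set."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    all_contents = benign_freq.keys() | malicious_freq.keys()
    # Step 1: "first binary content" = frequency in the benign set below the first threshold.
    first_binary = {c for c in all_contents if benign_freq.get(c, 0) < first_threshold}
    # Steps 2-3: look up malicious-set frequency and keep those above the second threshold.
    return {c for c in first_binary if malicious_freq.get(c, 0) > second_threshold}


benign = {b"MZ": 50, b"ev": 1}
malicious = {b"MZ": 60, b"ev": 40, b"xx": 2}
candidates = select_by_frequency(benign, malicious, first_threshold=5, second_threshold=20)
```

In this example `b"MZ"` is dropped because it is common in benign software, and `b"xx"` is dropped because it is rare even in malicious software.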
In one possible design, when the first type of statistical indicator includes the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to the first type of statistical indicator of each piece of binary content in the first software set and in the second software set, includes:
Obtaining, from the at least two pieces of binary content, the binary content whose occurrence probability in the first software set is lower than a first preset probability threshold, as second binary content;
Obtaining the occurrence probability of each piece of the second binary content in the second software set from the occurrence probabilities of the at least two pieces of binary content in the second software set;
And obtaining, from the second binary content, the binary content whose occurrence probability in the second software set is higher than a second preset probability threshold, as the candidate binary content, wherein the first preset probability threshold is smaller than the second preset probability threshold.
Illustratively, if the first type of statistical indicator includes the occurrence probability, then according to the occurrence probability of the at least two pieces of binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set, the binary content that has a low occurrence probability in the non-malicious software and a high occurrence probability in the malicious software is selected from the at least two pieces of binary content as the candidate binary content.
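A probability-based variant might look like the sketch below. The text does not say how occurrence probability is computed; the sketch assumes it is a content's occurrence count divided by the total number of extracted contents in the same software set, which makes sets of different sizes comparable. The names `occurrence_probability` and `select_by_probability` are hypothetical.

```python
def occurrence_probability(counts: dict[bytes, int]) -> dict[bytes, float]:
    """Assumed definition: count of a content divided by the total count in its set."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {content: n / total for content, n in counts.items()}


def select_by_probability(
    benign_counts: dict[bytes, int],
    malicious_counts: dict[bytes, int],
    first_threshold: float,
    second_threshold: float,
) -> set[bytes]:
    """Keep content that is improbable in benign software and probable in malware."""
    benign_p = occurrence_probability(benign_counts)
    malicious_p = occurrence_probability(malicious_counts)
    # Content absent from the malicious set has probability 0 there and can never
    # exceed a non-negative second threshold, so only malicious_p is iterated.
    second_binary = {c for c in malicious_p if benign_p.get(c, 0.0) < first_threshold}
    return {c for c in second_binary if malicious_p[c] > second_threshold}
```

For instance, with benign counts `{b"aa": 990, b"bb": 10}` and malicious counts `{b"aa": 50, b"bb": 50}`, thresholds of 0.05 and 0.3 keep only `b"bb"`, whose probability is 0.01 in the benign set and 0.5 in the malicious set.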
In one possible design, when the first type of statistical indicator includes both the occurrence frequency and the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to the first type of statistical indicator of each piece of binary content in the first software set and in the second software set, includes:
according to the occurrence frequency of each binary content in the at least two binary contents in the first software set and the occurrence frequency in the second software set, at least one binary content is obtained from the at least two binary contents and used as a first binary content to be processed;
Obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each segment of binary contents in the first software set and the occurrence probability in the second software set.
Here, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content according to the occurrence frequency first as a first binary content to be processed, and further, a candidate binary content may be obtained from the first binary content to be processed according to the occurrence probability.
Also, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content according to the occurrence probability, as second binary content to be processed, and further, candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Specifically, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, according to the first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set of each piece of binary content in the at least two pieces of binary content, at least one piece of binary content is obtained from the at least two pieces of binary content and is used as a candidate binary content, the method includes:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability of each binary content in the second software set, at least one binary content is obtained from the at least two binary contents and used as second binary content to be processed;
Obtaining the frequency of occurrence of each piece of binary content in the second to-be-processed binary content in the first software set from the frequency of occurrence of each piece of binary content in the first software set, respectively, and obtaining the frequency of occurrence of each piece of binary content in the second to-be-processed binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set;
And obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each segment of binary contents in the second binary contents to be processed in the first software set and the occurrence frequency in the second software set.
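The two orderings described above (frequency first, then probability, or the reverse) can be sketched as a small staged pipeline. This assumes each stage is a pure two-threshold test on precomputed per-set statistics, which the text suggests but does not state outright. Under that assumption both orderings keep the same contents, and the order only affects how much work the later stage has to do. The names `threshold_stage` and `run_stages` are hypothetical.

```python
from typing import Callable, Iterable

Stage = Callable[[bytes], bool]


def threshold_stage(benign: dict, malicious: dict, low: float, high: float) -> Stage:
    """Build a test: statistic below `low` in the benign set and above `high` in the malicious set."""
    return lambda c: benign.get(c, 0) < low and malicious.get(c, 0) > high


def run_stages(contents: Iterable[bytes], stages: list[Stage]) -> list[bytes]:
    """Apply the stages in order; each stage only sees what the previous one kept."""
    remaining = list(contents)
    for keep in stages:
        remaining = [c for c in remaining if keep(c)]
    return remaining


freq_benign = {b"a": 1, b"b": 1, b"c": 90}
freq_malicious = {b"a": 50, b"b": 40, b"c": 70}
prob_benign = {b"a": 0.01, b"b": 0.2, b"c": 0.9}
prob_malicious = {b"a": 0.5, b"b": 0.4, b"c": 0.7}

by_frequency = threshold_stage(freq_benign, freq_malicious, low=5, high=20)
by_probability = threshold_stage(prob_benign, prob_malicious, low=0.05, high=0.3)

frequency_first = run_stages([b"a", b"b", b"c"], [by_frequency, by_probability])
probability_first = run_stages([b"a", b"b", b"c"], [by_probability, by_frequency])
```

Here the frequency stage keeps `b"a"` and `b"b"`, the probability stage then removes `b"b"`, and running the stages in the opposite order arrives at the same single candidate.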
In one possible design, when the second type of statistical indicator includes the software coverage ratio, obtaining at least one piece of binary content from the candidate binary content as target binary content, according to the second type of statistical indicator of each piece of candidate binary content in the first software set and in the second software set, includes:
Obtaining, from the candidate binary content, the binary content whose software coverage ratio in the first software set is lower than a first preset ratio threshold, as seventh binary content;
Obtaining the software coverage ratio of each piece of the seventh binary content in the second software set from the software coverage ratios of the candidate binary content in the second software set;
And obtaining, from the seventh binary content, the binary content whose software coverage ratio in the second software set is higher than a second preset ratio threshold, as the target binary content, wherein the first preset ratio threshold is smaller than the second preset ratio threshold.
Here, if the second type of statistical indicator includes the software coverage ratio, then according to the software coverage ratio of the candidate binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set, the binary content that has a low software coverage ratio in the non-malicious software and a high software coverage ratio in the malicious software is selected from the candidate binary content as the target binary content.
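A sketch of the coverage-based selection follows. The software coverage ratio is not defined in this passage; the sketch assumes it is the fraction of binary files in a software set that contain the content at least once. The function names are hypothetical.

```python
def coverage_ratio(content: bytes, files: list[bytes]) -> float:
    """Assumed definition: fraction of files in the set that contain `content`."""
    if not files:
        return 0.0
    return sum(content in data for data in files) / len(files)


def select_by_coverage(
    candidates: list[bytes],
    benign_files: list[bytes],
    malicious_files: list[bytes],
    first_threshold: float,
    second_threshold: float,
) -> list[bytes]:
    """Keep candidates covering few benign files and many malicious files."""
    seventh = [c for c in candidates if coverage_ratio(c, benign_files) < first_threshold]
    return [c for c in seventh if coverage_ratio(c, malicious_files) > second_threshold]


benign_files = [b"hello world", b"good file"]
malicious_files = [b"xxEVILxx", b"EVIL!!", b"other"]
targets = select_by_coverage([b"EVIL", b"o"], benign_files, malicious_files, 0.1, 0.5)
```

Unlike an occurrence count, this measure treats a file that contains the content a thousand times the same as a file that contains it once, so a single unusual sample cannot dominate the statistic.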
In one possible design, when the second type of statistical indicator includes the set similarity, obtaining at least one piece of binary content from the candidate binary content as target binary content, according to the second type of statistical indicator of each piece of candidate binary content in the first software set and in the second software set, includes:
Obtaining, from the candidate binary content, the binary content whose set similarity in the first software set is lower than a first preset similarity threshold, as eighth binary content;
Obtaining the set similarity of each piece of the eighth binary content in the second software set from the set similarities of the candidate binary content in the second software set;
And obtaining, from the eighth binary content, the binary content whose set similarity in the second software set is higher than a second preset similarity threshold, as the target binary content, wherein the first preset similarity threshold is smaller than the second preset similarity threshold.
For example, if the second type of statistical indicator includes the set similarity, then according to the set similarity of the candidate binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set, the binary content that has a low set similarity in the non-malicious software and a high set similarity in the malicious software is selected from the candidate binary content as the target binary content.
In one possible design, when the second type of statistical indicator includes both the software coverage ratio and the set similarity, obtaining at least one piece of binary content from the candidate binary content as target binary content, according to the second type of statistical indicator of each piece of candidate binary content in the first software set and in the second software set, includes:
According to the software coverage proportion of each segment of binary content in the candidate binary content in the first software set and the software coverage proportion of the second software set, at least one segment of binary content is obtained from the candidate binary content and used as first characteristic binary content;
Acquiring the set similarity of each piece of binary content in the first characteristic binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and acquiring the set similarity of each piece of binary content in the first characteristic binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining target binary contents from the first characteristic binary contents according to the set similarity of each piece of binary contents in the first characteristic binary contents in the first software set and the set similarity in the second software set.
Here, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary contents according to the software coverage ratio as the first feature binary content, and further, the target binary content may be obtained from the first feature binary content according to the set similarity.
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content according to the set similarity as the second feature binary content, and further, the target binary content may be obtained from the second feature binary content according to the software coverage ratio.
Specifically, when the second type of statistical indicator includes the software coverage ratio and the set similarity, according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and the second type of statistical indicator in the second software set, at least one piece of binary content is obtained from the candidate binary content as a target binary content, including:
according to the set similarity of each segment of binary content in the candidate binary content in the first software set and the set similarity of the second software set, at least one segment of binary content is obtained from the candidate binary content and used as second characteristic binary content;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set and the software coverage proportion in the second software set respectively.
In addition, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
Determining an information entropy value for each piece of binary content in the candidate binary content according to the occurrence frequency of its context characters in the binary software of the second software set;
And deleting, from the candidate binary content, the binary content whose information entropy value is higher than a preset entropy threshold.
Here, a higher information entropy value means that the context of the content is more random, and a lower value means that the context is more consistent. If the information entropy value of a piece of candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the content appears in differing contexts across that malicious software, and it is deleted; content that is genuinely malicious is expected to appear in the same context throughout the malicious binary software of the second software set. The preset entropy threshold may be set according to the actual situation, and the present application does not specifically limit it.
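The entropy filter can be sketched as below. Several details are assumptions, since the text does not fix them: "context characters" are taken to be the bytes immediately before and after each occurrence (one byte on each side by default), and the entropy is Shannon entropy in bits over the distribution of those bytes. The names `context_entropy` and `drop_high_entropy` are hypothetical.

```python
import math
from collections import Counter


def context_entropy(content: bytes, files: list[bytes], width: int = 1) -> float:
    """Shannon entropy (bits) of the bytes within `width` bytes of each occurrence."""
    context: Counter = Counter()
    for data in files:
        start = data.find(content)
        while start != -1:
            end = start + len(content)
            context.update(data[max(0, start - width):start])  # bytes before
            context.update(data[end:end + width])              # bytes after
            start = data.find(content, start + 1)
    total = sum(context.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in context.values())


def drop_high_entropy(
    candidates: list[bytes], malicious_files: list[bytes], max_entropy: float
) -> list[bytes]:
    """Delete candidates whose context entropy in the malicious set exceeds the threshold."""
    return [c for c in candidates if context_entropy(c, malicious_files) <= max_entropy]
```

A candidate always surrounded by the same bytes scores 0 bits, while one surrounded by four different, equally likely bytes scores 2 bits and would be removed by a threshold of 1.5.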
In one possible design, if the candidate binary content includes at least two pieces of binary content, then after the candidate binary content is obtained from the at least two pieces of binary content, the method further includes:
Determining an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following components: the number of the candidate binary contents in the first software set, the number of the candidate binary contents in the second software set, the occurrence probability of the candidate binary contents in the first software set, the occurrence probability of the candidate binary contents in the second software set, the position of the first occurrence of the candidate binary contents in each binary software of the first software set, the mean, variance and entropy of the position of the first occurrence of the candidate binary contents in each binary software of the second software set, and the printable character of the candidate binary contents;
According to the importance ranking rule, ranking at least two segments of binary contents in the candidate binary contents according to the importance from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
determining the sorting result of the target binary content according to the sorting result of the candidate binary content;
And deleting the binary contents with the sequence numbers after the preset sequence numbers from the target binary contents.
Here, according to the attribute information of the candidate binary contents, an importance ranking rule is determined, then the candidate binary contents are ranked from high importance to low importance according to the importance ranking rule, and further, according to the ranking result of the candidate binary contents, the ranking result of the target binary contents is determined, and the binary contents with the ranking sequence numbers behind the preset sequence numbers are deleted from the target binary contents, so that signatures of corresponding malicious binary software can be generated according to binary contents with higher importance, and the generated signatures can identify the malicious binary software more accurately.
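The ranking and truncation step can be sketched as follows. The text lists attributes that may feed the importance ranking rule but does not give the rule itself, so the sort key here (more malicious occurrences first, fewer benign occurrences as tie-breaker) is purely illustrative. The sketch also assumes the preset sequence number refers to a position within the ordered target content. `Attributes`, `rank_candidates` and `keep_top` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributes:
    benign_count: int      # occurrences in the first (non-malicious) software set
    malicious_count: int   # occurrences in the second (malicious) software set


def rank_candidates(attrs: dict[bytes, Attributes]) -> list[bytes]:
    """Order candidates from most to least important under the illustrative rule."""
    return sorted(attrs, key=lambda c: (-attrs[c].malicious_count, attrs[c].benign_count))


def keep_top(targets: set[bytes], ranking: list[bytes], preset_rank: int) -> list[bytes]:
    """Targets inherit the candidate order; those ranked after preset_rank are deleted."""
    ordered_targets = [c for c in ranking if c in targets]
    return ordered_targets[:preset_rank]


attrs = {
    b"a": Attributes(benign_count=0, malicious_count=10),
    b"b": Attributes(benign_count=1, malicious_count=30),
    b"c": Attributes(benign_count=0, malicious_count=30),
}
ranking = rank_candidates(attrs)
kept = keep_top({b"a", b"b"}, ranking, preset_rank=1)
```

Because the ranking is computed once over the candidates, the later selection of target content does not need to recompute any attribute; it only filters and truncates the existing order.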
In a second aspect, an embodiment of the present application provides a file signature extraction apparatus, including:
The extraction module is used for extracting at least two pieces of binary content from a first software set and a second software set, wherein the first software set contains a first number of non-malicious binary software, and the second software set contains a second number of malicious binary software;
The first obtaining module is used for obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, according to a first type of statistical index of each piece of binary content in the first software set and in the second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability;
The second obtaining module is used for obtaining at least one piece of binary content from the candidate binary content as target binary content, according to a second type of statistical index of each piece of candidate binary content in the first software set and in the second software set, wherein the second type of statistical index comprises at least one of software coverage proportion and set similarity;
and the third obtaining module is used for obtaining the signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
In one possible design, the extraction module is specifically configured to:
Extract binary content of a preset length from each binary software in the first software set and the second software set by means of a sliding window, wherein the preset length is the number of bytes covered by the sliding window.
In one possible design, when the first type of statistical indicator includes the occurrence frequency, the first obtaining module is specifically configured to perform:
Obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in a first software set from the at least two sections of binary contents as first binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the first binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
A possible design, when the first type of statistical indicator includes the occurrence probability, the first obtaining module is specifically configured to:
Obtaining binary contents with occurrence probability lower than a first preset probability threshold value in a first software set from the at least two sections of binary contents as second binary contents;
Obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining the binary contents with occurrence probability higher than a second preset probability threshold value in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold value is smaller than the second preset probability threshold value.
A possible design, when the first type of statistical indicator includes the occurrence frequency and occurrence probability, the first obtaining module is specifically configured to:
according to the occurrence frequency of each binary content in the at least two binary contents in the first software set and the occurrence frequency in the second software set, at least one binary content is obtained from the at least two binary contents and used as a first binary content to be processed;
Obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each segment of binary contents in the first binary contents to be processed in the first software set and the occurrence probability in the second software set.
A possible design, when the first type of statistical indicator includes the occurrence frequency and occurrence probability, the first obtaining module is specifically configured to:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability of each binary content in the second software set, at least one binary content is obtained from the at least two binary contents and used as second binary content to be processed;
Obtaining the occurrence frequency of each piece of binary content in the second binary content to be processed in the first software set from the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence frequency of each piece of binary content in the second binary content to be processed in the second software set from the occurrence frequency of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each segment of binary contents in the second binary contents to be processed in the first software set and the occurrence frequency in the second software set.
A possible design, when the second type of statistical indicator includes the software coverage ratio, the second obtaining module is specifically configured to:
Obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents;
Obtaining the software coverage proportion of each segment of binary content in the seventh binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary contents as target binary contents, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
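The coverage-based selection above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes that the software coverage proportion of a piece of binary content in a set is the fraction of binary software in that set containing the content at least once, and the names `coverage`, `select_by_coverage`, `low` and `high` are hypothetical.

```python
def coverage(content: bytes, software_set: list) -> float:
    """Fraction of software in the set that contains the content (assumed definition)."""
    if not software_set:
        return 0.0
    return sum(content in data for data in software_set) / len(software_set)


def select_by_coverage(candidates, benign_set, malicious_set, low, high):
    """Keep candidates with low coverage in benign software and high coverage in malware."""
    if not low < high:
        raise ValueError("the first threshold must be smaller than the second")
    # Stage 1: coverage in the first (non-malicious) set below the first threshold.
    seventh = [c for c in candidates if coverage(c, benign_set) < low]
    # Stage 2: coverage in the second (malicious) set above the second threshold.
    return [c for c in seventh if coverage(c, malicious_set) > high]


benign = [b"hello world", b"foo"]
malicious = [b"xxevilxx", b"evil!", b"zz"]
targets = select_by_coverage([b"evil", b"o"], benign, malicious, low=0.1, high=0.5)
```

Here `b"o"` is rejected in the first stage because it occurs in every benign sample, while `b"evil"` passes both stages.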
A possible design, when the second type of statistical indicator includes the set similarity, the second obtaining module is specifically configured to:
obtaining binary contents with set similarity lower than a first preset similarity threshold value in a first software set from the candidate binary contents as eighth binary contents;
Acquiring the set similarity of each piece of binary content in the eighth binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
And obtaining binary contents with set similarity higher than a second preset similarity threshold value in the second software set from the eighth binary contents as target binary contents, wherein the first preset similarity threshold value is smaller than the second preset similarity threshold value.
A possible design, when the second type of statistical indicator includes the software coverage proportion and the set similarity, the second obtaining module is specifically configured to:
According to the software coverage proportion of each segment of binary content in the candidate binary content in the first software set and the software coverage proportion in the second software set, at least one segment of binary content is obtained from the candidate binary content and used as first characteristic binary content;
Acquiring the set similarity of each piece of binary content in the first characteristic binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and acquiring the set similarity of each piece of binary content in the first characteristic binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining target binary contents from the first characteristic binary contents according to the set similarity of each piece of binary contents in the first characteristic binary contents in the first software set and the set similarity of each piece of binary contents in the second software set.
A possible design, when the second type of statistical indicator includes the software coverage proportion and the set similarity, the second obtaining module is specifically configured to:
according to the set similarity of each segment of binary content in the candidate binary content in the first software set and the set similarity in the second software set, at least one segment of binary content is obtained from the candidate binary content and used as second characteristic binary content;
Obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set and the software coverage proportion in the second software set respectively.
One possible design is that the second obtaining module is further configured to, before obtaining at least one piece of binary content from the candidate binary content as the target binary content:
Determining the information entropy value corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency of the context characters of each segment in the binary software of the second software set;
And deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
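A hedged sketch of the entropy filter described above follows. The text does not define "context characters" at this point, so the sketch assumes they are the single byte that immediately follows each occurrence of the content in a malicious sample, and it uses the Shannon entropy of that byte's distribution. The names `context_entropy` and `drop_high_entropy` are hypothetical.

```python
import math
from collections import Counter


def context_entropy(content: bytes, samples: list) -> float:
    """Shannon entropy (bits) of the byte following each occurrence of content.

    Assumption: the 'context character' is the next byte after the content.
    """
    following = Counter()
    for data in samples:
        start = data.find(content)
        while start != -1:
            end = start + len(content)
            if end < len(data):
                following[data[end]] += 1
            start = data.find(content, start + 1)  # allow overlapping matches
    total = sum(following.values())
    if total == 0:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in following.values())


def drop_high_entropy(candidates, malicious_samples, threshold):
    """Delete candidates whose context entropy exceeds the preset threshold."""
    return [c for c in candidates
            if context_entropy(c, malicious_samples) <= threshold]


stable = context_entropy(b"ab", [b"abXabXabX"])
noisy = context_entropy(b"ab", [b"abXabY"])
kept = drop_high_entropy([b"ab"], [b"abXabXabX"], threshold=0.5)
```

Under these assumptions, content that is always followed by the same byte has entropy 0 and is kept, while content followed by many different bytes is removed.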
One possible design is that if the candidate binary content contains at least two pieces of binary content, the first obtaining module is further configured to, after obtaining at least two pieces of binary content from the binary content as the candidate binary content:
Determining an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following components: the number of the candidate binary contents in the first software set, the number of the candidate binary contents in the second software set, the occurrence probability of the candidate binary contents in the first software set, the occurrence probability of the candidate binary contents in the second software set, the position of the first occurrence of the candidate binary contents in each binary software of the first software set, the mean, variance and entropy of the position of the first occurrence of the candidate binary contents in each binary software of the second software set, and the printable character of the candidate binary contents;
According to the importance ranking rule, ranking at least two segments of binary contents in the candidate binary contents according to the importance from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
Determining the sorting result of the target binary content according to the sorting result of the candidate binary content;
And deleting the binary contents with the sequence numbers after the preset sequence numbers from the target binary contents.
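The ranking and truncation steps above might look like the following sketch. The importance ranking rule is left open by the text, so the rule used here (more occurrences in malicious software first, then fewer occurrences in non-malicious software) is only an assumed example, and the attribute keys `malicious_count` and `benign_count` as well as both function names are hypothetical.

```python
def rank_by_importance(attrs: dict) -> list:
    """Order candidates from most to least important under an assumed rule."""
    return sorted(
        attrs,
        key=lambda c: (-attrs[c]["malicious_count"], attrs[c]["benign_count"], c),
    )


def keep_top_targets(ranked_candidates, targets, preset_rank):
    """Rank targets by their candidate order and drop those after preset_rank."""
    target_set = set(targets)
    ranked_targets = [c for c in ranked_candidates if c in target_set]
    return ranked_targets[:preset_rank]


attrs = {
    b"evil": {"malicious_count": 40, "benign_count": 0},
    b"pack": {"malicious_count": 40, "benign_count": 3},
    b"misc": {"malicious_count": 12, "benign_count": 1},
}
ranked = rank_by_importance(attrs)
top = keep_top_targets(ranked, [b"misc", b"pack", b"evil"], preset_rank=2)
```

Because the target contents are a subset of the candidates, their order is simply inherited from the candidate ranking before the tail is removed.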
In a third aspect, the present application provides a computing device comprising a processor and a memory. The memory stores computer instructions; the processor executes the computer instructions stored in the memory to cause the computing device to perform the method of the first aspect or the various possible designs of the first aspect described above, or to cause the computing device to deploy the file signature extraction apparatus provided by the second aspect or the various possible designs of the second aspect described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein computer instructions for instructing the computing device to execute the method of the above-described first aspect or the various possible designs of the first aspect or for instructing the computing device to deploy the above-described second aspect or the various possible designs of the second aspect to provide the file signature extraction device.
In a fifth aspect, the present application provides a computer program product comprising computer instructions. Optionally, the computer instructions are stored in a computer readable storage medium. The processor of the computing device may read the computer instructions from a computer-readable storage medium, the processor executing the computer instructions to cause the computing device to perform the method provided by the above-described first aspect or the various possible designs of the first aspect, such that the computing device deploys the file signature extraction apparatus provided by the above-described second aspect or the various possible designs of the second aspect.
In a sixth aspect, an embodiment of the present application provides a chip, including a memory for storing computer instructions, and a processor for calling and executing the computer instructions from the memory to perform the method of the first aspect or any possible implementation of the first aspect.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic diagram of another application scenario provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of still another application scenario provided in an embodiment of the present application;
fig. 4 is a schematic flow chart of a file signature extraction method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another method for extracting a file signature according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for extracting a file signature according to an embodiment of the present application;
Fig. 7 is a flowchart of another method for extracting a file signature according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a file signature extraction device according to the present application;
fig. 9 is a schematic diagram of a basic hardware architecture of a computing device according to the present application.
Detailed Description
The main implementation principle, specific implementations, and corresponding beneficial effects of the technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings. The terms "first," "second," and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the application, unless otherwise indicated, "a plurality" means two or more.
File signature extraction in the embodiments of the application refers to using an automated technique to extract part of the binary content of malicious software as a signature for identifying that malicious software. Malware includes Trojan horses, viruses, backdoor programs, adware, and the like. Because the signature is extracted automatically, the method addresses the excessive workload of malware analysts and the low efficiency of signature extraction, and meets application requirements.
The file signature extraction method and device provided by the embodiment of the application can be applied to servers, firewalls, gateway equipment, or terminals such as hosts.
Optionally, the method and the device for extracting the file signature provided by the embodiment of the application can be applied to application scenes shown in fig. 1, fig. 2 and fig. 3. Fig. 1, fig. 2 and fig. 3 are only illustrative examples of three possible application scenarios of the file signature extraction method provided by the embodiment of the present application, and the application scenarios of the file signature extraction method provided by the embodiment of the present application are not limited to the application scenarios shown in fig. 1, fig. 2 and fig. 3.
Fig. 1 is a schematic diagram of an enterprise network architecture. In fig. 1, the enterprise network architecture includes an analysis device 101, a network access device 102 such as a firewall or security gateway, a switch 103 coupled to the network access device 102, and a plurality of hosts 104 coupled to the switch 103. The analysis device 101 is connected to the network access device 102. The analysis device 101 may be, for example, an intrusion prevention system (IPS) device or a unified threat management (UTM) device. The analysis device 101 is configured to receive the malicious binary software sent by the firewall or security gateway serving as the network access device 102, or sent by client software installed on an intranet host 104, to extract a signature of the malicious binary software, where the signature is used to identify the malicious binary software, and to output the signature.
Fig. 2 is a schematic diagram of a cloud network architecture. In fig. 2, the cloud network architecture includes an analysis device 201 located on the core network side, and a plurality of firewall devices 202 in the access network. Wherein the analysis device 201 may be configured to extract a signature of the malicious binary software, the signature being configured to identify the malicious binary software, and to receive the malicious binary software from the firewall device 202 and output the signature of the malicious binary software.
Fig. 3 is a schematic diagram of a terminal architecture. In fig. 3, taking a terminal as an example of a mobile phone, the mobile phone actually carries the function of the analysis device 301, and the analysis device 301 may receive an operation instruction of a user to perform corresponding processing. Illustratively, the user may input an extraction instruction to the handset, from which the analysis device 301 extracts the signature of the malicious binary software. The user may also input an output instruction to the mobile phone, and the analysis device 301 outputs a signature of the malicious binary software according to the output instruction. Therefore, the malicious software is identified by automatically extracting the signature, the workload is low, the signature extracting efficiency is high, and the application requirement is met.
It should be understood that, the network architecture and the service scenario described in the embodiments of the present application are for more clearly describing the technical solution of the embodiments of the present application, and are not limited to the technical solution provided in the embodiments of the present application, and those skilled in the art can know that, with the evolution of the network architecture and the appearance of the new service scenario, the technical solution provided in the embodiments of the present application is equally applicable to similar technical problems.
The file signature extraction method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings. The method may be executed by the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3. The workflow of each of these analysis devices mainly comprises an extraction stage and a selection stage. In the extraction stage, the analysis device extracts at least two pieces of binary content from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software. In the selection stage, the analysis device obtains at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical index of each piece of binary content in the first software set and in the second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability. The analysis device then obtains at least one piece of binary content from the candidate binary content as target binary content according to a second type of statistical index of each piece of candidate binary content in the first software set and in the second software set, wherein the second type of statistical index comprises at least one of software coverage proportion and set similarity. Finally, the analysis device obtains a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
The following description of the present application is given by taking several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 4 is a schematic flow chart of a file signature extraction method according to an embodiment of the present application, where an execution subject of the embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 4, the method may include the following steps.
S401: at least two pieces of binary content are extracted from a first set of software containing a first number of non-malicious binary software and a second set of software containing a second number of malicious binary software.
Here, the first software set may include a plurality of non-malicious binary software, and the first number may be set according to practical situations, which is not particularly limited by the embodiment of the present application. Similarly, the second software set may include a plurality of malicious binary software, where the second number may be set according to practical situations, and the embodiment of the present application is not limited in this way.
In the embodiment of the present application, the analysis device may receive a plurality of non-malicious binaries input by an external device (for example, the device with a firewall deployed thereon), or may obtain a plurality of non-malicious software from the non-malicious software stored in the memory, and specifically how to obtain the non-malicious binaries may be determined according to actual situations, which is not particularly limited in the embodiment of the present application.
The malicious binary software may be malicious binary software that requires signature extraction. The analysis device may receive a plurality of malicious binary software input by the external device, or may obtain a plurality of malicious software from the malicious software stored in the memory and needing signature extraction, and specifically how to obtain the malicious binary software may be determined according to the actual situation, which is not particularly limited in the embodiment of the present application.
In some possible embodiments, the extracting at least two pieces of binary content from the first software set and the second software set includes:
And extracting binary contents with preset length from each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
The preset length may be set according to actual needs, and may be a fixed length or an unfixed length, which is not particularly limited in the embodiment of the present application.
Illustratively, for each binary file in the first software set and the second software set, a fixed length (e.g., 4 bytes) binary content is extracted in a sliding window manner.
Specifically, a window with k bytes may be set, and the sliding direction is to slide from left to right along the binary file (i.e. slide from a low address to a high address in the storage space occupied by the binary file), where k is a natural number greater than 1, and the displacement of each sliding is one byte. For each binary file in the first software set and the second software set, the window is slid from left to right, one unit is slid each time, and binary content with the size of k bytes is extracted.
The binary content of each binary file in the first software set and the second software set is quickly extracted through the sliding window, so that the method is simple and convenient, and meets the application requirements.
In addition, after the binary contents are extracted, the position at which each extracted binary content first occurs in the binary file and its number of occurrences may be recorded, so that binary contents meeting the requirements can be obtained from the extracted binary contents according to the position and the number of occurrences in subsequent processing.
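The sliding-window extraction and the bookkeeping of first position and occurrence count can be sketched as follows. This is an illustrative sketch only; the function name `extract_windows` and the default window size of 4 bytes (taken from the example above) are assumptions.

```python
from collections import Counter


def extract_windows(data: bytes, k: int = 4):
    """Slide a k-byte window one byte at a time from low to high address.

    Returns (counts, first_pos): how often each k-byte content occurs in the
    file and the offset at which it first occurs.
    """
    counts = Counter()
    first_pos = {}
    for i in range(len(data) - k + 1):
        chunk = data[i:i + k]
        counts[chunk] += 1
        first_pos.setdefault(chunk, i)  # record only the first occurrence
    return counts, first_pos


counts, first_pos = extract_windows(b"abcdabcd", k=4)
```

For the 8-byte input above, the window visits offsets 0 through 4, so `b"abcd"` is counted twice and its first position is 0. A file shorter than `k` bytes yields no content.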
S402: and obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical index of each piece of binary content in the first software set and a first type of statistical index in the second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability.
Here, the frequency of occurrence may be the number of times the binary content occurs.
The occurrence probability may be a conditional probability. A conditional probability is the probability that an event A occurs given that an event B has occurred, written P(A|B). If there are only two events A and B, then P(A|B) = P(AB) / P(B), where P(AB) is the probability that A and B both occur.
Taking binary content abcd as an example, its occurrence probability in a certain kind of malware is written P(abcd):
P(abcd) = P(a) P(b|a) P(c|b) P(d|c). That is, the occurrence probability of abcd is the product of four probabilities, where P(a) is the ratio of the number of occurrences of character a in the malware to the number of occurrences of all characters.
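As a rough illustration of the chain product P(a) P(b|a) P(c|b) P(d|c), the sketch below estimates the factors from byte counts. The text defines only P(a); estimating P(b|a) as the number of occurrences of the pair ab divided by the number of pairs starting with a is an assumption of this sketch, and `build_model` and `occurrence_probability` are hypothetical names.

```python
from collections import Counter


def build_model(samples):
    """Count single bytes, adjacent byte pairs, and pair-leading bytes."""
    unigrams, bigrams, firsts = Counter(), Counter(), Counter()
    for data in samples:
        unigrams.update(data)            # iterating bytes yields ints
        for x, y in zip(data, data[1:]):
            bigrams[(x, y)] += 1
            firsts[x] += 1
    return unigrams, bigrams, firsts


def occurrence_probability(content: bytes, model) -> float:
    """P(c0) * P(c1|c0) * ... with P(y|x) ~ count(xy) / count(x as pair start)."""
    unigrams, bigrams, firsts = model
    total = sum(unigrams.values())
    if not content or total == 0:
        return 0.0
    p = unigrams[content[0]] / total
    for x, y in zip(content, content[1:]):
        if firsts[x] == 0:
            return 0.0
        p *= bigrams[(x, y)] / firsts[x]
    return p


model = build_model([b"abab"])
p_ab = occurrence_probability(b"ab", model)
p_aa = occurrence_probability(b"aa", model)
```

In the sample `b"abab"`, byte a makes up half of all bytes and is always followed by b, so `p_ab` is 0.5, while the pair aa never occurs and `p_aa` is 0.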
In some possible embodiments, when the first type of statistical indicator includes the occurrence frequency, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set, where each piece of binary content is included in the at least two pieces of binary content, including:
Obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in a first software set from the at least two sections of binary contents as first binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the first binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
Here, if the first type of statistical indicator includes the frequency of occurrence, according to the frequency of occurrence of the at least two pieces of binary content in the non-malicious binary software of the first software set and the frequency of occurrence of the at least two pieces of binary content in the malicious binary software of the second software set, selecting, as the candidate binary content, binary content having low frequency of occurrence in the non-malicious software and high frequency of occurrence in the malicious software from the at least two pieces of binary content.
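The two-threshold frequency selection can be sketched as below. The sketch assumes the occurrence frequencies have already been aggregated per software set into dictionaries; the names `select_by_frequency`, `low` and `high` are hypothetical, and the threshold values are arbitrary.

```python
def select_by_frequency(benign_freq, malicious_freq, low, high):
    """Keep contents that are rare in benign software and common in malware.

    benign_freq / malicious_freq map a piece of binary content to its number
    of occurrences in the first / second software set.
    """
    if not low < high:
        raise ValueError("the first threshold must be smaller than the second")
    contents = benign_freq.keys() | malicious_freq.keys()
    # Step 1: frequency in the first (non-malicious) set below the first threshold.
    first = [c for c in contents if benign_freq.get(c, 0) < low]
    # Steps 2 and 3: look up the frequency in the second set and keep high ones.
    return sorted(c for c in first if malicious_freq.get(c, 0) > high)


benign_freq = {b"MZ\x90\x00": 50, b"evil": 1}
malicious_freq = {b"MZ\x90\x00": 60, b"evil": 40, b"rare": 2}
candidates = select_by_frequency(benign_freq, malicious_freq, low=5, high=10)
```

The common header bytes are discarded because they are frequent in non-malicious software, and `b"rare"` is discarded because it is not frequent enough in malware.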
In some possible embodiments, when the first type of statistical indicator includes the occurrence probability, according to the first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set, for each of the at least two pieces of binary content, obtaining at least one piece of binary content from the at least two pieces of binary content as a candidate binary content, including:
obtaining binary contents with occurrence probability lower than a first preset probability threshold value in a first software set from the at least two sections of binary contents as second binary contents;
Obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining the binary contents with occurrence probability higher than a second preset probability threshold value in the second software set from the second binary contents as the candidate binary contents, wherein the first preset probability threshold value is smaller than the second preset probability threshold value.
Illustratively, if the first type of statistical indicator includes the occurrence probability, according to the occurrence probability of the at least two pieces of binary content in the non-malicious binary software of the first software set and the occurrence probability of the at least two pieces of binary content in the malicious binary software of the second software set, selecting, as the candidate binary content, the binary content with the low occurrence probability in the non-malicious software and the high occurrence probability in the malicious software from the at least two pieces of binary content.
In some possible embodiments, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, according to the first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set, each piece of binary content in the at least two pieces of binary content, at least one piece of binary content is obtained from the at least two pieces of binary content as a candidate binary content, including:
according to the occurrence frequency of each binary content in the at least two binary contents in the first software set and the occurrence frequency in the second software set, at least one binary content is obtained from the at least two binary contents and used as a first binary content to be processed;
Obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each segment of binary contents in the first binary contents to be processed in the first software set and the occurrence probability in the second software set.
Here, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content according to the occurrence frequency first as a first binary content to be processed, and further, a candidate binary content may be obtained from the first binary content to be processed according to the occurrence probability.
Also, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content according to the occurrence probability, as second binary content to be processed, and further, candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Specifically, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as the candidate binary content, according to the first type of statistical indicator of each of the at least two pieces of binary content in the first software set and in the second software set, includes:
obtaining at least one piece of binary content from the at least two pieces of binary content as second binary content to be processed, according to the occurrence probability of each of the at least two pieces of binary content in the first software set and in the second software set;
obtaining the occurrence frequency, in the first software set, of each piece of binary content in the second binary content to be processed from the occurrence frequencies of the at least two pieces of binary content in the first software set, and obtaining the occurrence frequency, in the second software set, of each piece of binary content in the second binary content to be processed from the occurrence frequencies of the at least two pieces of binary content in the second software set;
and obtaining the candidate binary content from the second binary content to be processed according to the occurrence frequency of each piece of binary content in the second binary content to be processed in the first software set and in the second software set.
Here, the specific means of obtaining at least one piece of binary content from the at least two pieces of binary content as the candidate binary content is refined according to the specific parameters included in the first type of statistical indicator, so that the different requirements of different application scenarios can be met.
S403: obtaining at least one piece of binary content from the candidate binary content as target binary content according to a second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and in the second software set, wherein the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity.
Here, the software coverage ratio of the candidate binary content in the first software set may be understood as the proportion of the binary software in the first software set that contains the candidate binary content. For example, if the first software set contains 10 pieces of binary software and 5 of them contain the candidate binary content, the software coverage ratio of the candidate binary content in the first software set is 50%.
Similarly, the software coverage ratio of the candidate binary content in the second software set may be understood as the coverage ratio of the candidate binary content in the second software set.
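The coverage ratio described above can be sketched as follows. This is a minimal illustration, assuming that each piece of binary software is available as a Python `bytes` object and that "contains" means a plain substring match; the helper name `coverage_ratio` and the sample data are hypothetical and are not taken from the application.

```python
def coverage_ratio(content: bytes, software_set: list[bytes]) -> float:
    """Fraction of binaries in software_set that contain content at least once."""
    if not software_set:
        return 0.0  # assumption: an empty set yields a ratio of zero
    hits = sum(1 for binary in software_set if content in binary)
    return hits / len(software_set)


# Ten hypothetical non-malicious binaries; five of them contain the marker bytes.
marker = b"\xde\xad\xbe\xef"
benign = [b"\x00" * 8 + marker for _ in range(5)] + [b"\x01" * 12 for _ in range(5)]

ratio = coverage_ratio(marker, benign)
print(f"coverage ratio: {ratio:.0%}")
```

With the sample data, five of the ten binaries contain the marker, which matches the 50% example in the text.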
The set similarity may be determined by a Jaccard coefficient, also known as the Jaccard similarity coefficient, which is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the similarity of the samples. Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B, as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Here, the set similarity of the candidate binary content in the first software set may be understood as the similarity between the sets of binary software in the first software set in which the individual pieces of candidate binary content appear. For example, suppose the first software set contains 100 pieces of binary software and the candidate binary content includes a first binary content and a second binary content. The first binary content appears in 20 pieces of binary software in the first software set, which form a first set; the second binary content appears in 10 pieces of binary software in the first software set, which form a second set. The similarity of the first set and the second set, determined according to the Jaccard coefficient, is taken as the set similarity of the candidate binary content in the first software set.
Similarly, the set similarity of the candidate binary content in the second software set may be understood as the similarity of each set of the candidate binary content appearing in the second software set.
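As a hedged illustration of the Jaccard-based set similarity, the sketch below compares the two sets of binary software from the example above. The identifiers `sample_0` ... `sample_19`, the helper name `jaccard`, and the way the two sets overlap are assumptions made only for this example; the application does not state how the 20 and 10 pieces of software overlap.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two finite sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(a & b) / len(union)


# Hypothetical software identifiers: the first content appears in 20 binaries,
# the second content appears in 10 binaries that all belong to those 20.
first_set = {f"sample_{i}" for i in range(20)}
second_set = {f"sample_{i}" for i in range(10, 20)}

similarity = jaccard(first_set, second_set)
print(similarity)
```

Under this assumed overlap the intersection has 10 elements and the union has 20, so the coefficient is 0.5; disjoint sets would give 0 and identical sets would give 1.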
In some possible embodiments, when the second type of statistical indicator includes the software coverage ratio, obtaining at least one piece of binary content from the candidate binary content as the target binary content, according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and in the second software set, includes:
Obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in a first software set from the candidate binary contents as seventh binary contents;
Obtaining the software coverage proportion of each segment of binary content in the seventh binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary contents as target binary contents, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
Here, if the second type of statistical indicator includes the software coverage ratio, binary content with a low software coverage ratio in non-malicious software and a high software coverage ratio in malicious software is selected from the candidate binary content as the target binary content, according to the software coverage ratio of the candidate binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set.
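The two-threshold selection just described can be sketched as below. It is a minimal illustration under stated assumptions: the coverage ratios are taken as already computed and keyed by a hex label of each candidate, and the function name, the sample ratios, and the thresholds 0.05 and 0.60 are all hypothetical values chosen for the example.

```python
def select_targets_by_coverage(
    cov_benign: dict[str, float],
    cov_malicious: dict[str, float],
    first_threshold: float,
    second_threshold: float,
) -> list[str]:
    """Keep candidates with low coverage in benign and high coverage in malicious software."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    # Step 1: low coverage in the first (non-malicious) software set.
    seventh = [c for c, ratio in cov_benign.items() if ratio < first_threshold]
    # Steps 2 and 3: look up coverage in the second (malicious) set and keep the high ones.
    return [c for c in seventh if cov_malicious[c] > second_threshold]


# Hypothetical, precomputed coverage ratios for three candidates.
cov_benign = {"4d5a9000": 0.90, "e8a1b2c3": 0.02, "deadbeef": 0.01}
cov_malicious = {"4d5a9000": 0.95, "e8a1b2c3": 0.80, "deadbeef": 0.10}

targets = select_targets_by_coverage(cov_benign, cov_malicious, 0.05, 0.60)
print(targets)
```

The first candidate is dropped because it is common in non-malicious software, and the third because it is rare in malicious software, leaving only the second.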
In other possible embodiments, when the second type of statistical indicator includes the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content, according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and in the second software set, includes:
obtaining binary contents with set similarity lower than a first preset similarity threshold value in the first software set from the candidate binary contents as eighth binary contents;
obtaining the set similarity of each piece of binary content in the eighth binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
And obtaining binary contents with set similarity higher than a second preset similarity threshold value in the second software set from the eighth binary contents as target binary contents, wherein the first preset similarity threshold value is smaller than the second preset similarity threshold value.
For example, if the second type of statistical indicator includes the set similarity, binary content with low set similarity in non-malicious software and high set similarity in malicious software is selected from the candidate binary content as the target binary content, according to the set similarity of the candidate binary content in the non-malicious binary software of the first software set and in the malicious binary software of the second software set.
In some other possible embodiments, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content, according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and in the second software set, includes:
obtaining at least one piece of binary content from the candidate binary content as first characteristic binary content, according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and in the second software set;
Acquiring the set similarity of each piece of binary content in the first characteristic binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and acquiring the set similarity of each piece of binary content in the first characteristic binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining target binary contents from the first characteristic binary contents according to the set similarity of each piece of binary contents in the first characteristic binary contents in the first software set and the set similarity in the second software set.
Here, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary contents according to the software coverage ratio as the first feature binary content, and further, the target binary content may be obtained from the first feature binary content according to the set similarity.
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content according to the set similarity as the second feature binary content, and further, the target binary content may be obtained from the second feature binary content according to the software coverage ratio.
Specifically, when the second type of statistical indicator includes the software coverage ratio and the set similarity, obtaining at least one piece of binary content from the candidate binary content as the target binary content, according to the second type of statistical indicator of each piece of binary content in the candidate binary content in the first software set and in the second software set, includes:
obtaining at least one piece of binary content from the candidate binary content as second characteristic binary content, according to the set similarity of each piece of binary content in the candidate binary content in the first software set and in the second software set;
obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set and the software coverage proportion in the second software set respectively.
Here, the specific means of obtaining the target binary content from the candidate binary content is refined according to the specific parameters included in the second type of statistical indicator, so that the different requirements of different application scenarios can be met.
In addition, before obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further includes:
determining information entropy values corresponding to each piece of binary content in the candidate binary content according to the occurrence frequency of context characters in the binary software of the second software set;
And deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
The higher the information entropy value, the more random the context; the lower the information entropy value, the more consistent the context. If the information entropy of a candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the contexts in which that candidate binary content appears differ across the malicious binary software of the second software set, and the candidate binary content is deleted (if the candidate binary content were malicious, its context would be the same across the malicious binary software of the second software set). Here, the preset entropy threshold may be set according to actual conditions, and the present application does not particularly limit it.
The information entropy is calculated as follows:
$$H = -\sum_{i} p_i \log p_i$$
where $p_i$ denotes the ratio of the number of occurrences of the i-th character to the number of occurrences of all characters.
Here, after determining the information entropy value of the candidate binary content in the malicious binary software of the second software set, filtering the candidate binary content through the information entropy value, deleting the binary content with the information entropy value higher than the preset entropy value threshold from the candidate binary content, so that the subsequent processing result is more accurate, the subsequent processing is simpler, and the application requirement is met.
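The entropy-based filtering can be sketched as follows. This is only an illustration under several assumptions that the application does not fix: the "context" is taken to be a fixed window of bytes before and after each occurrence (`window=4` here), the logarithm is taken to base 2, and the helper names and sample bytes are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy of the byte distribution, with p_i = count_i / total (base-2 log assumed)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def context_bytes(content: bytes, binary: bytes, window: int = 4) -> bytes:
    """Bytes immediately before and after every occurrence of content in binary."""
    out = bytearray()
    start = binary.find(content)
    while start != -1:
        end = start + len(content)
        out += binary[max(0, start - window):start]
        out += binary[end:end + window]
        start = binary.find(content, start + 1)
    return bytes(out)


def drop_high_entropy(candidates, malicious, threshold, window=4):
    """Delete candidates whose context entropy in the malicious set exceeds threshold."""
    kept = []
    for content in candidates:
        ctx = b"".join(context_bytes(content, b, window) for b in malicious)
        if shannon_entropy(ctx) <= threshold:
            kept.append(content)
    return kept


# One hypothetical malicious binary: "AAAA" sits in a constant context,
# "BBBB" is surrounded by eight distinct bytes.
malicious = [b"\x00" * 4 + b"AAAA" + b"\x00" * 4 + bytes([1, 2, 3, 4]) + b"BBBB" + bytes([5, 6, 7, 8])]
kept = drop_high_entropy([b"AAAA", b"BBBB"], malicious, threshold=1.0)
print(kept)
```

In this toy input the context of `AAAA` has entropy 0 and the context of `BBBB` has entropy 3 bits, so only `AAAA` survives a threshold of 1.0.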
S404: and obtaining a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
Optionally, firstly, according to the corresponding relation between the target binary content and the malicious binary software in the second software set, determining the target binary content corresponding to each malicious binary software, then combining the target binary content corresponding to each malicious binary software, and taking the combined result as the signature of the corresponding malicious binary software for identifying the malicious binary software.
Optionally, the target binary content corresponding to each piece of malicious binary software is combined according to a preset combination requirement, where the preset combination requirement can be set according to the actual situation and is not particularly limited in the embodiment of the application.
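One possible reading of S404 is sketched below. The application leaves the combination requirement open, so the rule used here (hex-encode the matched target contents, sort them, and join them with `|`) is purely an assumption for illustration, as are the function name `build_signatures` and the sample names `mal_a`, `mal_b`, and `mal_c`.

```python
def build_signatures(targets: list[bytes], malicious: dict[str, bytes]) -> dict[str, str]:
    """Map each malicious binary to a signature built from the target contents it contains."""
    signatures = {}
    for name, binary in malicious.items():
        # Correspondence between target contents and this piece of malicious software.
        matched = [t for t in targets if t in binary]
        if matched:
            # Illustrative combination rule only: sorted hex strings joined by '|'.
            signatures[name] = "|".join(sorted(t.hex() for t in matched))
    return signatures


targets = [b"\xde\xad", b"\xbe\xef"]
malicious = {
    "mal_a": b"\x00\xde\xad\x00\xbe\xef",
    "mal_b": b"\xbe\xef\x11",
    "mal_c": b"\x22\x33",
}

signatures = build_signatures(targets, malicious)
print(signatures)
```

A binary that contains none of the target contents, such as `mal_c` here, receives no signature in this sketch; a real combination requirement might handle that case differently.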
In some possible embodiments, if the candidate binary content includes at least two pieces of binary content, after the candidate binary content is obtained from the at least two pieces of binary content, the method further includes:
Determining an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following components: the number of the candidate binary contents in the first software set, the number of the candidate binary contents in the second software set, the occurrence probability of the candidate binary contents in the first software set, the occurrence probability of the candidate binary contents in the second software set, the position of the first occurrence of the candidate binary contents in each binary software of the first software set, the mean, variance and entropy of the position of the first occurrence of the candidate binary contents in each binary software of the second software set, and the printable character of the candidate binary contents;
According to the importance ranking rule, ranking at least two segments of binary contents in the candidate binary contents according to the importance from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
determining the sorting result of the target binary content according to the sorting result of the candidate binary content;
And deleting the binary contents with the sequence numbers after the preset sequence numbers from the target binary contents.
Here, according to the attribute information of the candidate binary contents, an importance ranking rule is determined, then the candidate binary contents are ranked from high importance to low importance according to the importance ranking rule, and further, according to the ranking result of the candidate binary contents, the ranking result of the target binary contents is determined, and the binary contents with the ranking sequence numbers behind the preset sequence numbers are deleted from the target binary contents, so that signatures of corresponding malicious binary software can be generated according to binary contents with higher importance, and the generated signatures can identify the malicious binary software more accurately.
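The ranking and truncation steps can be sketched as follows. The application does not specify the importance ranking rule, so the score used here (occurrence probability in the malicious set minus occurrence probability in the non-malicious set) is an assumed stand-in, and the names `rank_candidates`, `truncate_targets`, `p_benign`, and `p_malicious` are hypothetical.

```python
def rank_candidates(attrs: dict[bytes, dict[str, float]]) -> list[bytes]:
    """Order candidates from most to least important under an assumed scoring rule."""
    def score(content: bytes) -> float:
        # Assumed rule: more frequent in malicious and rarer in benign means more important.
        return attrs[content]["p_malicious"] - attrs[content]["p_benign"]
    return sorted(attrs, key=score, reverse=True)


def truncate_targets(targets: list[bytes], ranked: list[bytes], keep: int) -> list[bytes]:
    """Order targets by their candidate rank and drop those after the preset position."""
    position = {content: i for i, content in enumerate(ranked)}
    ordered = sorted(targets, key=position.__getitem__)
    return ordered[:keep]


attrs = {
    b"\xaa\xaa": {"p_benign": 0.10, "p_malicious": 0.90},
    b"\xbb\xbb": {"p_benign": 0.00, "p_malicious": 0.60},
    b"\xcc\xcc": {"p_benign": 0.50, "p_malicious": 0.95},
}
ranked = rank_candidates(attrs)

# Suppose the second-type indicators kept only two of the three candidates.
targets = truncate_targets([b"\xcc\xcc", b"\xaa\xaa"], ranked, keep=1)
print(ranked, targets)
```

The targets inherit their order from the candidate ranking, so no second scoring pass is needed after the target binary content is selected.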
According to the embodiment of the application, at least two pieces of binary content are extracted from the first software set and the second software set; candidate binary content is then obtained from them according to the first type of statistical indicator of each piece of binary content in the first software set and in the second software set; target binary content is obtained from the candidate binary content according to the second type of statistical indicator; and the signature of the corresponding malicious binary software is obtained from the target binary content. This addresses the excessive workload of software analysts and the low efficiency of signature extraction, and improves the efficiency of extracting malicious software signatures. In addition, because the signature extraction method provided by the embodiment of the application is not affected by the personal experience and subjective factors of analysts, the accuracy of malicious software signature extraction is improved to a certain extent.
Fig. 5 is a schematic flow chart of another method for extracting a file signature according to an embodiment of the present application, where an execution subject of the embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and the specific execution subject may be determined according to an actual application scenario. As shown in fig. 5, the method may include:
S501: at least two pieces of binary content are extracted from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software.
The implementation manner of step S501 is the same as that of step S401, and will not be described here again.
Optionally, the analysis device performs one of the following sub-flows: the sub-flow of steps S502 to S504, the sub-flow of steps S505 to S507, or the sub-flow of steps S508 to S510.
S502: when the first type of statistical index comprises the occurrence frequency, obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents.
S503: and obtaining the frequency of occurrence of each piece of binary content in the first binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content.
S504: and obtaining the binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
Here, if the first type of statistical indicator includes the frequency of occurrence, according to the frequency of occurrence of the at least two pieces of binary content in the non-malicious binary software of the first software set and the frequency of occurrence of the at least two pieces of binary content in the malicious binary software of the second software set, the binary content with the low frequency of occurrence of the non-malicious software and the high frequency of occurrence of the malicious software in the at least two pieces of binary content is selected as the candidate binary content.
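Steps S502 to S504 can be sketched as below. The sketch assumes that the occurrence frequency of a piece of binary content in a software set is the total number of (non-overlapping) occurrences across all binaries of that set, which is one plausible reading rather than a definition taken from this section; the helper names, sample bytes, and thresholds are hypothetical.

```python
def occurrence_frequency(content: bytes, software_set: list[bytes]) -> int:
    """Assumed definition: total non-overlapping occurrences across the whole set."""
    return sum(binary.count(content) for binary in software_set)


def select_by_frequency(contents, benign, malicious, first_threshold, second_threshold):
    """S502-S504: low frequency in the benign set, then high frequency in the malicious set."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    # S502: first binary content, rare in the first (non-malicious) software set.
    first = [c for c in contents if occurrence_frequency(c, benign) < first_threshold]
    # S503 and S504: look up the frequency in the second set and keep the frequent ones.
    return [c for c in first if occurrence_frequency(c, malicious) > second_threshold]


benign = [b"xxABxx", b"yyCDyy"]
malicious = [b"CDzzCD", b"CDAB", b"qqCD"]

candidates = select_by_frequency([b"AB", b"CD", b"ZZ"], benign, malicious, 2, 3)
print(candidates)
```

Here all three contents are rare in the non-malicious set, but only `CD` occurs more than three times in the malicious set, so it is the only candidate.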
S505: and when the first type of statistical indexes comprise occurrence probabilities, obtaining binary contents with occurrence probabilities lower than a first preset probability threshold value in the first software set from the at least two sections of binary contents as second binary contents.
S506: obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each of the at least two pieces of binary content in the second software set.
S507: and obtaining the binary contents with occurrence probability higher than a second preset probability threshold value in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold value is smaller than the second preset probability threshold value.
Illustratively, if the first type of statistical indicator includes the occurrence probability, according to the occurrence probability of the at least two pieces of binary content in the non-malicious binary software of the first software set and the occurrence probability of the at least two pieces of binary content in the malicious binary software of the second software set, selecting, as the candidate binary content, the binary content with the low occurrence probability in the non-malicious software and the high occurrence probability in the malicious software from the at least two pieces of binary content.
S508: when the first type of statistical index comprises the occurrence frequency and the occurrence probability, at least one piece of binary content is obtained from the at least two pieces of binary content according to the occurrence frequency of each piece of binary content in the first software set and the occurrence frequency of each piece of binary content in the second software set, and the at least one piece of binary content is used as first binary content to be processed.
In some possible embodiments, obtaining at least one piece of binary content from the at least two pieces of binary content as the first binary content to be processed, according to the occurrence frequency of each of the at least two pieces of binary content in the first software set and in the second software set, includes:
obtaining binary contents with the occurrence frequency lower than a third preset frequency threshold value in the first software set from the at least two sections of binary contents as third binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the third binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining binary contents with occurrence frequency higher than a fourth preset frequency threshold value in the second software set from the third binary contents as the first binary contents to be processed, wherein the third preset frequency threshold value is smaller than the fourth preset frequency threshold value.
Here, according to the occurrence frequency of the at least two pieces of binary contents in the non-malicious binary software of the first software set and the occurrence frequency of the at least two pieces of binary contents in the malicious binary software of the second software set, the binary contents with low occurrence frequency in the non-malicious software and high occurrence frequency in the malicious software are selected from the at least two pieces of binary contents as the first binary contents to be processed.
S509: the occurrence probability of each binary content in the first binary content to be processed in the first software set is obtained from the occurrence probability of each binary content in the at least two binary content in the first software set, and the occurrence probability of each binary content in the first binary content to be processed in the second software set is obtained from the occurrence probability of each binary content in the at least two binary content in the second software set.
S510: obtaining the candidate binary content from the first binary content to be processed according to the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set and in the second software set.
In some possible embodiments, obtaining the candidate binary content from the first binary content to be processed, according to the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set and in the second software set, includes:
Obtaining binary contents with occurrence probability lower than a third preset probability threshold value in the first software set from the first binary contents to be processed as fourth binary contents;
Obtaining the occurrence probability of each piece of binary content in the fourth binary content in the second software set from the occurrence probability of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with occurrence probability higher than a fourth preset probability threshold value in the second software set from the fourth binary contents as candidate binary contents, wherein the third preset probability threshold value is smaller than the fourth preset probability threshold value.
Here, according to the occurrence probability of the first binary content to be processed in the non-malicious binary software of the first software set and the occurrence probability of the first binary content to be processed in the malicious binary software of the second software set, the binary content with low occurrence probability in the non-malicious software and high occurrence probability in the malicious software is selected from the first binary content to be processed as the candidate binary content.
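The frequency-then-probability sub-flow of S508 to S510 can be sketched as a two-stage filter. The statistics are assumed to be precomputed per piece of binary content; the `Stats` record, the function name, the sample values, and all four thresholds are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    freq_benign: int       # occurrence frequency in the first software set
    freq_malicious: int    # occurrence frequency in the second software set
    prob_benign: float     # occurrence probability in the first software set
    prob_malicious: float  # occurrence probability in the second software set


def two_stage_select(stats, freq_low, freq_high, prob_low, prob_high):
    """Stage 1 filters on frequency (S508); stage 2 filters the survivors on probability (S509-S510)."""
    if not (freq_low < freq_high and prob_low < prob_high):
        raise ValueError("each lower threshold must be smaller than its upper threshold")
    pending = [
        c for c, s in stats.items()
        if s.freq_benign < freq_low and s.freq_malicious > freq_high
    ]
    return [
        c for c in pending
        if stats[c].prob_benign < prob_low and stats[c].prob_malicious > prob_high
    ]


stats = {
    "c1": Stats(freq_benign=1, freq_malicious=40, prob_benign=0.01, prob_malicious=0.70),
    "c2": Stats(freq_benign=0, freq_malicious=35, prob_benign=0.00, prob_malicious=0.20),
    "c3": Stats(freq_benign=50, freq_malicious=60, prob_benign=0.40, prob_malicious=0.90),
}

candidates = two_stage_select(stats, freq_low=5, freq_high=30, prob_low=0.05, prob_high=0.50)
print(candidates)
```

Because each stage is a plain threshold test, applying the probability stage first and the frequency stage second with the same four thresholds would select the same contents; the order mainly affects how many items the second stage has to examine.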
In addition, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the above sub-flow first obtains at least one piece of binary content from the at least two pieces of binary content according to the occurrence frequency as the first binary content to be processed, and then obtains the candidate binary content from the first binary content to be processed according to the occurrence probability.
Also, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, at least one piece of binary content may be obtained from the at least two pieces of binary content according to the occurrence probability, as second binary content to be processed, and further, candidate binary content may be obtained from the second binary content to be processed according to the occurrence frequency.
Illustratively, according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability in the second software set, at least one binary content is obtained from the at least two binary contents and used as a second binary content to be processed, the method comprises the following steps:
Obtaining binary contents with occurrence probability lower than a fifth preset probability threshold value in the first software set from the at least two sections of binary contents as fifth binary contents;
Obtaining the occurrence probability of each piece of binary content in the fifth binary content in the second software set from the occurrence probability of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining binary contents with occurrence probability higher than a sixth preset probability threshold value in the second software set from the fifth binary contents as the second binary contents to be processed, wherein the fifth preset probability threshold value is smaller than the sixth preset probability threshold value.
Here, according to the occurrence probability of the at least two pieces of binary content in the non-malicious binary software of the first software set and the occurrence probability of the at least two pieces of binary content in the malicious binary software of the second software set, the binary content with low occurrence probability in the non-malicious software and high occurrence probability in the malicious software is selected from the at least two pieces of binary content as the second binary content to be processed.
In some possible embodiments, according to the occurrence frequency of each piece of binary content in the second to-be-processed binary content in the first software set and the occurrence frequency in the second software set, obtaining the candidate binary content from the second to-be-processed binary content includes:
Obtaining binary contents with the occurrence frequency lower than a fifth preset frequency threshold value in the first software set from the second binary contents to be processed as sixth binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the sixth binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
and obtaining the binary contents with the occurrence frequency higher than a sixth preset frequency threshold value in the second software set from the sixth binary contents as candidate binary contents, wherein the fifth preset frequency threshold value is smaller than the sixth preset frequency threshold value.
Here, according to the occurrence frequency of the second binary content to be processed in the non-malicious binary software of the first software set and the occurrence frequency of the second binary content to be processed in the malicious binary software of the second software set, the binary content with low occurrence frequency in the non-malicious software and high occurrence frequency in the malicious software is selected from the second binary content to be processed as the candidate binary content.
S511: and obtaining at least one section of binary content from the candidate binary content as target binary content according to a second type of statistical index of each section of binary content in the first software set and a second type of statistical index in the second software set, wherein the second type of statistical index comprises at least one of software coverage proportion and set similarity.
S512: and obtaining a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
The implementation manners of steps S511-S512 are the same as those of steps S403-S404, and are not repeated here.
According to this embodiment of the application, after at least two pieces of binary content are extracted from the first software set and the second software set, the binary content with a low occurrence frequency and/or occurrence probability in the non-malicious software and a high occurrence frequency and/or occurrence probability in the malicious software is selected from the at least two pieces of binary content as the candidate binary content. The signatures extracted subsequently can therefore cover most of the malicious software while avoiding overlap with the content of non-malicious software, which reduces false alarms. Target binary content is then obtained from the candidate binary content, and the signature of the corresponding malicious software is obtained according to the target binary content. In this way, part of the binary content of the malicious software is extracted automatically as a signature for identifying the malicious software, without manual analysis, which addresses the excessive workload of malware analysts and the low efficiency of signature extraction, improves the extraction efficiency of malware signatures, and meets application requirements. In addition, because the signature extraction method provided by this embodiment is not affected by the personal experience and subjective factors of analysts, the accuracy of malware signature extraction is improved to a certain extent.
Fig. 6 is a schematic flow chart of another method for extracting a file signature according to an embodiment of the present application, where an execution subject of the embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 6, the method may include:
S601: at least two pieces of binary content are extracted from a first set of software containing a first number of non-malicious binary software and a second set of software containing a second number of malicious binary software.
S602: and obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical index of each piece of binary content in the first software set and a first type of statistical index in the second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability.
The implementation manners of steps S601-S602 are the same as those of steps S401-S402, and are not repeated here.
Optionally, the analysis device performs one of the following alternative sub-flows: steps S603 to S605, steps S606 to S608, or steps S609 to S611.
S603: and when the second type of statistical index comprises the software coverage proportion, obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents.
S604: and obtaining the software coverage proportion of each piece of binary content in the seventh binary content in the second software set from the software coverage proportion of each piece of binary content in the candidate binary content in the second software set.
S605: and obtaining binary contents with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary contents as target binary contents, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
Here, if the second type of statistical indicator includes the software coverage ratio, selecting, as the target binary content, a binary content having a low software coverage ratio among non-malicious software and a high software coverage ratio among malicious software from the candidate binary content according to the software coverage ratio among the non-malicious binary software of the first software set and the software coverage ratio among the malicious binary software of the second software set.
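Steps S603 to S605 can be sketched as follows. This section does not define how the software coverage proportion is computed, so the sketch assumes it is the fraction of files in a set that contain the content at least once. The function names and sample data are hypothetical.

```python
from typing import List


def coverage_ratio(content: bytes, software_set: List[bytes]) -> float:
    """Assumed definition: fraction of files in the set containing `content` at least once."""
    if not software_set:
        return 0.0
    return sum(content in blob for blob in software_set) / len(software_set)


def select_targets(
    candidates: List[bytes],
    benign_set: List[bytes],
    malicious_set: List[bytes],
    first_ratio: float,
    second_ratio: float,
) -> List[bytes]:
    # S603: keep candidates whose coverage in the non-malicious set is below the first threshold.
    seventh = [c for c in candidates if coverage_ratio(c, benign_set) < first_ratio]
    # S604-S605: of those, keep the ones whose coverage in the malicious set exceeds the second threshold.
    return [c for c in seventh if coverage_ratio(c, malicious_set) > second_ratio]


# Hypothetical file contents, for illustration only.
benign_set = [b"\x7fELF\x01\x02", b"\x7fELF\x03\x04"]
malicious_set = [b"\x7fELF\xca\xfe\x01", b"\x7fELF\xca\xfe\x02", b"\x7fELF\x09\x09"]
targets = select_targets([b"\x7fELF", b"\xca\xfe"], benign_set, malicious_set, 0.1, 0.5)
```

Here the file-header bytes are discarded because every benign file contains them, while `b"\xca\xfe"` is kept because it appears in no benign file and in two of the three malicious files.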
S606: and when the second type of statistical indexes comprise the set similarity, obtaining binary contents with the set similarity lower than a first preset similarity threshold value in the first software set from the candidate binary contents as eighth binary contents.
S607: and obtaining the set similarity of each piece of binary content in the eighth binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set.
S608: and obtaining binary contents with set similarity higher than a second preset similarity threshold value in the second software set from the eighth binary contents as target binary contents, wherein the first preset similarity threshold value is smaller than the second preset similarity threshold value.
For example, if the second type of statistical indicator includes the set similarity, according to the set similarity of the candidate binary contents in the non-malicious binary software of the first software set and the set similarity of the candidate binary contents in the malicious binary software of the second software set, selecting, as the target binary content, the binary content with low set similarity in the non-malicious software and high set similarity in the malicious software from the candidate binary contents.
S609: when the second type of statistical index comprises the software coverage proportion and the set similarity, at least one section of binary content is obtained from the candidate binary content as the first characteristic binary content according to the software coverage proportion of each section of binary content in the candidate binary content in the first software set and the software coverage proportion in the second software set.
In some possible embodiments, the obtaining, according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio in the second software set, at least one piece of binary content from the candidate binary content as the first feature binary content includes:
obtaining binary contents with the software coverage proportion in the first software set lower than a third preset proportion threshold value from the candidate binary contents as ninth binary contents;
Obtaining the software coverage proportion of each segment of binary content in the ninth binary content in the second software set from the software coverage proportion of each segment of binary content in the second software set;
And obtaining binary contents with the software coverage proportion higher than a fourth preset proportion threshold value in the second software set from the ninth binary contents as the first characteristic binary contents, wherein the third preset proportion threshold value is smaller than the fourth preset proportion threshold value.
Here, according to the above-mentioned candidate binary contents, the ratio of software coverage in the non-malicious binary software of the first software set and the ratio of software coverage in the malicious binary software of the second software set, the binary contents with low ratio of software coverage in the non-malicious software and high ratio of software coverage in the malicious software are selected from the above-mentioned candidate binary contents as the first characteristic binary contents.
S610: and obtaining the set similarity of each piece of binary content in the first characteristic binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and obtaining the set similarity of each piece of binary content in the first characteristic binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set.
S611: and obtaining target binary contents from the first characteristic binary contents according to the set similarity of each piece of binary contents in the first characteristic binary contents in the first software set and the set similarity in the second software set.
In some possible embodiments, the obtaining, according to the set similarity of each piece of binary content in the first feature binary content in the first software set and the set similarity of the second software set, the target binary content from the first feature binary content includes:
Obtaining binary contents with the set similarity lower than a third preset similarity threshold value in the first software set from the first characteristic binary contents as tenth binary contents;
Obtaining the set similarity of each piece of binary content in the tenth binary content in the second software set from the set similarity of each piece of binary content in the first characteristic binary content in the second software set;
And obtaining binary contents with set similarity higher than a fourth preset similarity threshold value in the second software set from the tenth binary contents as target binary contents, wherein the third preset similarity threshold value is smaller than the fourth preset similarity threshold value.
Here, the binary content with low set similarity in the non-malicious software and high set similarity in the malicious software is selected from the first characteristic binary content as the target binary content.
That is, when the second type of statistical indicator includes both the software coverage ratio and the set similarity, at least one piece of binary content is first obtained from the candidate binary content according to the software coverage ratio as the first feature binary content, and the target binary content is then obtained from the first feature binary content according to the set similarity.
Similarly, when the second type of statistical indicator includes the software coverage ratio and the set similarity, at least one piece of binary content may be obtained from the candidate binary content according to the set similarity as the second feature binary content, and further, the target binary content may be obtained from the second feature binary content according to the software coverage ratio.
Illustratively, according to the set similarity of each piece of binary content in the candidate binary content in the first software set and the set similarity in the second software set, at least one piece of binary content is obtained from the candidate binary content and used as the second characteristic binary content, and the method comprises the following steps:
obtaining binary contents with set similarity lower than a fifth preset similarity threshold value in the first software set from the candidate binary contents as eleventh binary contents;
obtaining the set similarity of each piece of binary content in the eleventh binary content in the second software set from the set similarity of each piece of binary content in the second software set;
And obtaining binary contents with set similarity higher than a sixth preset similarity threshold value in the second software set from the eleventh binary contents as second characteristic binary contents, wherein the fifth preset similarity threshold value is smaller than the sixth preset similarity threshold value.
Here, the binary content with low set similarity in the non-malicious software and high set similarity in the malicious software is selected from the candidate binary content as the second characteristic binary content.
In some possible embodiments, according to the software coverage ratio of each piece of binary content in the second feature binary content in the first software set and the software coverage ratio of the second software set, obtaining the target binary content from the second feature binary content includes:
Obtaining binary contents with the software coverage proportion in the first software set lower than a fifth preset proportion threshold value from the second characteristic binary contents as twelfth binary contents;
Obtaining the software coverage proportion of each segment of binary content in the twelfth binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the software coverage ratio higher than a sixth preset ratio threshold value in the second software set from the twelfth binary contents as target binary contents, wherein the fifth preset ratio threshold value is smaller than the sixth preset ratio threshold value.
Here, according to the software coverage ratio of the second characteristic binary content in the non-malicious binary software of the first software set and its software coverage ratio in the malicious binary software of the second software set, the binary content with a low software coverage ratio in the non-malicious software and a high software coverage ratio in the malicious software is selected from the second characteristic binary content as the target binary content.
S612: and obtaining a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
The implementation manner of step S612 is the same as that of step S404, and will not be described here again.
According to this embodiment of the application, after at least two pieces of binary content are extracted from the first software set and the second software set, at least one piece of binary content is obtained from the at least two pieces of binary content as candidate binary content according to a first type of statistical index of each piece of binary content in the first software set and in the second software set, where the first type of statistical index includes at least one of occurrence frequency and occurrence probability. Then, the binary content with a low software coverage proportion and/or set similarity in the non-malicious software and a high software coverage proportion and/or set similarity in the malicious software is selected from the candidate binary content as target binary content, and the signature of the corresponding malicious software is obtained according to the target binary content. The extracted signature can therefore cover most of the malicious software while avoiding overlap with the content of non-malicious software, which reduces false alarms. Meanwhile, part of the binary content of the malicious software is extracted automatically as a signature, without relying on manual analysis, which addresses the excessive workload of malware analysts and the low efficiency of signature extraction. In addition, because the signature extraction method provided by this embodiment is not affected by the personal experience and subjective factors of analysts, the accuracy of malware signature extraction is improved to a certain extent.
Fig. 7 is a schematic flow chart of another method for extracting a file signature according to an embodiment of the present application, where an execution subject of the embodiment may be the analysis device 101 in fig. 1, the analysis device 201 in fig. 2, or the analysis device 301 in fig. 3, and a specific execution subject may be determined according to an actual application scenario. As shown in fig. 7, the method may include:
S701: at least two pieces of binary content are extracted from a first set of software containing a first number of non-malicious binary software and a second set of software containing a second number of malicious binary software.
S702: and obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical index of each piece of binary content in a first software set and a first type of statistical index in a second software set, wherein the first type of statistical index comprises at least one of occurrence frequency and occurrence probability.
The implementation manners of steps S701-S702 are the same as those of steps S401-S402, and are not repeated here.
S703: and determining the information entropy value corresponding to each piece of binary content in the candidate binary content according to the occurrence frequency of the context character in the binary software of the second software set.
Here, the context characters include preceding characters and following characters, and the information entropy value includes a preceding entropy value and a following entropy value. The preceding characters are the preset number of characters immediately before the content corresponding to each piece of binary content in the candidate binary content in the binary software of the second software set, and the following characters are the preset number of characters immediately after that content. The preceding preset number and the following preset number may be determined according to actual conditions, and the embodiment of the present application does not specifically limit them.
Determining the information entropy value corresponding to each piece of binary content in the candidate binary content according to the occurrence frequency of the context characters in the binary software of the second software set includes:
determining the preceding entropy value corresponding to each piece of binary content in the candidate binary content according to the ratio of the number of times each character appears in the corresponding preceding characters in the binary software of the second software set to the number of times all characters appear in the corresponding preceding characters;
and determining the following entropy value corresponding to each piece of binary content in the candidate binary content according to the ratio of the number of times each character appears in the corresponding following characters in the binary software of the second software set to the number of times all characters appear in the corresponding following characters.
S704: and deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
The higher the information entropy value, the more random the context; the lower the information entropy value, the more consistent the context. If the information entropy value of a piece of candidate binary content in the malicious binary software of the second software set is higher than the preset entropy threshold, the context of that candidate binary content differs across the malicious binary software of the second software set, and the candidate binary content is deleted (if a piece of candidate binary content is malicious, its context in the malicious binary software of the second software set is expected to be the same).
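Steps S703 and S704 can be sketched as follows. The text only says that the entropy is derived from the ratio of each character's count to the total character count, so the use of Shannon's formula is an assumption. Deleting a candidate when either the preceding or the following entropy exceeds the threshold is also an assumption, as are the function names, the context width `k`, and the sample data.

```python
import math
from collections import Counter
from typing import List, Tuple


def shannon_entropy(symbols: List[int]) -> float:
    """Entropy in bits of the byte distribution in `symbols` (assumed formula)."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def context_entropy(content: bytes, malicious_set: List[bytes], k: int = 1) -> Tuple[float, float]:
    """Return (preceding entropy, following entropy) of `content` over the malicious set."""
    before: List[int] = []
    after: List[int] = []
    for blob in malicious_set:
        start = blob.find(content)
        while start != -1:
            end = start + len(content)
            before.extend(blob[max(0, start - k):start])  # up to k bytes before the match
            after.extend(blob[end:end + k])               # up to k bytes after the match
            start = blob.find(content, start + 1)
    return shannon_entropy(before), shannon_entropy(after)


# Hypothetical malicious samples: b"sig" always has the same neighbours, b"rnd" does not.
malicious_set = [b"AAsigBB", b"AAsigBB", b"xyrndpq", b"zwrnduv"]
entropy_threshold = 0.5
kept = [c for c in [b"sig", b"rnd"]
        if max(context_entropy(c, malicious_set)) <= entropy_threshold]
```

With these samples `b"sig"` has zero entropy on both sides and is kept, while `b"rnd"` has one bit of entropy on each side and is deleted.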
S705: and determining an importance ordering rule according to the attribute information corresponding to each piece of binary content in the candidate binary content.
The attribute information includes at least one of the following: the number of the candidate binary contents in the first software set; the number of the candidate binary contents in the second software set; the occurrence probability of the candidate binary contents in the first software set; the occurrence probability of the candidate binary contents in the second software set; the mean, variance and entropy of the positions, and the first occurrence position, of the candidate binary contents in each binary software of the first software set; the mean, variance and entropy of the positions, and the first occurrence position, of the candidate binary contents in each binary software of the second software set; and the printable characters of the candidate binary contents.
Here, the attribute information may be adjusted according to actual conditions, which is not limited by the embodiment of the present application.
S706: and according to the importance ranking rule, ranking at least two pieces of binary contents in the candidate binary contents according to the importance from high to low.
Illustratively, the importance ranking rule includes, in order from high importance to low importance:
The number of binary contents at the same position of the second software set is higher than a first preset number threshold;
The variance of the positions of the binary contents in the binary software of the first software set is smaller than a first preset variance threshold, the variance of the positions of the corresponding binary contents in the binary software of the second software set is smaller than a second preset variance threshold, the entropy of the positions of the corresponding binary contents in the binary software of the first software set is smaller than a first preset entropy threshold, the entropy of the positions of the corresponding binary contents in the binary software of the second software set is smaller than a second preset entropy threshold, and the ratio of the occurrence probability of the corresponding binary contents in the first software set to the occurrence probability of the corresponding binary contents in the second software set is larger than a preset ratio threshold;
the distance between the first appearance position of the binary content in each binary software of the first software set and the file header is lower than a first distance threshold value, and the distance between the first appearance position of the corresponding binary content in each binary software of the second software set and the file header is lower than a second distance threshold value;
The number of the binary contents in the second software set is higher than a second preset number threshold, the variance of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a third preset variance threshold, the variance of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a fourth preset variance threshold, and the entropy of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a third preset entropy threshold;
The number of the binary contents in the first software set is lower than a third preset number threshold, the variance of the positions of the corresponding binary contents in each binary software of the first software set is larger than a fifth preset variance threshold, and the entropy of the positions of the corresponding binary contents in each binary software of the first software set is larger than a fourth preset entropy threshold;
The number of printable characters of the binary content is greater than a fourth preset number threshold.
Here, the importance ranking rule (including content, ranking, etc.) may be adjusted according to actual situations, which is not limited by the embodiment of the present application.
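One way to apply such an ordered rule list is sketched below. It assumes that a piece of binary content takes the rank of the first (most important) rule it satisfies, and that content matching no rule is ranked last. The text does not spell out either point. Only three of the six rules are modelled, and the attribute names and threshold values are hypothetical.

```python
from typing import Callable, Dict, List

Attributes = Dict[str, float]
Rule = Callable[[Attributes], bool]

# Hypothetical rules, listed from highest to lowest importance.
RULES: List[Rule] = [
    # Many occurrences at the same position in the malicious set.
    lambda a: a["same_position_count_malicious"] > 10,
    # First occurrence close to the file header in both sets.
    lambda a: a["first_offset_benign"] < 64 and a["first_offset_malicious"] < 64,
    # Enough printable characters.
    lambda a: a["printable_chars"] > 4,
]


def importance_rank(attrs: Attributes, rules: List[Rule] = RULES) -> int:
    """Index of the first satisfied rule; len(rules) if none is satisfied."""
    for index, rule in enumerate(rules):
        if rule(attrs):
            return index
    return len(rules)


def sort_by_importance(candidates: Dict[bytes, Attributes]) -> List[bytes]:
    # sorted() is stable, so contents with equal rank keep their input order.
    return sorted(candidates, key=lambda c: importance_rank(candidates[c]))


far = {"first_offset_benign": 500, "first_offset_malicious": 900}
sample = {
    b"hello": {"same_position_count_malicious": 0, "printable_chars": 5, **far},
    b"\x05\x06": {"same_position_count_malicious": 0, "printable_chars": 0, **far},
    b"\x03\x04": {"same_position_count_malicious": 3, "printable_chars": 0,
                  "first_offset_benign": 16, "first_offset_malicious": 32},
    b"\x01\x02": {"same_position_count_malicious": 25, "printable_chars": 0, **far},
}
ranked = sort_by_importance(sample)
```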
S707: and obtaining at least one section of binary content from the candidate binary content as target binary content according to a second type of statistical index of each section of binary content in the first software set and a second type of statistical index in the second software set, wherein the second type of statistical index comprises at least one of software coverage proportion and set similarity.
S708: and determining the sorting result of the target binary contents according to the sorting result of the candidate binary contents.
S709: and deleting the binary contents with the sequence numbers after the preset sequence numbers from the target binary contents.
S710: and obtaining a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
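Steps S708 and S709 can be sketched as follows. The sketch assumes that sequence numbers start at 1 and are assigned within the target list after it inherits the candidate ordering. The function and parameter names are hypothetical.

```python
from typing import List


def top_ranked_targets(
    ranked_candidates: List[bytes],
    targets: List[bytes],
    preset_sequence_number: int,
) -> List[bytes]:
    target_set = set(targets)
    # S708: the targets inherit their relative order from the candidate ranking.
    ranked_targets = [c for c in ranked_candidates if c in target_set]
    # S709: delete the entries whose sequence number falls after the preset one.
    return ranked_targets[:preset_sequence_number]


ranked_candidates = [b"a", b"b", b"c", b"d", b"e"]
selected = top_ranked_targets(ranked_candidates, [b"e", b"b", b"d"], preset_sequence_number=2)
```

In this sample the targets are ordered `b"b"`, `b"d"`, `b"e"`, and only the first two are kept for signature generation in S710.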
According to this embodiment of the application, at least two pieces of binary content are extracted from the first software set and the second software set, and at least one piece of binary content is obtained from them as candidate binary content. Binary content whose information entropy value is higher than the preset entropy threshold is then deleted from the candidate binary content, which removes erroneous candidates and improves the accuracy of subsequent processing. Next, an importance ranking rule is determined according to the attribute information corresponding to each piece of binary content in the candidate binary content, and the candidate binary content is sorted by importance from high to low according to that rule. The target binary content is obtained from the candidate binary content, the sorting result of the target binary content is determined according to the sorting result of the candidate binary content, and the binary content whose sequence number falls after the preset sequence number is deleted from the target binary content, so that the signature of the corresponding malicious software is generated from binary content of higher importance and can be used to identify the malicious software. Meanwhile, part of the binary content of the malicious software is extracted automatically as a signature, without relying on manual analysis, which addresses the excessive workload of malware analysts and the low efficiency of signature extraction. In addition, because the signature extraction method provided by this embodiment is not affected by the personal experience and subjective factors of analysts, the accuracy of malware signature extraction is improved to a certain extent.
Fig. 8 is a schematic structural diagram of a file signature extraction device provided by the present application, where the device includes: an extraction module 801, a first obtaining module 802, a second obtaining module 803, and a third obtaining module 804.
The extracting module 801 is configured to extract at least two pieces of binary content from a first software set and a second software set, where the first software set contains a first number of non-malicious binary software and the second software set contains a second number of malicious binary software.
A first obtaining module 802, configured to obtain, as candidate binary contents, at least one piece of binary content from the at least two pieces of binary content according to a first type of statistical indicator in the first software set and a first type of statistical indicator in the second software set, where the first type of statistical indicator includes at least one of an occurrence frequency and an occurrence probability.
A second obtaining module 803, configured to obtain, as target binary content, at least one piece of binary content from the candidate binary content according to a second type of statistical indicator in the first software set and a second type of statistical indicator in the second software set, where the second type of statistical indicator includes at least one of a software coverage ratio and a set similarity.
A third obtaining module 804, configured to obtain a signature of the corresponding malicious binary software according to the target binary content, where the signature is used to identify the malicious binary software.
In one possible design, the extraction module 801 is specifically configured to:
And extracting binary contents with preset length from each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
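The sliding-window extraction can be sketched as follows. The window length corresponds to the preset length in bytes. The step of one byte is an assumption, since the text does not state how far the window advances, and the function name is hypothetical.

```python
from typing import Iterator


def sliding_window(blob: bytes, length: int, step: int = 1) -> Iterator[bytes]:
    """Yield every `length`-byte slice of `blob`, advancing `step` bytes each time."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    # A file shorter than the window yields nothing.
    for offset in range(0, len(blob) - length + 1, step):
        yield blob[offset:offset + length]


windows = list(sliding_window(b"\x01\x02\x03\x04\x05", length=3))
```

For the five-byte sample this produces three overlapping three-byte pieces of binary content.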
In one possible design, when the first type of statistical indicator includes the occurrence frequency, the first obtaining module 802 is specifically configured to:
Obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in a first software set from the at least two sections of binary contents as first binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the first binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
In one possible design, when the first type of statistical indicator includes the occurrence probability, the first obtaining module 802 is specifically configured to:
Obtaining binary contents with occurrence probability lower than a first preset probability threshold value in a first software set from the at least two sections of binary contents as second binary contents;
Obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining the binary contents with occurrence probability higher than a second preset probability threshold value in the second software set from the second binary contents as candidate binary contents, wherein the first preset probability threshold value is smaller than the second preset probability threshold value.
In one possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module 802 is specifically configured to:
according to the occurrence frequency of each binary content in the at least two binary contents in the first software set and the occurrence frequency in the second software set, at least one binary content is obtained from the at least two binary contents and used as a first binary content to be processed;
Obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first binary content to be processed in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
and obtaining candidate binary contents from the first binary contents to be processed according to the occurrence probability of each segment of binary contents in the first software set and the occurrence probability in the second software set.
A possible design, the first obtaining module 802 obtaining, as the first binary content to be processed, at least one piece of binary content from the at least two pieces of binary content according to the occurrence frequency of each piece of binary content in the first software set and the occurrence frequency in the second software set, respectively, includes:
Obtaining binary contents with the occurrence frequency lower than a third preset frequency threshold value in the first software set from the at least two sections of binary contents as third binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the third binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
and obtaining the binary content with the occurrence frequency higher than a fourth preset frequency threshold value in the second software set from the third binary content as the first binary content to be processed, wherein the third preset frequency threshold value is smaller than the fourth preset frequency threshold value.
A possible design, the first obtaining module 802 obtaining the candidate binary contents from the first binary contents to be processed according to the occurrence probability of each piece of binary content in the first binary contents to be processed in the first software set and the occurrence probability in the second software set, respectively, includes:
obtaining binary contents with occurrence probability lower than a third preset probability threshold value in the first software set from the first binary contents to be processed as fourth binary contents;
Obtaining the occurrence probability of each piece of binary content in the fourth binary content in the second software set from the occurrence probability of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with occurrence probability higher than a fourth preset probability threshold value in the second software set from the fourth binary contents as candidate binary contents, wherein the third preset probability threshold value is smaller than the fourth preset probability threshold value.
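The two-stage design above (a frequency filter followed by a probability filter) can be sketched as follows. This is an illustrative outline only: the statistics are passed in as precomputed dictionaries, and every name and threshold value is a hypothetical choice rather than something specified in the application.

```python
def two_threshold_filter(pieces, stat_first, stat_second, low_thr, high_thr):
    """Keep pieces whose statistic is below low_thr in the first set
    and above high_thr in the second set."""
    if not low_thr < high_thr:
        raise ValueError("low_thr must be smaller than high_thr")
    rare_in_first = [p for p in pieces if stat_first.get(p, 0) < low_thr]
    return [p for p in rare_in_first if stat_second.get(p, 0) > high_thr]


def cascade(pieces, freq1, freq2, prob1, prob2, freq_thr, prob_thr):
    # Stage 1: the frequency filter yields the "first binary content to be processed".
    to_process = two_threshold_filter(pieces, freq1, freq2, *freq_thr)
    # Stage 2: the probability filter on the survivors yields the candidates.
    return two_threshold_filter(to_process, prob1, prob2, *prob_thr)


# Hypothetical statistics for three pieces in the two software sets.
pieces = [b"A", b"B", b"C"]
freq1 = {b"A": 1, b"B": 0, b"C": 9}
freq2 = {b"A": 12, b"B": 11, b"C": 15}
prob1 = {b"A": 0.01, b"B": 0.0, b"C": 0.3}
prob2 = {b"A": 0.2, b"B": 0.04, b"C": 0.25}
result = cascade(pieces, freq1, freq2, prob1, prob2,
                 freq_thr=(3, 10), prob_thr=(0.02, 0.1))
print(result)
```

Because the second stage only looks at pieces that passed the first, the probability lookups are restricted to a smaller set; swapping the two stages gives the alternative ordering described in the next design.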
A possible design, when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the first obtaining module 802 is specifically configured to:
according to the occurrence probability of each binary content in the at least two binary contents in the first software set and the occurrence probability of each binary content in the second software set, at least one binary content is obtained from the at least two binary contents and used as second binary content to be processed;
Obtaining the frequency of occurrence of each piece of binary content in the second to-be-processed binary content in the first software set from the frequency of occurrence of each piece of binary content in the first software set, respectively, and obtaining the frequency of occurrence of each piece of binary content in the second to-be-processed binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set;
And obtaining candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each segment of binary contents in the second binary contents to be processed in the first software set and the occurrence frequency in the second software set.
A possible design, the first obtaining module 802 obtaining, as the second binary content to be processed, at least one piece of binary content from the at least two pieces of binary content according to the occurrence probability of each piece of binary content in the first software set and the occurrence probability in the second software set, respectively, includes:
Obtaining binary contents with occurrence probability lower than a fifth preset probability threshold value in the first software set from the at least two sections of binary contents as fifth binary contents;
Obtaining the occurrence probability of each piece of binary content in the fifth binary content in the second software set from the occurrence probability of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining binary contents with occurrence probability higher than a sixth preset probability threshold value in the second software set from the fifth binary contents as second binary contents to be processed, wherein the fifth preset probability threshold value is smaller than the sixth preset probability threshold value.
A possible design, the first obtaining module 802 obtains candidate binary contents from the second binary contents to be processed according to the occurrence frequency of each piece of binary contents in the second binary contents to be processed in the first software set and the occurrence frequency in the second software set, respectively, including:
Obtaining binary contents with the occurrence frequency lower than a fifth preset frequency threshold value in the first software set from the second binary contents to be processed as sixth binary contents;
Obtaining the frequency of occurrence of each piece of binary content in the sixth binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
and obtaining the binary contents with the occurrence frequency higher than a sixth preset frequency threshold value in the second software set from the sixth binary contents as candidate binary contents, wherein the fifth preset frequency threshold value is smaller than the sixth preset frequency threshold value.
A possible design, when the second type of statistical indicator includes the software coverage ratio, the second obtaining module 803 is specifically configured to:
Obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents;
Obtaining the software coverage proportion of each segment of binary content in the seventh binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the software coverage ratio higher than a second preset ratio threshold value in the second software set from the seventh binary contents as target binary contents, wherein the first preset ratio threshold value is smaller than the second preset ratio threshold value.
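A minimal sketch of the coverage-based selection follows. It assumes that the software coverage ratio of a piece in a set is the fraction of binaries in that set containing the piece at least once; this definition, the function names and the sample data are assumptions made for illustration.

```python
def coverage_ratio(piece, binaries):
    """Assumed definition: fraction of binaries containing `piece` at least once."""
    if not binaries:
        return 0.0
    return sum(piece in data for data in binaries) / len(binaries)


def select_targets(candidates, first_set, second_set, low_thr, high_thr):
    """Keep candidates covering few binaries of the first set and many of the second."""
    if not low_thr < high_thr:
        raise ValueError("low_thr must be smaller than high_thr")
    # "Seventh binary content": coverage in the first set below the first threshold.
    seventh = [c for c in candidates if coverage_ratio(c, first_set) < low_thr]
    # Target content: coverage in the second set above the second threshold.
    return [c for c in seventh if coverage_ratio(c, second_set) > high_thr]


# Hypothetical raw contents of non-malicious and malicious binaries.
benign = [b"\x7fELF\x00\x01", b"\x7fELF\x02\x03", b"MZ\x90\x00"]
malicious = [b"MZ\xde\xad\x01", b"\x7fELF\xde\xad", b"MZ\x90\x00\xbe\xef"]
candidates = [b"\xde\xad", b"MZ", b"\xbe\xef"]
targets = select_targets(candidates, benign, malicious, low_thr=0.2, high_thr=0.5)
print(targets)
```

Here a common header such as `MZ` is rejected because it also covers the non-malicious set, while a piece seen in only one malicious binary is rejected for insufficient coverage.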
A possible design, when the second type of statistical indicator includes the set similarity, the second obtaining module 803 is specifically configured to:
obtaining binary contents with set similarity lower than a first preset similarity threshold value in a first software set from the candidate binary contents as eighth binary contents;
Acquiring the set similarity, in the second software set, of each piece of binary content in the eighth binary content from the set similarity of each piece of binary content in the candidate binary content in the second software set;
And obtaining binary contents with set similarity higher than a second preset similarity threshold value in the second software set from the eighth binary contents as target binary contents, wherein the first preset similarity threshold value is smaller than the second preset similarity threshold value.
A possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module 803 is specifically configured to:
According to the software coverage proportion of each segment of binary content in the candidate binary content in the first software set and the software coverage proportion of the second software set, at least one segment of binary content is obtained from the candidate binary content and used as first characteristic binary content;
Acquiring the set similarity of each piece of binary content in the first characteristic binary content in the first software set from the set similarity of each piece of binary content in the candidate binary content in the first software set, and acquiring the set similarity of each piece of binary content in the first characteristic binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
and obtaining target binary contents from the first characteristic binary contents according to the set similarity of each piece of binary contents in the first characteristic binary contents in the first software set and the set similarity of each piece of binary contents in the second software set.
A possible design, the second obtaining module 803 obtaining, as the first characteristic binary content, at least one piece of binary content from the candidate binary content according to the software coverage ratio of each piece of binary content in the candidate binary content in the first software set and the software coverage ratio in the second software set, respectively, includes:
Obtaining binary contents with the software coverage proportion lower than a third preset proportion threshold value in the first software set from the candidate binary contents as ninth binary contents;
Obtaining the software coverage proportion of each segment of binary content in the ninth binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining binary contents with the software coverage proportion higher than a fourth preset proportion threshold value in the second software set from the ninth binary contents as the first characteristic binary contents, wherein the third preset proportion threshold value is smaller than the fourth preset proportion threshold value.
A possible design, the second obtaining module 803 obtaining the target binary content from the first characteristic binary content according to the set similarity of each piece of binary content in the first characteristic binary content in the first software set and the set similarity in the second software set, respectively, includes:
Obtaining binary contents with the set similarity lower than a third preset similarity threshold value in the first software set from the first characteristic binary contents as tenth binary contents;
Obtaining the set similarity of each segment of binary content in the tenth binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and obtaining binary contents with the set similarity higher than a fourth preset similarity threshold value in the second software set from the tenth binary contents as the target binary contents, wherein the third preset similarity threshold value is smaller than the fourth preset similarity threshold value.
A possible design, when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module 803 is specifically configured to:
according to the set similarity of each segment of binary content in the candidate binary content in the first software set and the set similarity of the second software set, at least one segment of binary content is obtained from the candidate binary content and used as second characteristic binary content;
Obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set from the software coverage proportion of each segment of binary content in the candidate binary content in the first software set, and obtaining the software coverage proportion of each segment of binary content in the second characteristic binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining the target binary content from the second characteristic binary content according to the software coverage proportion of each segment of binary content in the second characteristic binary content in the first software set and the software coverage proportion in the second software set respectively.
A possible design, the second obtaining module 803 obtaining, as the second characteristic binary content, at least one piece of binary content from the candidate binary content according to the set similarity of each piece of binary content in the candidate binary content in the first software set and the set similarity in the second software set, respectively, includes:
Obtaining binary contents with the set similarity lower than a fifth preset similarity threshold value in the first software set from the candidate binary contents as eleventh binary contents;
Obtaining the set similarity of each piece of binary content in the eleventh binary content in the second software set from the set similarity of each piece of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the set similarity higher than a sixth preset similarity threshold value in the second software set from the eleventh binary contents as the second characteristic binary contents, wherein the fifth preset similarity threshold value is smaller than the sixth preset similarity threshold value.
A possible design, the second obtaining module 803 obtaining the target binary content from the second characteristic binary content according to the software coverage ratio of each piece of binary content in the second characteristic binary content in the first software set and the software coverage ratio in the second software set, respectively, includes:
Obtaining binary contents with the software coverage proportion lower than a fifth preset proportion threshold value in the first software set from the second characteristic binary contents as twelfth binary contents;
Obtaining the software coverage proportion of each segment of binary content in the twelfth binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
And obtaining binary contents with the software coverage ratio higher than a sixth preset ratio threshold value in the second software set from the twelfth binary contents as the target binary contents, wherein the fifth preset ratio threshold value is smaller than the sixth preset ratio threshold value.
One possible design is that the second obtaining module 803 is further configured to, before obtaining at least one piece of binary content from the candidate binary content as the target binary content:
Determining the information entropy value corresponding to each segment of binary content in the candidate binary content according to the occurrence frequency, in the binary software of the second software set, of the context characters of that segment of binary content;
And deleting the binary contents with the information entropy value higher than a preset entropy value threshold value from the candidate binary contents.
One possible design is that if the candidate binary content includes at least two pieces of binary content, the first obtaining module 802 is further configured to, after obtaining at least two pieces of binary content from the binary content as the candidate binary content:
Determining an importance ordering rule according to attribute information corresponding to each piece of binary content in the candidate binary content;
According to the importance ranking rule, ranking at least two segments of binary contents in the candidate binary contents according to the importance from high to low;
after obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
Determining the sorting result of the target binary content according to the sorting result of the candidate binary content;
And deleting the binary contents with the sequence numbers after the preset sequence numbers from the target binary contents.
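The ordering and truncation steps above can be sketched as follows. The importance ordering rule is represented here by a single hypothetical scoring function (the count of printable ASCII bytes), which is only a stand-in for a rule derived from the attribute information; the names and the preset sequence number are illustrative assumptions.

```python
def printable_count(piece):
    """Hypothetical importance score: number of printable ASCII bytes in the piece."""
    return sum(32 <= b < 127 for b in piece)


def rank_candidates(candidates, importance):
    """Order the candidates from most to least important."""
    return sorted(candidates, key=importance, reverse=True)


def truncate_targets(targets, ranked_candidates, preset_number):
    """Order the targets by their candidate rank and delete those whose
    sequence number comes after preset_number."""
    rank = {c: i for i, c in enumerate(ranked_candidates)}
    ordered = sorted(targets, key=rank.__getitem__)
    return ordered[:preset_number]


# Hypothetical candidate pieces and the subset later selected as targets.
candidates = [b"\x00\x01ab", b"evil", b"\x00\x00\x00x"]
ranked = rank_candidates(candidates, printable_count)
targets = [b"\x00\x00\x00x", b"evil"]
kept = truncate_targets(targets, ranked, preset_number=1)
print(ranked, kept)
```

Ranking once on the candidates and reusing that order for the targets avoids recomputing the attribute information after the second selection step.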
A possible design, the attribute information includes at least one of the following: the number of the candidate binary contents in the first software set; the number of the candidate binary contents in the second software set; the occurrence probability of the candidate binary contents in the first software set; the occurrence probability of the candidate binary contents in the second software set; the mean, variance and entropy of the position where the candidate binary contents first appear in each binary software of the first software set; the mean, variance and entropy of the position where the candidate binary contents first appear in each binary software of the second software set; and the printable characters of the candidate binary contents.
One possible design, the importance ranking rule includes, in order from high importance to low importance:
The number of binary contents at the same position of the second software set is higher than a first preset number threshold;
The variance of the positions of the binary contents in the binary software of the first software set is smaller than a first preset variance threshold, the variance of the positions of the corresponding binary contents in the binary software of the second software set is smaller than a second preset variance threshold, the entropy of the positions of the corresponding binary contents in the binary software of the first software set is smaller than a first preset entropy threshold, the entropy of the positions of the corresponding binary contents in the binary software of the second software set is smaller than a second preset entropy threshold, and the ratio of the occurrence probability of the corresponding binary contents in the first software set to the occurrence probability of the corresponding binary contents in the second software set is larger than a preset ratio threshold;
The distance between the first appearance position of the binary content in each binary software of the first software set and the file header is lower than a first distance threshold value, and the distance between the first appearance position of the corresponding binary content in each binary software of the second software set and the file header is lower than a second distance threshold value;
The number of binary contents in the second software set is higher than a second preset number threshold, the variance of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a third preset variance threshold, the variance of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a fourth preset variance threshold, and the entropy of the positions of the corresponding binary contents in the respective binary software of the second software set is lower than a third preset entropy threshold;
The number of the binary contents in the first software set is lower than a third preset number threshold, the variance of the positions of the corresponding binary contents in each binary software of the first software set is larger than a fifth preset variance threshold, and the entropy of the positions of the corresponding binary contents in each binary software of the first software set is larger than a fourth preset entropy threshold;
The number of printable characters of the binary content is greater than a fourth preset number threshold.
One possible design, the context characters comprise preceding characters and following characters, and the information entropy value comprises a preceding-context entropy value and a following-context entropy value;
The second obtaining module 803 determining, according to the occurrence frequency of the context characters of each piece of binary content in the candidate binary content in the binary software of the second software set, the information entropy value corresponding to each piece of binary content in the candidate binary content includes:
Determining the preceding-context entropy value corresponding to each segment of binary content in the candidate binary content according to the proportion of the number of occurrences of each of its preceding characters in the binary software of the second software set to the number of occurrences of all of its preceding characters;
and determining the following-context entropy value corresponding to each segment of binary content in the candidate binary content according to the proportion of the number of occurrences of each of its following characters in the binary software of the second software set to the number of occurrences of all of its following characters.
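The entropy computation above can be sketched as a Shannon entropy over the bytes immediately before and immediately after each occurrence of a piece in the second software set. This is a sketch under stated assumptions: the context is taken to be one byte on each side, and a piece is deleted when either of its two entropy values exceeds the threshold; the application does not fix these details, and all names are hypothetical.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def context_entropies(piece, binaries):
    """Return (preceding, following) entropy of the bytes adjacent to `piece`."""
    before, after = Counter(), Counter()
    for data in binaries:
        start = data.find(piece)
        while start != -1:
            end = start + len(piece)
            if start > 0:
                before[data[start - 1]] += 1  # byte just before this occurrence
            if end < len(data):
                after[data[end]] += 1  # byte just after this occurrence
            start = data.find(piece, start + 1)
    return _entropy(before), _entropy(after)


def drop_high_entropy(candidates, binaries, threshold):
    """Delete candidates whose preceding or following entropy exceeds the threshold
    (combining the two values with max() is an assumption)."""
    return [c for c in candidates
            if max(context_entropies(c, binaries)) <= threshold]


# Hypothetical malicious binaries sharing the byte string b"SIG".
malicious = [b"\x01SIG\x02", b"\x01SIG\x03", b"\x01SIG\x04", b"\x01SIG\x05"]
print(context_entropies(b"SIG", malicious))
print(drop_high_entropy([b"SIG", b"\x01S"], malicious, threshold=1.0))
```

A piece whose neighbouring bytes vary widely between binaries gets a high entropy value and is removed, whereas a piece embedded in a stable context is kept.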
The device of the present embodiment may be correspondingly used to execute the technical solution in the embodiment shown in the foregoing method, and its implementation principle, implementation details and technical effects are similar, and are not repeated herein.
Optionally, FIG. 9 schematically shows one possible basic hardware architecture of the computing device of the present application.
With reference to fig. 9, a computing device 900 includes a processor 901, memory 902, a communication interface 903, and a bus 904.
The computing device 900 may be a computer or a server; the application is not particularly limited in this regard. The number of processors 901 in the computing device 900 may be one or more, and only one processor 901 is illustrated in FIG. 9. Optionally, the processor 901 may be a central processing unit (CPU). If the computing device 900 has multiple processors 901, the types of the multiple processors 901 may be different or may be the same. Optionally, the multiple processors 901 of the computing device 900 may also be integrated as a multi-core processor.
Memory 902 stores computer instructions and data; the memory 902 may store the computer instructions and data necessary to implement the above-described file signature extraction method provided by the present application, for example, the memory 902 stores instructions for implementing the steps of the above-described file signature extraction method. The memory 902 may be any one or any combination of the following storage media: non-volatile memory (e.g., read-only memory (ROM), solid-state drive (SSD), hard disk drive (HDD), optical disc) and volatile memory.
The communication interface 903 may be any one or any combination of the following devices having a network access function: a network interface (e.g., an Ethernet interface), a wireless network card, etc.
The communication interface 903 is used for data communication between the computing device 900 and other computing devices or terminals.
FIG. 9 shows the bus 904 with a thick line. The bus 904 may connect the processor 901 with the memory 902 and the communication interface 903. Thus, through the bus 904, the processor 901 may access the memory 902 and may also interact with other computing devices or terminals using the communication interface 903.
In the present application, the computing device 900 executes the computer instructions in the memory 902, so that the computing device 900 implements the above-described file signature extraction method provided by the present application, or so that the computing device 900 deploys the above-described file signature extraction apparatus.
In addition to being implemented in software as in FIG. 9, the file signature extraction device may also be implemented in hardware, as a hardware module or as a circuit unit.
The present application provides a computer-readable storage medium comprising computer instructions for instructing a computing device to execute the above-described file signature extraction method provided by the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

Claims (20)

1. A method for extracting a file signature, comprising:
Extracting at least two pieces of binary content from a first set of software containing a first number of non-malicious binary software and a second set of software containing a second number of malicious binary software;
according to a first type of statistical index of each piece of binary content in the first software set and the first type of statistical index in the second software set, at least one piece of binary content is obtained from the at least two pieces of binary content and used as candidate binary content, wherein the first type of statistical index of the candidate binary content in the first software set is lower than a first preset threshold value and the first type of statistical index in the second software set is higher than a second preset threshold value, and the first type of statistical index in the first software set comprises at least one of occurrence frequency and occurrence probability;
Obtaining at least one piece of binary content from the candidate binary content as target binary content according to a second type statistical index of each piece of binary content in the first software set and the second type statistical index in the second software set, wherein the second type statistical index of the target binary content in the first software set is lower than a third preset threshold and the second type statistical index in the second software set is higher than a fourth preset threshold;
and obtaining a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used for identifying the malicious binary software.
2. The method of claim 1, wherein the extracting at least two pieces of binary content from the first set of software and the second set of software comprises:
And extracting binary contents with preset length from each binary software in the first software set and the second software set by adopting a sliding window, wherein the preset length is the byte amount covered by the sliding window.
3. The method of claim 1, wherein when the first type of statistical indicator includes the occurrence frequency, the obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator of each piece of binary content in the first software set and the first type of statistical indicator in the second software set, respectively, includes:
Obtaining binary contents with the occurrence frequency lower than a first preset frequency threshold value in the first software set from the at least two sections of binary contents as first binary contents;
obtaining the frequency of occurrence of each piece of binary content in the first binary content in the second software set from the frequency of occurrence of each piece of binary content in the second software set in the at least two pieces of binary content;
And obtaining the binary contents with the occurrence frequency higher than a second preset frequency threshold value in the second software set from the first binary contents as the candidate binary contents, wherein the first preset frequency threshold value is smaller than the second preset frequency threshold value.
4. The method of claim 1, wherein when the first type of statistical indicator includes the occurrence probability, obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to the first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set, respectively, including:
obtaining binary contents with the occurrence probability lower than a first preset probability threshold value in the first software set from the at least two sections of binary contents as second binary contents;
Obtaining the occurrence probability of each piece of binary content in the second binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set respectively;
And obtaining the binary contents with the occurrence probability higher than a second preset probability threshold value in the second software set from the second binary contents as the candidate binary contents, wherein the first preset probability threshold value is smaller than the second preset probability threshold value.
5. The method according to claim 1, wherein when the first type of statistical indicator includes the occurrence frequency and the occurrence probability, the obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content according to a first type of statistical indicator in the first software set and the first type of statistical indicator in the second software set, respectively, includes:
According to the occurrence frequency of each binary content in the first software set and the occurrence frequency in the second software set, at least one binary content is obtained from the at least two binary contents and used as a first binary content to be processed;
Obtaining the occurrence probability of each piece of binary content in the first to-be-processed binary content in the first software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the first software set, and obtaining the occurrence probability of each piece of binary content in the first to-be-processed binary content in the second software set from the occurrence probability of each piece of binary content in the at least two pieces of binary content in the second software set;
And obtaining the candidate binary contents from the first binary contents to be processed according to the occurrence probability of each segment of binary contents in the first software set and the occurrence probability in the second software set respectively.
6. The method according to claim 1, wherein when the second type of statistical indicator includes the software coverage ratio, the obtaining, as target binary content, at least one piece of binary content from the candidate binary content according to the second type of statistical indicator in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining binary contents with the software coverage proportion lower than a first preset proportion threshold value in the first software set from the candidate binary contents as seventh binary contents;
obtaining the software coverage proportion of each segment of binary content in the seventh binary content in the second software set from the software coverage proportion of each segment of binary content in the candidate binary content in the second software set;
and obtaining binary contents with the software coverage proportion higher than a second preset proportion threshold value in the second software set from the seventh binary contents as the target binary contents, wherein the first preset proportion threshold value is smaller than the second preset proportion threshold value.
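By way of non-limiting illustration, the coverage-based screening recited above may be sketched in Python as follows. The sketch assumes that the software coverage proportion of a piece of binary content in a set is the fraction of binary software in that set containing the content at least once; the claims themselves do not fix this definition. The names `coverage` and `select_targets` and the sample values are hypothetical.

```python
from typing import List, Sequence


def coverage(content: bytes, software_set: Sequence[bytes]) -> float:
    """Assumed definition: fraction of binaries in the set that contain `content`."""
    if not software_set:
        return 0.0
    return sum(content in binary for binary in software_set) / len(software_set)


def select_targets(candidates: Sequence[bytes],
                   benign_set: Sequence[bytes],
                   malicious_set: Sequence[bytes],
                   first_ratio: float,
                   second_ratio: float) -> List[bytes]:
    """Keep candidates covering few benign binaries and many malicious ones."""
    # "Seventh binary content": low coverage in the first (non-malicious) set.
    seventh = [c for c in candidates if coverage(c, benign_set) < first_ratio]
    # Target binary content: high coverage in the second (malicious) set.
    return [c for c in seventh if coverage(c, malicious_set) > second_ratio]


benign_set = [b"hello world", b"goodbye world"]
malicious_set = [b"xxEVILxx world", b"EVIL world", b"other world"]
targets = select_targets([b"EVIL", b"world", b"other"],
                         benign_set, malicious_set,
                         first_ratio=0.2, second_ratio=0.5)
print(targets)  # [b'EVIL']
```

Unlike the occurrence probability, this indicator counts each binary software at most once, so a content repeated many times inside a single malicious sample does not by itself pass the screening.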
7. The method according to claim 1, wherein when the second type of statistical indicator includes the set similarity, the obtaining, as target binary content, at least one piece of binary content from the candidate binary content according to the second type of statistical indicator in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining binary contents with the set similarity lower than a first preset similarity threshold value in the first software set from the candidate binary contents, and taking the binary contents as eighth binary contents;
obtaining the set similarity of each segment of binary content in the eighth binary content in the second software set from the set similarity of each segment of binary content in the candidate binary content in the second software set;
and obtaining binary contents with the set similarity higher than a second preset similarity threshold value in the second software set from the eighth binary contents as the target binary contents, wherein the first preset similarity threshold value is smaller than the second preset similarity threshold value.
8. The method according to claim 1, wherein when the second type of statistical indicator includes the software coverage ratio and the set similarity, the obtaining at least one piece of binary content from the candidate binary content as a target binary content according to the second type of statistical indicator in the first software set and the second type of statistical indicator in the second software set, respectively, includes:
obtaining, according to the software coverage proportion of each piece of binary content in the candidate binary content in the first software set and in the second software set, respectively, at least one piece of binary content from the candidate binary content as first feature binary content;
obtaining the set similarity, in the first software set, of each piece of binary content in the first feature binary content from the set similarities, in the first software set, of the pieces of binary content in the candidate binary content, and obtaining the set similarity, in the second software set, of each piece of binary content in the first feature binary content from the set similarities, in the second software set, of the pieces of binary content in the candidate binary content; and
obtaining the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and in the second software set, respectively.
9. The method according to any one of claims 1 to 8, wherein before the obtaining at least one piece of binary content from the candidate binary content as target binary content, the method further comprises:
determining an information entropy value corresponding to each piece of binary content in the candidate binary content according to the occurrence frequency of context characters in the binary software of the second software set; and
deleting, from the candidate binary content, the binary content whose information entropy value is higher than a preset entropy threshold.
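By way of non-limiting illustration, the entropy-based pruning recited above may be sketched in Python as follows. The claim does not specify what a "context character" is; this sketch assumes it is the single byte immediately following each occurrence of the content in the malicious binaries, and computes the Shannon entropy (in bits) of the distribution of those bytes. The names `context_entropy` and `drop_high_entropy`, the sample data, and the threshold are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def context_entropy(content: bytes, malicious_set: Sequence[bytes]) -> float:
    """Shannon entropy of the byte that follows `content` across the malicious set."""
    following: Counter = Counter()
    for binary in malicious_set:
        start = binary.find(content)
        while start != -1:
            nxt = start + len(content)
            if nxt < len(binary):
                following[binary[nxt]] += 1
            start = binary.find(content, start + 1)
    total = sum(following.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in following.values())


def drop_high_entropy(candidates: Sequence[bytes],
                      malicious_set: Sequence[bytes],
                      entropy_threshold: float) -> List[bytes]:
    """Delete candidates whose context entropy exceeds the preset threshold."""
    return [c for c in candidates
            if context_entropy(c, malicious_set) <= entropy_threshold]


samples = [b"EVIL1 EVIL1", b"EVIL1", b"keyA keyB keyC keyD"]
kept = drop_high_entropy([b"EVIL", b"key"], samples, entropy_threshold=1.0)
print(kept)  # [b'EVIL']
```

Here `b"EVIL"` is always followed by the same byte, giving an entropy of 0 bits, whereas `b"key"` is followed by four different bytes with equal frequency, giving 2 bits, so only the latter is deleted.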
10. The method according to any one of claims 1 to 8, wherein if the candidate binary content includes at least two pieces of binary content, after the obtaining at least one piece of binary content from the at least two pieces of binary content as candidate binary content, the method further comprises:
determining an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: the number of the candidate binary contents in the first software set, the number of the candidate binary contents in the second software set, the occurrence probability of the candidate binary contents in the first software set, the occurrence probability of the candidate binary contents in the second software set, the position of the first occurrence of the candidate binary contents in each binary software of the first software set, the mean, variance and entropy of the position of the first occurrence of the candidate binary contents in each binary software of the second software set, and the printable characters of the candidate binary contents;
ranking, according to the importance ranking rule, the at least two pieces of binary content in the candidate binary content in descending order of importance;
and after the obtaining at least one piece of binary content from the candidate binary content as the target binary content, the method further comprises:
determining the ranking result of the target binary content according to the ranking result of the candidate binary content; and
deleting, from the target binary content, the binary content whose sequence number in the ranking falls after a preset sequence number.
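By way of non-limiting illustration, the ranking and truncation recited above may be sketched in Python as follows. The claim leaves the importance ranking rule open; this sketch assumes one possible rule that uses only two of the listed attributes, ranking higher occurrence probability in the second software set first and breaking ties by lower occurrence probability in the first software set. The names `rank_candidates` and `truncate_targets` and all sample values are hypothetical.

```python
from typing import Dict, List, Sequence


def rank_candidates(candidates: Sequence[bytes],
                    malicious_prob: Dict[bytes, float],
                    benign_prob: Dict[bytes, float]) -> List[bytes]:
    """Assumed rule: most probable in the malicious set first, then least probable in benign."""
    return sorted(candidates,
                  key=lambda c: (-malicious_prob.get(c, 0.0), benign_prob.get(c, 0.0)))


def truncate_targets(targets: Sequence[bytes],
                     ranked_candidates: Sequence[bytes],
                     preset_rank: int) -> List[bytes]:
    """Order targets by their candidate rank and delete those after `preset_rank`."""
    position = {content: i for i, content in enumerate(ranked_candidates)}
    ordered = sorted(targets, key=position.__getitem__)
    return ordered[:preset_rank]


malicious_prob = {b"a": 0.2, b"b": 0.5, b"c": 0.5}
benign_prob = {b"a": 0.0, b"b": 0.1, b"c": 0.0}
ranked = rank_candidates([b"a", b"b", b"c"], malicious_prob, benign_prob)
final = truncate_targets([b"a", b"c"], ranked, preset_rank=1)
print(ranked)  # [b'c', b'b', b'a']
print(final)   # [b'c']
```

The target binary content inherits its order from the candidate ranking, so no attribute needs to be recomputed after the second screening.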
11. A file signature extraction apparatus, comprising:
an extraction module, configured to extract at least two pieces of binary content from a first software set containing a first number of non-malicious binary software and a second software set containing a second number of malicious binary software;
a first obtaining module, configured to obtain, according to a first type of statistical indicator of each piece of binary content in the first software set and the first type of statistical indicator in the second software set, respectively, at least one piece of binary content from the at least two pieces of binary content as candidate binary content, wherein the first type of statistical indicator of the candidate binary content in the first software set is lower than a first preset threshold and the first type of statistical indicator in the second software set is higher than a second preset threshold, and the first type of statistical indicator includes at least one of occurrence frequency and occurrence probability;
a second obtaining module, configured to obtain, according to a second type of statistical indicator in the first software set and the second type of statistical indicator in the second software set, respectively, at least one piece of binary content from the candidate binary content as target binary content, wherein the second type of statistical indicator includes at least one of a software coverage proportion and a set similarity, and the second type of statistical indicator of the target binary content in the first software set is lower than a third preset threshold and the second type of statistical indicator in the second software set is higher than a fourth preset threshold; and
a third obtaining module, configured to obtain a signature of the corresponding malicious binary software according to the target binary content, wherein the signature is used to identify the malicious binary software.
12. The apparatus according to claim 11, wherein the extraction module is specifically configured to:
extract, using a sliding window, binary content of a preset length from each binary software in the first software set and the second software set, wherein the preset length is the number of bytes covered by the sliding window.
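By way of non-limiting illustration, the sliding-window extraction recited above may be sketched in Python as follows. The claim fixes only the window length; the step of one byte, the function name `sliding_windows`, and the sample bytes are assumptions made for this sketch.

```python
from typing import Iterator


def sliding_windows(data: bytes, length: int, step: int = 1) -> Iterator[bytes]:
    """Yield every `length`-byte slice of `data`, advancing `step` bytes each time."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    for start in range(0, len(data) - length + 1, step):
        yield data[start:start + length]


sample = bytes.fromhex("4d5a90000300")  # six hypothetical bytes of a binary
windows = list(sliding_windows(sample, length=4))
print([w.hex() for w in windows])  # ['4d5a9000', '5a900003', '90000300']
```

A binary shorter than the preset length yields no binary content at all, which a caller may wish to handle explicitly.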
13. The apparatus of claim 11, wherein when the first type of statistical indicator includes the frequency of occurrence, the first obtaining module is specifically configured to:
obtain, from the at least two pieces of binary content, the binary content whose occurrence frequency in the first software set is lower than a first preset frequency threshold as first binary content;
obtain the occurrence frequency, in the second software set, of each piece of binary content in the first binary content from the occurrence frequencies, in the second software set, of the at least two pieces of binary content; and
obtain, from the first binary content, the binary content whose occurrence frequency in the second software set is higher than a second preset frequency threshold as the candidate binary content, wherein the first preset frequency threshold is smaller than the second preset frequency threshold.
14. The apparatus of claim 11, wherein when the first type of statistical indicator includes the occurrence probability, the first obtaining module is specifically configured to:
obtain, from the at least two pieces of binary content, the binary content whose occurrence probability in the first software set is lower than a first preset probability threshold as second binary content;
obtain the occurrence probability, in the second software set, of each piece of binary content in the second binary content from the occurrence probabilities, in the second software set, of the at least two pieces of binary content; and
obtain, from the second binary content, the binary content whose occurrence probability in the second software set is higher than a second preset probability threshold as the candidate binary content, wherein the first preset probability threshold is smaller than the second preset probability threshold.
15. The apparatus of claim 11, wherein when the first type of statistical indicator includes the frequency of occurrence and the probability of occurrence, the first obtaining module is specifically configured to:
obtain, according to the occurrence frequency of each piece of binary content in the first software set and in the second software set, respectively, at least one piece of binary content from the at least two pieces of binary content as first to-be-processed binary content;
obtain the occurrence probability, in the first software set, of each piece of binary content in the first to-be-processed binary content from the occurrence probabilities, in the first software set, of the at least two pieces of binary content, and obtain the occurrence probability, in the second software set, of each piece of binary content in the first to-be-processed binary content from the occurrence probabilities, in the second software set, of the at least two pieces of binary content; and
obtain the candidate binary content from the first to-be-processed binary content according to the occurrence probability of each piece of binary content in the first to-be-processed binary content in the first software set and in the second software set, respectively.
16. The apparatus of claim 11, wherein when the second type of statistical indicator includes the software coverage ratio and the set similarity, the second obtaining module is specifically configured to:
obtain, according to the software coverage proportion of each piece of binary content in the candidate binary content in the first software set and in the second software set, respectively, at least one piece of binary content from the candidate binary content as first feature binary content;
obtain the set similarity, in the first software set, of each piece of binary content in the first feature binary content from the set similarities, in the first software set, of the pieces of binary content in the candidate binary content, and obtain the set similarity, in the second software set, of each piece of binary content in the first feature binary content from the set similarities, in the second software set, of the pieces of binary content in the candidate binary content; and
obtain the target binary content from the first feature binary content according to the set similarity of each piece of binary content in the first feature binary content in the first software set and in the second software set, respectively.
17. The apparatus according to any one of claims 11 to 16, wherein before obtaining at least one piece of binary content from the candidate binary content as target binary content, the second obtaining module is further configured to:
determine an information entropy value corresponding to each piece of binary content in the candidate binary content according to the occurrence frequency of context characters in the binary software of the second software set; and
delete, from the candidate binary content, the binary content whose information entropy value is higher than a preset entropy threshold.
18. The apparatus according to any one of claims 11 to 16, wherein if the candidate binary content includes at least two pieces of binary content, the first obtaining module is further configured to, after obtaining the at least one piece of binary content from the at least two pieces of binary content as the candidate binary content:
determine an importance ranking rule according to attribute information corresponding to each piece of binary content in the candidate binary content, wherein the attribute information comprises at least one of the following: the number of the candidate binary contents in the first software set, the number of the candidate binary contents in the second software set, the occurrence probability of the candidate binary contents in the first software set, the occurrence probability of the candidate binary contents in the second software set, the position of the first occurrence of the candidate binary contents in each binary software of the first software set, the mean, variance and entropy of the position of the first occurrence of the candidate binary contents in each binary software of the second software set, and the printable characters of the candidate binary contents; and
rank, according to the importance ranking rule, the at least two pieces of binary content in the candidate binary content in descending order of importance;
and the second obtaining module is further configured to, after obtaining at least one piece of binary content from the candidate binary content as the target binary content:
determine the ranking result of the target binary content according to the ranking result of the candidate binary content; and
delete, from the target binary content, the binary content whose sequence number in the ranking falls after a preset sequence number.
19. A computing device, comprising:
a processor and a memory, wherein
the memory is configured to store computer instructions; and
the processor is configured to execute the computer instructions stored in the memory, to cause the computing device to perform the method of any one of claims 1 to 10.
20. A computer program product, characterized in that it comprises computer instructions that instruct a computing device to perform the method of any one of claims 1 to 10.
CN201911295341.5A 2019-12-16 2019-12-16 File signature extraction method and device Active CN112989432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911295341.5A CN112989432B (en) 2019-12-16 2019-12-16 File signature extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911295341.5A CN112989432B (en) 2019-12-16 2019-12-16 File signature extraction method and device

Publications (2)

Publication Number Publication Date
CN112989432A (en) 2021-06-18
CN112989432B (en) 2024-06-18

Family

ID=76343383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911295341.5A Active CN112989432B (en) 2019-12-16 2019-12-16 File signature extraction method and device

Country Status (1)

Country Link
CN (1) CN112989432B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376262A (en) * 2014-12-08 2015-02-25 中国科学院深圳先进技术研究院 Android malware detecting method based on Dalvik command and authority combination

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145780B (en) * 2017-03-31 2021-07-27 腾讯科技(深圳)有限公司 Malicious software detection method and device
CN107222511B (en) * 2017-07-25 2021-08-13 深信服科技股份有限公司 Malicious software detection method and device, computer device and readable storage medium

Also Published As

Publication number Publication date
CN112989432A (en) 2021-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant