CN106778241B

CN106778241B - Malicious file identification method and device

Info

Publication number: CN106778241B
Application number: CN201611067380.6A
Authority: CN
Inventors: 杜强
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2016-11-28
Filing date: 2016-11-28
Publication date: 2020-12-25
Anticipated expiration: 2036-11-28
Also published as: CN106778241A

Abstract

The invention discloses a method and a device for identifying malicious files, which relate to the technical field of computer security and are used for improving the identification precision of the malicious files, and the main technical scheme of the invention is as follows: acquiring a dynamic characteristic vector and a static characteristic vector of a target file; inputting the dynamic characteristic vector and the static characteristic vector of the target file into a preset classifier, and calculating the file content malicious probability of the target file; and identifying whether the target file is a malicious file or not according to the file content malicious probability of the target file and the file source malicious probability of the target file, wherein the file source malicious probability of the target file is determined according to the source information of the target file. The method and the device are mainly used for identifying the malicious files.

Description

Malicious file identification method and device

Technical Field

The invention relates to the technical field of computer security, in particular to a method and a device for identifying malicious files.

Background

With the continuous development of computer and internet technology, the number of malicious files has grown explosively, and their means of attack and disguise have become increasingly diverse and complex. In addition, the underground industry chain of computer crime keeps maturing, and its degree of industrialization and its scale keep growing, so defending against malicious files is currently a very challenging subject.

At present, malicious code is mainly identified through static monitoring technology or dynamic monitoring technology. Static monitoring preprocesses a target file and then performs signature matching, that is, matching against a virus library. Dynamic monitoring identifies a target file mainly by certain behavior characteristics, such as modifying a specific registry entry or opening a specific port.

However, static monitoring techniques lack the ability to detect variants of malicious files and new malicious files, and dynamic monitoring techniques lack the ability to identify new malware that does not exhibit obvious behavioral characteristics. Because existing methods are limited to a single static monitoring technology or a single dynamic monitoring technology, malicious files can easily hide by adopting general evasion techniques, and the identification accuracy for malicious files is low.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for identifying malicious files, and mainly aims to improve the identification accuracy of malicious files.

According to an aspect of the present invention, there is provided a method for identifying a malicious file, including:

acquiring a dynamic characteristic vector and a static characteristic vector of a target file;

inputting the dynamic characteristic vector and the static characteristic vector of the target file into a preset classifier, and calculating the file content malicious probability of the target file;

and identifying whether the target file is a malicious file or not according to the file content malicious probability of the target file and the file source malicious probability of the target file, wherein the file source malicious probability of the target file is determined according to the source information of the target file.

Further, before determining whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the method further includes:

acquiring source information of the target file;

and determining the file source malicious probability of the target file by matching the source information of the target file with malicious source data in a preset malicious source library.

Specifically, the acquiring of the dynamic characteristics of the target file includes:

putting the target file into a network sandbox system for execution to obtain a behavior log of the target file; the network sandbox system is composed of a virtual switching network composed of a group of virtual machines;

and acquiring the dynamic characteristic vector of the target file from the behavior log.

Further, the method further comprises:

training the preset classifier with malicious file samples and added noise.

Specifically, the training of the preset classifier with the malicious file samples and the added noise includes:

putting the malicious file sample and first noise added according to a preset noise knowledge base into a network sandbox system for execution to obtain a behavior log of the malicious file sample;

acquiring a dynamic characteristic vector of the malicious file sample through a behavior log of the malicious file sample and second noise added according to a preset noise knowledge base;

acquiring a static feature vector of the malicious file sample through the malicious file sample and third noise added according to a preset noise knowledge base;

acquiring a noise characteristic vector of the malicious file sample according to the static characteristic vector and the dynamic characteristic vector of the malicious file sample;

and training the preset classifier through the feature vector corresponding to the noise-free malicious file sample and the noise feature vector corresponding to the noise-added malicious file sample.

Specifically, the determining whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file includes:

substituting into a Bayesian formula the probability that the preset classifier outputs the file content malicious probability when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file, to calculate the malicious file probability of the target file;

and determining whether the target file is a malicious file according to the probability of the malicious file of the target file.

According to another aspect of the present invention, there is provided an apparatus for identifying a malicious file, including:

the acquiring unit is used for acquiring a dynamic characteristic vector and a static characteristic vector of the target file;

the calculation unit is used for inputting the dynamic characteristic vector and the static characteristic vector of the target file into a preset classifier and calculating the file content malicious probability of the target file;

and the identification unit is used for identifying whether the target file is a malicious file or not according to the file content malicious probability of the target file and the file source malicious probability of the target file, wherein the file source malicious probability of the target file is determined according to the source information of the target file.

Further, the apparatus further comprises:

the acquisition unit is further configured to acquire source information of the target file;

and the determining unit is used for determining the file source malicious probability of the target file by matching the source information of the target file with malicious source data in a preset malicious source library.

Specifically, the acquiring unit includes:

the execution module is used for putting the target file into a network sandbox system for execution to obtain a behavior log of the target file; the network sandbox system is composed of a virtual switching network composed of a group of virtual machines;

and the acquisition module is used for acquiring the dynamic characteristic vector of the target file from the behavior log.

Further, the apparatus further comprises:

and the training unit is used for training the preset classifier with malicious file samples and added noise.

Specifically, the training unit includes:

the acquisition module is used for putting the malicious file sample and first noise added according to a preset noise knowledge base into a network sandbox system for execution to obtain a behavior log of the malicious file sample;

the acquisition module is used for acquiring a dynamic feature vector of the malicious file sample through a behavior log of the malicious file sample and second noise added according to a preset noise knowledge base;

the acquisition module is used for acquiring a static feature vector of the malicious file sample through the malicious file sample and third noise added according to a preset noise knowledge base;

the acquisition module is used for acquiring the noise characteristic vector of the malicious file sample according to the static characteristic vector and the dynamic characteristic vector of the malicious file sample;

and the training module is used for training the preset classifier through the feature vector corresponding to the noise-free malicious file sample and the noise feature vector corresponding to the noise-added malicious file sample.

Specifically, the determining unit includes:

the calculation module is used for substituting into a Bayesian formula the probability that the preset classifier outputs the file content malicious probability when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file, to calculate the malicious file probability of the target file;

and the determining module is used for determining whether the target file is a malicious file according to the malicious file probability of the target file.

By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:

the method and the device for identifying malicious files provided by the embodiment of the invention first obtain the dynamic feature vector and the static feature vector of the target file, then input both vectors into a preset classifier to calculate the file content malicious probability of the target file, and finally identify whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file. Compared with current approaches, which identify malicious code mainly through a static monitoring technology or a dynamic monitoring technology alone, the embodiment of the invention identifies the target file through a deep learning method that combines the dynamic information, the static information, and the environmental information of the target file. This addresses two problems: obfuscated and packed malicious files easily escape static inspection, and malicious files with long latency and harsh trigger conditions easily escape dynamic analysis. The embodiment of the invention therefore improves the capability of identifying malicious files.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart illustrating a method for identifying a malicious file according to an embodiment of the present invention;

FIG. 2 illustrates a schematic diagram of the noise addition provided by an embodiment of the present invention;

FIG. 3 is a diagram illustrating overall identification of malicious files provided by an embodiment of the present invention;

FIG. 4 shows a block diagram of a malicious file identification apparatus according to an embodiment of the present invention;

FIG. 5 is a block diagram illustrating another malicious file identification apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an embodiment of the present application, a method for identifying malicious files is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.

In order to provide an implementation scheme for improving the identification accuracy of malicious files, embodiments of the present invention provide a method and an apparatus for identifying malicious files, and a preferred embodiment of the present invention is described below with reference to drawings of the specification.

An embodiment of the present invention provides a method for identifying a malicious file, as shown in fig. 1, the method includes:

101. Acquire the dynamic feature vector and the static feature vector of the target file.

In step 101, a dynamic feature vector and a static feature vector are obtained from the target file in binary form. Analysis of the target file is divided into a dynamic part and a static part. Dynamic analysis uses the virtual execution capability of the sandbox system to analyze the runtime behavior of the target file and derives the dynamic feature vector from the analysis result. Static analysis extracts features directly from the binary data of the target file and derives the static feature vector from those features.

For the embodiment of the invention, obtaining both the dynamic feature vector and the static feature vector means that dynamic analysis and static analysis are applied to the target file at the same time, mainly because the two methods are complementary: some obfuscated and packed malicious files easily escape static inspection, while some malicious files with long latency and harsh trigger conditions easily escape dynamic analysis. In practice, combining the two methods gives a better inspection result.

It should be noted that the sandbox system is a system with virtual execution capability that can collect and analyze relevant behavior information. A sandbox system is generally implemented on a set of managed virtual machines: the target file is automatically imported into the virtual machine environment and executed, or opened by a corresponding program (such as an Office program), and an information collection agent running inside the virtual machine records and outputs the runtime behavior of the target program. In the embodiment of the invention, after the target file is input into the sandbox system, the behavior record of the target file is output, generally in the form of log information. The log information recorded and output by the embodiment of the invention includes but is not limited to: network access information, information on calls to other applications, file system access information, system registry access information, system call information, and all memory access information of the virtual machines; the embodiments of the present invention are not specifically limited in this respect.

After the log corresponding to the target file is acquired through the sandbox system, the acquired log is converted into a dynamic feature vector that can be used for machine learning. In the embodiment of the invention, the conversion of the log into the dynamic feature vector is divided into three parts: log normalization, feature extraction, and dimension reduction.

The main functions of log normalization are to remove special symbols from the log, convert uppercase characters to lowercase, replace timestamps with a uniform placeholder, and replace numbers with a uniform placeholder. The feature extraction part extracts features from the log using different methods and combinations of them, including but not limited to: extraction of statistical information from timestamps; sequence token extraction over the log text (N-gram); and algorithms based on term frequency or term importance (TF, TF-IDF). The embodiments of the present invention are not specifically limited in this respect.
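The normalization and N-gram steps described above can be illustrated with a minimal Python sketch. The regular expressions, the `TS` and `NUM` placeholders, and the function names are assumptions chosen for illustration; the patent does not specify a log format or concrete rules.

```python
import re
from collections import Counter

# Hypothetical patterns: a real sandbox log format would need its own rules.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ t]\d{2}:\d{2}:\d{2}")
NUMBER = re.compile(r"\d+")
SPECIAL = re.compile(r"[^a-zA-Z0-9\s]")


def normalize(line):
    """Lowercase the line, unify timestamps and numbers, drop special symbols."""
    line = line.lower()
    # Uppercase placeholders cannot collide with the lowercased log text.
    line = TIMESTAMP.sub(" TS ", line)
    line = NUMBER.sub(" NUM ", line)
    line = SPECIAL.sub(" ", line)
    return " ".join(line.split())


def ngrams(tokens, n=2):
    """Return all contiguous token n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(lines, n=2):
    """Relative frequency (TF) of each n-gram over the normalized log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(normalize(line).split(), n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

Because numbers and timestamps collapse to placeholders, two log lines that differ only in a port number or a process ID yield the same N-grams, which keeps the feature space small before dimension reduction.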

It should be noted that the purpose of dimension reduction is to reduce a feature vector of higher dimension to a lower dimension, so as to improve the computational efficiency of the subsequent machine learning algorithm and save storage space. The dimension reduction methods that can be adopted by the embodiment of the invention include but are not limited to: the PCA (Principal Component Analysis) algorithm; the LDA (topic model) algorithm; the LLE (Locally Linear Embedding) algorithm; and the like. The embodiment of the present invention is not specifically limited in this respect.

For the embodiment of the invention, the static features are features extracted directly from the binary target file and output in the form of a feature vector. Static feature extraction is performed on the target file, and the features are divided into binary features and disassembly features. Binary features include, but are not limited to: sequence features (N-gram), file metadata, the information entropy of the file, an image representation of the file, the length distribution of character strings in the file, and the like. Disassembly-based features include, but are not limited to: metadata information, symbol information, operator information, register information, API use information, segment structure information, data definition information, and the like. The embodiments of the present invention are not specifically limited in this respect.
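As an illustration of three of the binary features listed above (information entropy, byte N-grams, and string length distribution), the following Python sketch computes them from raw bytes. The function names and the minimum string length of 4 are assumptions made for this example, not values taken from the patent.

```python
import math
import re
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def byte_ngram_counts(data, n=2):
    """Count every contiguous n-byte sequence in the file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def string_length_histogram(data, min_len=4):
    """Histogram of the lengths of printable ASCII runs (assumed min_len)."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return Counter(len(run) for run in runs)
```

High entropy across a whole file is a common hint that its content is compressed or encrypted, which is one reason entropy is useful alongside N-gram counts for packed samples.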

102. Input the dynamic feature vector and the static feature vector of the target file into a preset classifier, and calculate the file content malicious probability of the target file.

In the embodiment of the invention, the preset classifier first combines the dynamic feature vector and the static feature vector of the target file into a single feature vector of higher dimensionality, and then inputs this vector into a deep neural network for classification to obtain the file content malicious probability of the target file, which expresses the probability that the target file contains malicious content. The preset classifier is trained as an SDA (Stacked Denoising Autoencoder, in which each layer is based on a denoising autoencoder algorithm). The SDA is a kind of deep neural network; its structure is a multi-layer autoencoder network followed by a multi-layer fully connected network with dropout, and the output layer finally outputs the file content malicious probability of the target file through a sigmoid function.
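A minimal sketch of the inference path may help: the two vectors are joined, passed through a stack of layers, and squashed by a sigmoid output. It assumes that "combining" means concatenation, uses sigmoid activations in every layer, and omits dropout (which is normally active only during training). The layer representation and all names are hypothetical; trained weights would come from the pre-training and fine-tuning stages.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def content_malicious_probability(dynamic_vec, static_vec, layers):
    """Concatenate both vectors (assumed) and run them through the layers.

    `layers` is a list of (weights, biases) pairs; the last layer must have
    exactly one output unit, whose sigmoid output is the probability.
    """
    activation = list(dynamic_vec) + list(static_vec)
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation[0]


def random_layers(sizes, seed=0):
    """Untrained placeholder weights, only for exercising the forward pass."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers
```

For example, `content_malicious_probability([0.1, 0.2, 0.3], [0.4, 0.5], random_layers([5, 4, 3, 1]))` returns a value strictly between 0 and 1; with untrained weights that value carries no meaning.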

The training process for the SDA is divided into two phases, a pre-training phase and a fine-tuning phase. The pre-training phase is an unsupervised learning process that trains the initial parameters of the autoencoder layers one by one; that is, the autoencoder of the nth layer can be pre-trained only after the first n-1 layers have been determined. In the pre-training of each autoencoder, the encoder-decoder parameters are determined by training a three-layer encoder-decoder neural network: noisy data are input into the network, the comparison target is the original data without noise, the error between the network output and the original data is iteratively minimized through back propagation, and the encoder parameters are finally obtained. The fine-tuning stage is performed after pre-training of all layers is complete and is a supervised learning process; its method is identical to the back propagation process of a classical BP neural network and is used for the final fine-tuning of the parameters of each layer.
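The pre-training of a single autoencoder layer can be written compactly. The following is the usual textbook formulation of a denoising autoencoder, given here as a sketch; the symbols $q$, $f$, $g$, $W$, $b$, $W'$, $b'$, $y$ and the squared-error loss are assumptions, since the patent only states that the error between the network output and the noise-free original is minimized.

```latex
\tilde{x} \sim q(\tilde{x} \mid x), \qquad
y = f(W\tilde{x} + b), \qquad
z = g(W'y + b'), \qquad
\min_{W,\,b,\,W',\,b'} \; \sum_{x} \lVert x - z \rVert^{2}
```

Here $x$ is the noise-free input, $\tilde{x}$ its noisy version, $y$ the encoder output, and $z$ the decoder reconstruction. After a layer is trained, only the encoder parameters $W$ and $b$ are kept, and $y$ computed from the noise-free input becomes the input of the next layer.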

103. Identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file.

The file source malicious probability of the target file is determined according to the source information of the target file, and the file source malicious probability is used for indicating the probability that the source of the target file is possibly a malicious source. The source information of the target file includes a URL (Uniform Resource Locator), an IP (Internet Protocol Address), a mail sender, and the like of the source of the target file.

It should be noted that the file content malicious probability output by the preset classifier is obtained by considering only the content of the target file. Although in theory whether a file is malicious is determined entirely by its content, in practice it is difficult to identify a file accurately from its content alone, and assisting the identification with environmental factors, such as the credibility of the source website and the credibility of the sender of the source mail, works very well. Therefore, the embodiment of the invention identifies the target file according to both the file content malicious probability and the file source malicious probability of the target file, which can improve the identification accuracy for malicious files.

For the embodiment of the invention, whether the target file is a malicious file can specifically be identified through Bayesian inference; that is, the file content malicious probability and the file source malicious probability of the target file are fused according to Bayes' formula to obtain the result of whether the target file is a malicious file. The basic logic is based on Bayes' theorem:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)}$$

where P(m | s) is the probability that the file is malicious, in a specific environment, when the file content malicious probability output by the preset classifier is s; P(s | m) is the probability that the preset classifier outputs s when the target file is a malicious file; P(m) is the probability that the source of the target file belongs to a malicious source in the specific environment, i.e. the file source malicious probability of the target file; and P(s) is the probability that the preset classifier outputs the value s as the file content malicious probability of the target file in the specific environment.
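The fusion step can be sketched in a few lines of Python. The function applies Bayes' theorem to the three quantities defined above; the clamping to 1.0 and the 0.5 decision threshold are assumptions added for illustration, since the patent does not state how the final probability is turned into a verdict or how inconsistent estimates are handled.

```python
def posterior_malicious(p_s_given_m, p_m, p_s):
    """Bayes' theorem: P(m | s) = P(s | m) * P(m) / P(s).

    p_s_given_m: probability that the classifier outputs s for a malicious file
    p_m:         file source malicious probability (used as the prior)
    p_s:         probability that the classifier outputs s in this environment
    """
    if not 0.0 < p_s <= 1.0:
        raise ValueError("p_s must be in (0, 1]")
    for p in (p_s_given_m, p_m):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    # Independently estimated inputs can be inconsistent; clamp (assumption).
    return min(1.0, p_s_given_m * p_m / p_s)


def is_malicious(p_s_given_m, p_m, p_s, threshold=0.5):
    """Hypothetical decision rule: compare the posterior with a threshold."""
    return posterior_malicious(p_s_given_m, p_m, p_s) >= threshold
```

With `p_s_given_m=0.8`, `p_m=0.1`, and `p_s=0.2`, the posterior is 0.4: a suspicious classifier output from a mostly trustworthy source stays below the assumed threshold, while the same output from a source with `p_m=0.2` would reach 0.8.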

The embodiment of the invention provides a method for identifying a malicious file, which first obtains a dynamic feature vector and a static feature vector of a target file, then inputs both vectors into a preset classifier to obtain the file content malicious probability of the target file, and finally identifies whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file. Because the file content malicious probability output by the preset classifier is obtained without considering the source of the target file, and therefore with insufficient information, it is used only as an intermediate result; whether the target file is a malicious file is identified from this intermediate result together with the file source malicious probability of the target file, which improves the identification accuracy for malicious files.

In order to better explain the identification method of the malicious file provided by the embodiment of the present invention, the following embodiment refines and expands the steps described above.

In the embodiment of the present invention, the process of obtaining the file source malicious probability of the target file is as follows: acquiring source information of the target file; and determining the file source malicious probability of the target file by matching the source information of the target file with malicious source data in a preset malicious source library.

The preset malicious source library stores source environment factors of target files, namely the probability that a file is malicious under a certain specific condition. These specific conditions include, but are not limited to: IP information, URL information, and mail sender information. The preset malicious source library is generated by an external, independent system; it can be built in-house, purchased as a commercial malicious source library, or obtained by participating in the sharing programs of security alliances, and the embodiment of the invention is not specifically limited in this respect. When the environmental information of a target file matches multiple types of entries in the preset malicious source library, the entries need to be combined. For example, if the source IP and the sender of a target file both match the preset malicious source library, with output probabilities a and b respectively, the library outputs the combination of the two probabilities, i.e. the file source malicious probability of the target file is 1 - (1 - a)(1 - b).
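The combination rule 1 - (1 - a)(1 - b) generalizes naturally to any number of matched entries. The following Python sketch shows this; the dictionary-based library, its `(kind, value)` keys, and the sample entries are hypothetical stand-ins for a real malicious source library.

```python
def combine_source_probabilities(probabilities):
    """Combine matched entries as 1 - (1 - a)(1 - b)...; no match gives 0.0."""
    benign = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
        benign *= 1.0 - p
    return 1.0 - benign


def source_malicious_probability(source_info, library):
    """Look up each (kind, value) pair of the source info in the library."""
    matched = [library[item] for item in source_info.items() if item in library]
    return combine_source_probabilities(matched)


# Hypothetical library entries and source information, for illustration only.
EXAMPLE_LIBRARY = {
    ("ip", "203.0.113.7"): 0.3,
    ("sender", "billing@example.invalid"): 0.5,
}
EXAMPLE_SOURCE = {
    "ip": "203.0.113.7",
    "sender": "billing@example.invalid",
    "url": "http://example.invalid/a.doc",
}
```

For the example above, the IP and sender both match, so the result is 1 - 0.7 × 0.5 = 0.65. The combined value is never lower than the largest individual probability, so each additional matching entry can only raise the source probability.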

It should be noted that a considerable share of malicious files only use some hosts as springboards to penetrate other hosts, and their true malicious behavior appears only on the latter; this kind of springboard attack is very common in modern Advanced Persistent Threats (APT). Existing sandbox systems emphasize only the emulation of the host environment of the target file and ignore the emulation of its network environment, which makes them less able to identify this type of malware.

In order to solve the problem, the embodiment of the invention obtains the dynamic feature vector of the target file through the network sandbox system, namely, the target file is put into the network sandbox system to be executed, and the behavior log of the target file is obtained; the network sandbox system is composed of a virtual switching network composed of a group of virtual machines; and acquiring the dynamic characteristic vector of the target file from the behavior log.

The network sandbox system in the embodiment of the invention closely simulates a real network environment. It is composed of a virtual switching network composed of a group of virtual machines, and common enterprise-level services and systems (such as a Windows update server, Oracle, Exchange and the like) are deployed on different virtual machines in the network. An agent program for information collection runs on each virtual machine in the network sandbox system, so that when a target file running on its host virtual machine penetrates another virtual machine in the sandbox (for example, through a remote overflow attack), the information collection agent running on the latter records the abnormal behavior.

Because the network sandbox system in the embodiment of the invention replaces the original single-virtual-machine sandbox with a network of n virtual machines, identification capability improves, but at n times the resource cost. To address this, during the training period each sandbox is a clean environment composed of n virtual machines, which improves the identification accuracy for malicious files. During the identification period, the n virtual machines process n target files simultaneously, for example with each virtual machine running one file. In reality most target files are not malicious, so processing n target files at once improves identification efficiency. If a malicious file is detected among the n virtual machines, the n target files are processed again in a clean environment to determine which of them is the malicious file; if no malicious file is detected, the n virtual machines continue with the next batch of target files.

For example, suppose 20 target files must be checked for malicious content and the network sandbox system includes 10 virtual machines. The first 10 target files are distributed evenly across the 10 virtual machines for execution. If no problem is found, the next 10 target files are distributed evenly across the 10 virtual machines; if a problem target file is then found, these 10 target files are processed again in a clean environment to determine which of them is the malicious file.
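The batch-then-recheck strategy can be sketched as follows. The two callbacks are hypothetical stand-ins for the network sandbox: `run_batch` executes up to n files together (one per virtual machine) and reports whether any abnormal behavior was observed, and `run_clean` re-executes a single file in a clean n-machine environment.

```python
def identify_in_batches(files, n_vms, run_batch, run_clean):
    """Return the files confirmed as malicious.

    files:     sequence of target files (any identifiers)
    n_vms:     number of virtual machines in the network sandbox
    run_batch: callable(batch) -> bool, True if the shared run looks abnormal
    run_clean: callable(file) -> bool, True if the file is malicious when
               executed alone in a clean environment
    """
    if n_vms <= 0:
        raise ValueError("n_vms must be positive")
    malicious = []
    for start in range(0, len(files), n_vms):
        batch = files[start:start + n_vms]
        # Cheap pass: all files of the batch share one sandbox network.
        if run_batch(batch):
            # Expensive pass only for suspicious batches.
            malicious.extend(f for f in batch if run_clean(f))
    return malicious
```

In the 20-file example with 10 virtual machines and one malicious file in the second batch, this makes two shared runs plus ten clean reruns, instead of twenty clean runs. The saving depends on the assumption stated in the text that most target files are benign.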

Because existing anomaly detection models are trained only on real malicious file samples, they lack a generalization step for malicious files, and their ability to identify malicious file variants is limited. To solve this problem, the embodiment of the invention adds specific noise while training the preset classifier so that the classifier identifies variants well; that is, the preset classifier is trained with malicious file samples and added noise.

Specifically, the preset classifier is trained with the SDA algorithm. An important reason for selecting the SDA is that artificially adding noise in the pre-training stage improves the noise resistance of the system, which is very effective for identifying malware variants and evasion. The anti-noise process is as follows: from an input vector $x$, a noisy vector $\tilde{x}$ is generated by a noise-adding process; $\tilde{x}$ is passed through the encoder and decoder to produce $z$; and the error function is defined as the difference between $x$ and $z$. Iteratively minimizing the error function makes the encoder robust to noise.
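The denoising step can be illustrated with a single autoencoder layer. The sketch below makes several assumptions that are not in the text: masking noise stands in for the rule-based noise system, the decoder reuses the encoder weights (tied weights), and the "difference between x and z" is taken as a squared error. All function names are hypothetical.

```python
import math
import random
from typing import List


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def corrupt(x: List[float], drop_prob: float, rng: random.Random) -> List[float]:
    """Assumed masking noise: zero each component with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def encode(x_noisy: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Hidden code h = sigmoid(W * x_noisy + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x_noisy)) + bj)
            for row, bj in zip(W, b)]


def decode(h: List[float], W: List[List[float]], c: List[float]) -> List[float]:
    """Reconstruction z = sigmoid(W^T * h + c), using tied weights (assumption)."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + c[i])
            for i in range(len(c))]


def reconstruction_error(x: List[float], z: List[float]) -> float:
    """Squared error between the clean input x and the reconstruction z."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0]                       # clean feature vector
W = [[0.5, -0.3, 0.8, 0.1], [-0.2, 0.4, 0.1, 0.7]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0, 0.0]

x_noisy = corrupt(x, 0.3, rng)                 # noisy version of x
z = decode(encode(x_noisy, W, b), W, c)        # reconstruction from the noisy input
error = reconstruction_error(x, z)             # compared against the clean x
```

The key point is that the error is measured against the clean input rather than the corrupted one; training would adjust `W`, `b` and `c` to reduce this error over many samples.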

In the embodiment of the present invention, $x$ is the direct input of the preset classifier, i.e. the feature vector of the target file obtained by static and dynamic feature extraction, while $\tilde{x}$ is the noisy feature vector generated by the noise system in a specific way: noise is added at three points, namely before the malicious file sample enters the network sandbox system, before static feature extraction, and after the network sandbox system.

As shown in FIG. 2, the noisy feature vector $\tilde{x}$ of a malicious file sample is obtained as follows: the malicious file sample, together with first noise added according to a preset noise knowledge base, is put into the network sandbox system for execution to obtain a behavior log of the malicious file sample; a dynamic feature vector of the malicious file sample is obtained from the behavior log of the malicious file sample and second noise added according to the preset noise knowledge base; a static feature vector of the malicious file sample is obtained from the malicious file sample and third noise added according to the preset noise knowledge base; and the noise feature vector $\tilde{x}$ of the malicious file sample is obtained from the static feature vector and the dynamic feature vector of the malicious file sample.

It should be noted that the noise system adds noise according to preset rules in the preset noise knowledge base. Each rule is a set of actions formulated from the experience of malicious file analysts; these actions produce different feature results without changing the nature of the malware. Simple examples of such rules are "malware is still malware after being compressed or packed" and "a malware sandbox log with log entries of normal software inserted is still a malware log"; the embodiment of the present invention is not particularly limited in this respect. These rules need not be absolutely correct, only correct with high probability, and the small number of errors they introduce can be eliminated by the subsequent neural network. The function of the noise system is to select one or more rules in the knowledge base and apply them to the target data. The noise system works only during the training period of the system and no longer works during the operation period of the classifier.
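A rule-based noise system of this kind might look like the sketch below. The log format (`action:target` strings), the rule names, the benign entries and the second rule are all illustrative assumptions; only the first rule corresponds to an example given in the text.

```python
import random
from typing import Callable, Dict, List

LogRule = Callable[[List[str], random.Random], List[str]]


def insert_benign_entries(log: List[str], rng: random.Random) -> List[str]:
    """Rule from the text: a malware log with benign entries mixed in is still malware."""
    benign = ["open_file:readme.txt", "query_registry:locale"]  # hypothetical entries
    noisy = list(log)
    for entry in benign:
        noisy.insert(rng.randrange(len(noisy) + 1), entry)
    return noisy


def duplicate_random_entry(log: List[str], rng: random.Random) -> List[str]:
    """Assumed rule: repeating one logged action does not change the verdict."""
    if not log:
        return list(log)
    i = rng.randrange(len(log))
    return log[:i + 1] + [log[i]] + log[i + 1:]


# Hypothetical noise knowledge base: rule name -> transformation.
NOISE_KNOWLEDGE_BASE: Dict[str, LogRule] = {
    "insert_benign_entries": insert_benign_entries,
    "duplicate_random_entry": duplicate_random_entry,
}


def add_noise(log: List[str], rng: random.Random, max_rules: int = 2) -> List[str]:
    """Select one or more rules from the knowledge base and apply them in turn."""
    k = rng.randint(1, min(max_rules, len(NOISE_KNOWLEDGE_BASE)))
    names = rng.sample(sorted(NOISE_KNOWLEDGE_BASE), k)
    noisy = list(log)
    for name in names:
        noisy = NOISE_KNOWLEDGE_BASE[name](noisy, rng)
    return noisy


rng = random.Random(1)
log = ["create_process:a.exe", "connect:10.0.0.5", "write_file:x.dll"]
noisy_log = add_noise(log, rng)
```

Calling `add_noise` repeatedly with different random states yields several noisy variants of the same sample, each of which preserves the original malicious actions in order.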

For the embodiment of the present invention, after the noise feature vector corresponding to the malicious file sample is obtained, the preset classifier is trained with the feature vector corresponding to the noise-free malicious file sample and the noise feature vector corresponding to the noise-added malicious file sample. The feature vector of the noise-free malicious file sample is obtained in the same way as the noise feature vector of the noise-added malicious file sample, which is not described herein again. After the preset classifier completes the training stage, the parameters of all its layers are determined and the preset classifier can enter the operation stage, in which classification and identification are carried out by the preset classifier. The operation stage is a typical neural network forward propagation, and the file content malicious probability of the identified target file is finally output through a sigmoid function.

It should be noted that, by adding noise to a malicious file sample, one malicious file sample can generate a plurality of noisy feature vectors. Noise is added to malicious file samples for two reasons: first, malicious file samples are harder to obtain than normal file samples, so generating several noisy feature vectors per sample helps relieve sample imbalance; second, normal files, unlike malicious files, usually exhibit no evasion behavior, so adding noise to normal files does not noticeably improve the final recognition effect. By introducing the noise system, the embodiment of the invention therefore greatly improves the generalization capability of the preset classifier in identifying malicious files, and greatly improves the identification of variants and evasion.

Specifically, determining whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file includes: substituting the probability of the file content malicious probability when the target file is a malicious file, the probability of the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file; and determining whether the target file is a malicious file according to the malicious file probability of the target file. That is, the basic logic of the embodiment of the invention is based on Bayes' theorem:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)}$$

This formula gives the malicious file probability of the target file, where $P(m \mid s)$ is the probability that the target file is a malicious file when the file content malicious probability output by the preset classifier is $s$ in a specific environment; $P(s \mid m)$ is the probability that the preset classifier outputs $s$ when the target file is a malicious file; $P(m)$ is the probability that the source of the target file belongs to a malicious source in a specific environment, i.e. the file source malicious probability of the target file; and $P(s)$ is the probability that the preset classifier outputs a file content malicious probability value of $s$ for the target file in a specific environment. This probability is transformed to the following equation:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s \mid m)\,P(m) + P(s \mid b)\,P(b)}$$

further:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s \mid m)\,P(m) + P(s \mid b)\,\bigl(1 - P(m)\bigr)}$$

where $b$ represents a non-malicious file. In the above formula, $P(m)$ is the file source malicious probability of the target file output according to the preset malicious source library. To obtain $P(m \mid s)$, $P(s \mid m)$ and $P(s \mid b)$ must also be known; these two probabilities are estimated by a probability density estimation method, such as a histogram method or a kernel method. $P(m \mid s)$ is then taken as the malicious file probability of the target file, so that the file content probability obtained from the preset classifier is fused with the probability arising from environmental factors, which improves the identification precision of malicious files.
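As a rough illustration of this fusion step, the sketch below estimates $P(s \mid m)$ and $P(s \mid b)$ with a histogram over classifier scores in [0, 1] and then applies Bayes' theorem with $P(s)$ expanded over malicious and non-malicious files. The function names, the bin count and the add-one smoothing are assumptions made for the example.

```python
from typing import List, Sequence


def histogram_density(samples: Sequence[float], bins: int = 10) -> List[float]:
    """Histogram density estimate on [0, 1]; returns one density value per bin."""
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(samples)
    # Add-one smoothing (assumption) keeps every bin strictly positive.
    return [(c + 1) / (total + bins) * bins for c in counts]


def density_at(density: Sequence[float], s: float) -> float:
    """Look up the estimated density for a classifier score s in [0, 1]."""
    bins = len(density)
    return density[min(int(s * bins), bins - 1)]


def posterior_malicious(p_s_given_m: float, p_s_given_b: float, p_m: float) -> float:
    """P(m|s) = P(s|m)P(m) / (P(s|m)P(m) + P(s|b)(1 - P(m)))."""
    numerator = p_s_given_m * p_m
    return numerator / (numerator + p_s_given_b * (1.0 - p_m))


# Hypothetical classifier scores observed on labelled validation files.
malicious_scores = [0.91, 0.85, 0.97, 0.62, 0.88]
benign_scores = [0.05, 0.12, 0.33, 0.08, 0.21, 0.15]

dens_m = histogram_density(malicious_scores)
dens_b = histogram_density(benign_scores)

s = 0.87      # classifier output for the target file
p_m = 0.30    # file source malicious probability from the source library
p = posterior_malicious(density_at(dens_m, s), density_at(dens_b, s), p_m)
```

The same classifier score leads to a higher posterior when the file comes from a source with a higher prior, which is the effect the fusion is meant to achieve.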

An applicable scenario for the embodiment of the present invention is shown in fig. 3, although the invention is not limited thereto. The input information in this scenario is divided into two parts: one part is the target file, in its binary form, and the other part is the source information of the target file, including the source URL, IP, mail sender and the like. The analysis of the target file is divided into a dynamic analysis part and a static analysis part: the dynamic analysis uses the virtual execution capability of the network sandbox system to analyze the run-time behavior of the target file and obtains the dynamic feature vector of the target file from the analysis result, while the static analysis extracts the static feature vector directly from the binary data of the target file. The feature vectors extracted by the dynamic and static analysis are input into the preset classifier to obtain the file content malicious probability of the target file; the file source malicious probability of the target file is determined by matching the source information of the target file against malicious source data in a preset malicious source library; and the final malicious probability of the target file is obtained by a Bayesian calculation over the file content malicious probability and the file source malicious probability of the target file.

In this application scenario, the target file is analyzed by both dynamic and static methods, mainly because the two are complementary: some obfuscated and packed malicious programs easily evade static inspection, while some malicious programs with long dormancy periods or strict trigger conditions easily evade dynamic analysis, so combining the two methods gives better inspection results in practice. In addition, the probability output by the preset classifier is obtained without considering the source of the file, so its information is incomplete; it is therefore used only as an intermediate result in this scenario. Whether the target file is a malicious file is determined from this intermediate result together with the probability provided by the preset malicious source library, which improves the identification precision of malicious files.

Further, an embodiment of the present invention provides an apparatus for identifying a malicious file. As shown in fig. 4, the apparatus includes: an obtaining unit 21, a calculating unit 22, and an identifying unit 23.

An obtaining unit 21, configured to obtain a dynamic feature vector and a static feature vector of a target file;

The obtaining unit 21 obtains the dynamic feature vector and the static feature vector from the target file in binary form. The analysis of the target file is divided into a dynamic analysis part and a static analysis part: the dynamic analysis uses the virtual execution capability of the sandbox system to analyze the run-time behavior of the target file and obtains the dynamic feature vector of the target file from the analysis result, while the static analysis extracts features directly from the binary data of the target file and obtains the static feature vector of the target file from those features.

For the embodiment of the present invention, the obtaining unit 21 obtains both the dynamic feature vector and the static feature vector of the target file; that is, the target file is analyzed by both a dynamic analysis method and a static analysis method, mainly because the two are complementary: some obfuscated and packed malicious files easily evade static inspection, while some malicious files with long dormancy periods or strict trigger conditions easily evade dynamic analysis, so combining the two methods gives better inspection results in practice.

The calculating unit 22 is configured to input the dynamic feature vector and the static feature vector of the target file into a preset classifier, and calculate a file content malicious probability of the target file;

The calculating unit 22 first combines the dynamic feature vector and the static feature vector of the target file into a feature vector of higher dimensionality within the preset classifier, and then inputs this feature vector into the deep neural network for classification to obtain the file content malicious probability of the target file, which represents the probability that the target file contains malicious content. The preset classifier is trained with an SDA (Stacked Denoising Autoencoder, in which each layer is based on the denoising autoencoder algorithm). The SDA is a kind of deep neural network; its structure consists of a multi-layer auto-encoder neural network followed by a multi-layer fully-connected network with dropout, and the output layer finally outputs the file content malicious probability of the target file through a sigmoid function.
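The operation-stage forward pass of such a classifier can be sketched as below. This is only a schematic: the layer sizes and weights are made up, sigmoid hidden units are an assumption, and dropout is omitted because it is normally inactive at inference time. `content_malicious_probability` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def dense(x: Sequence[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully-connected layer with sigmoid activation."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(weights, biases)]


def content_malicious_probability(
    dynamic_vec: Sequence[float],
    static_vec: Sequence[float],
    hidden_layers: Sequence[Layer],
    out_weights: Sequence[float],
    out_bias: float,
) -> float:
    """Concatenate both feature vectors, propagate forward, and squash with a sigmoid."""
    x = list(dynamic_vec) + list(static_vec)   # higher-dimensional combined vector
    for weights, biases in hidden_layers:      # encoder and fully-connected layers
        x = dense(x, weights, biases)
    logit = sum(w * xi for w, xi in zip(out_weights, x)) + out_bias
    return sigmoid(logit)                      # value in (0, 1)


# Toy parameters (hypothetical): 2 dynamic + 2 static features, one hidden layer.
hidden = [([[0.2, -0.1, 0.4, 0.3], [0.5, 0.1, -0.3, 0.2]], [0.0, 0.1])]
p = content_malicious_probability([2.0, 0.0], [1.0, 3.0], hidden, [1.5, -0.5], 0.0)
```

In a trained classifier the hidden layers would be initialised from the pre-trained denoising autoencoders and then fine-tuned; here they are fixed constants for illustration.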

The identifying unit 23 is configured to identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, where the file source malicious probability of the target file is determined according to the source information of the target file.

The embodiment of the invention provides a malicious file identification device, which first obtains a dynamic feature vector and a static feature vector of a target file, then inputs the dynamic feature vector and the static feature vector of the target file into a preset classifier to obtain the file content malicious probability of the target file, and finally identifies whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file. Because the file content malicious probability output by the preset classifier is obtained without considering the source of the target file, its information is incomplete; it is therefore used only as an intermediate result, and whether the target file is a malicious file is identified from this intermediate result together with the file source malicious probability of the target file, which improves the identification precision of malicious files.

Further, as shown in fig. 5, the apparatus further includes:

the obtaining unit 21 is further configured to obtain source information of the target file;

the determining unit 24 is configured to determine a file source malicious probability of the target file by matching the source information of the target file with malicious source data in a preset malicious source library.
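Matching source information against a malicious source library could be sketched as follows. The library contents, the default prior for unknown sources, and in particular the noisy-OR rule used to combine several matched entries are assumptions; the text states only that a combination of the matched entry types is used.

```python
from typing import Dict, Mapping

# Hypothetical malicious source library: entry type -> value -> malicious probability.
MALICIOUS_SOURCE_LIBRARY: Dict[str, Dict[str, float]] = {
    "url": {"http://malware.example/payload": 0.9},
    "ip": {"192.0.2.10": 0.6},
    "sender": {"spam@example.org": 0.7},
}

DEFAULT_PRIOR = 0.01  # assumed probability for sources with no library match


def source_malicious_probability(
    source_info: Mapping[str, str],
    library: Mapping[str, Mapping[str, float]] = MALICIOUS_SOURCE_LIBRARY,
    default: float = DEFAULT_PRIOR,
) -> float:
    """Match each piece of source information and combine the matched probabilities."""
    matched = [
        library[kind][value]
        for kind, value in source_info.items()
        if value in library.get(kind, {})
    ]
    if not matched:
        return default
    # Noisy-OR combination (assumption): the source is clean only if every
    # matched indicator is independently clean.
    p_clean = 1.0
    for p in matched:
        p_clean *= 1.0 - p
    return 1.0 - p_clean


p_source = source_malicious_probability(
    {"url": "http://malware.example/payload", "ip": "192.0.2.10", "sender": "a@example.com"}
)
```

Other combination rules, such as taking the maximum of the matched probabilities, would fit the same interface.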

In the embodiment of the present invention, the dynamic feature vector of the target file is obtained through a network sandbox system. As shown in fig. 5, the obtaining unit 21 includes: an execution module 211, configured to place the target file into a network sandbox system for execution to obtain a behavior log of the target file, where the network sandbox system is composed of a virtual switching network formed by a group of virtual machines; and an obtaining module 212, configured to obtain the dynamic feature vector of the target file from the behavior log.
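How a behavior log becomes a dynamic feature vector is not specified in the text. One simple possibility, shown here purely as an assumption, is to count how often each action from a fixed vocabulary appears in the log; the `action:target` log format and the vocabulary are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

# Hypothetical vocabulary of sandbox-observed actions; its order fixes the vector layout.
BEHAVIOR_VOCABULARY = ["create_process", "write_file", "connect", "modify_registry"]


def dynamic_feature_vector(
    behavior_log: Sequence[str],
    vocabulary: Sequence[str] = BEHAVIOR_VOCABULARY,
) -> List[float]:
    """Count occurrences of each known action; actions outside the vocabulary are ignored."""
    counts = Counter(entry.split(":", 1)[0] for entry in behavior_log)
    return [float(counts[action]) for action in vocabulary]


log = ["connect:10.0.0.5", "connect:10.0.0.6", "write_file:x.dll", "sleep:1000"]
vec = dynamic_feature_vector(log)
```

A fixed-length vector of this kind can be concatenated directly with the static feature vector before classification.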

Because existing anomaly detection models are trained only on real malicious file samples, they lack a generalization step for malicious files, and their ability to identify malicious file variants is limited. To solve this problem, the embodiment of the present invention adds specific noise while training the preset classifier so that the classifier identifies variants well; that is, the preset classifier is trained by the training unit 25. The training unit 25 is configured to train the preset classifier with malicious file samples and added noise.

Specifically, as shown in fig. 5, the training unit 25 includes:

the obtaining module 251 is configured to place the malicious file sample and the first noise added according to the preset noise knowledge base into a network sandbox system for execution, so as to obtain a behavior log of the malicious file sample;

the obtaining module 251 is configured to obtain a dynamic feature vector of the malicious file sample through a behavior log of the malicious file sample and a second noise added according to a preset noise knowledge base;

the obtaining module 251 is configured to obtain a static feature vector of the malicious file sample through the malicious file sample and a third noise added according to a preset noise knowledge base;

the obtaining module 251 is configured to obtain a noise feature vector of the malicious file sample according to the static feature vector and the dynamic feature vector of the malicious file sample;

the training module 252 is configured to train the preset classifier by using the feature vector corresponding to the noise-free malicious file sample and the noise feature vector corresponding to the noise-added malicious file sample.

Specifically, as shown in fig. 5, the determining unit 24 includes:

a calculating module 241, configured to substitute the probability of the file content malicious probability when the target file is a malicious file, the probability of the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file;

a determining module 242, configured to determine whether the target file is a malicious file according to the malicious file probability of the target file.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the malicious file identification method and device according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims

1. A method for identifying malicious files, comprising:

identifying whether the target file is a malicious file or not according to the file content malicious probability of the target file and the file source malicious probability of the target file, wherein the file source malicious probability of the target file is determined according to the source information of the target file, and when the source information of the target file is matched with a plurality of types of entries in a preset malicious source library, the file source malicious probability of the target file is obtained based on the combination of the plurality of types of contents;

the method further comprises the following steps:

training the preset classifier through a malicious text sample and added noise;

the training of the preset classifier through malicious text and added noise comprises:

2. The method of claim 1, wherein before identifying whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the method further comprises:

acquiring source information of the target file;

3. The method of claim 1, wherein the obtaining the dynamic feature vector of the target file comprises:

and acquiring the dynamic characteristic vector of the target file from the behavior log of the target file.

4. The method of claim 1, wherein identifying whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file comprises:

substituting the probability of the file content malicious probability when the target file is a malicious file, the probability of the file content malicious probability of the target file, and the file source malicious probability of the target file into a Bayesian formula to calculate the malicious file probability of the target file;

5. An apparatus for identifying malicious files, comprising:

the calculation unit is used for inputting the dynamic characteristic vector and the static characteristic vector of the target file into a preset classifier to obtain the file content malicious probability of the target file;

the identification unit is used for identifying whether the target file is a malicious file or not according to the file content malicious probability of the target file and the file source malicious probability of the target file, wherein the file source malicious probability of the target file is determined according to the source information of the target file, and when the source information of the target file is matched with a plurality of types of entries in a preset malicious source library, the file source malicious probability of the target file is obtained based on the combination of the plurality of types of contents;

the device further comprises:

the training unit is used for training the preset classifier through malicious text samples and added noise;

the training unit includes: the acquisition module is used for putting the malicious file sample and first noise added according to a preset noise knowledge base into a network sandbox system for execution to obtain a behavior log of the malicious file sample;

6. The apparatus of claim 5, further comprising:

7. The apparatus of claim 6, wherein the obtaining unit comprises:

the execution module is used for putting the target file into a network sandbox system for execution to obtain a behavior log of the target file; the network sandbox system is composed of a virtual switching network composed of a group of virtual machines;

and the acquisition module is used for acquiring the dynamic characteristic vector of the target file from the behavior log of the target file.

8. The apparatus of claim 6, wherein the determining unit comprises: