CN106778241A

CN106778241A - Method and device for identifying malicious files

Info

Publication number: CN106778241A
Application number: CN201611067380.6A
Authority: CN
Inventors: 杜强
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2016-11-28
Filing date: 2016-11-28
Publication date: 2017-05-31
Anticipated expiration: 2036-11-28
Also published as: CN106778241B

Abstract

The invention discloses a method and device for identifying malicious files. It relates to the field of computer security and aims to improve the accuracy of malicious file identification. The main technical scheme of the invention is: obtain a dynamic feature vector and a static feature vector of a target file; input the dynamic feature vector and the static feature vector of the target file into a preset classifier and calculate the file content malicious probability of the target file; and identify whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file, where the file source malicious probability is determined from the source information of the target file. The invention is mainly used for identifying malicious files.

Description

Method and device for identifying malicious files

Technical field

The present invention relates to the field of computer security, and in particular to a method and device for identifying malicious files.

Background art

With the continued development of computer and Internet technology, the number of malicious files has grown explosively, and their attack and camouflage techniques keep becoming more varied and more complex. In addition, the underground supply chain of computer crime keeps maturing and is increasingly industrialized and large in scale, which makes countering malicious files a very challenging problem today.

At present, malicious code is mainly identified by static detection or dynamic detection. Static detection preprocesses the target file and then performs signature matching, that is, matching against a virus database. Dynamic detection identifies the target file mainly by certain behavioral features, for example modifying specific registry entries or opening specific ports.

However, static detection lacks the ability to detect variants of malicious files and new malicious files, while dynamic detection lacks the ability to recognize new malware and malware without obvious behavioral features. Moreover, current approaches are limited to a single static or dynamic detection technique, which makes it easier for malicious files to hide themselves using common evasion techniques, so the accuracy of malicious file identification is low.

Summary of the invention

In view of this, the present invention provides a method and device for identifying malicious files, the main purpose of which is to improve the accuracy of malicious file identification.

According to one aspect of the invention, a method for identifying malicious files is provided, including:

Obtaining a dynamic feature vector and a static feature vector of a target file;

Inputting the dynamic feature vector and the static feature vector of the target file into a preset classifier, and calculating the file content malicious probability of the target file;

Identifying whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, where the file source malicious probability of the target file is determined from the source information of the target file.

Further, before identifying whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file, the method also includes:

Obtaining the source information of the target file;

Determining the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source database.

Specifically, obtaining the dynamic feature vector of the target file includes:

Placing the target file into a network sandbox system for execution and obtaining the behavior log of the target file, where the network sandbox system consists of a virtual switched network formed by a group of virtual machines;

Obtaining the dynamic feature vector of the target file from the behavior log.

Further, the method also includes:

Training the preset classifier with malicious file samples and added noise.

Specifically, training the preset classifier with malicious file samples and added noise includes:

Placing the malicious file sample, together with first noise added according to a preset noise knowledge base, into the network sandbox system for execution, and obtaining the behavior log of the malicious file sample;

Obtaining the dynamic feature vector of the malicious file sample from the behavior log of the malicious file sample and second noise added according to the preset noise knowledge base;

Obtaining the static feature vector of the malicious file sample from the malicious file sample and third noise added according to the preset noise knowledge base;

Obtaining the noise feature vector of the malicious file sample from the static feature vector and the dynamic feature vector of the malicious file sample;

Training the preset classifier with the feature vectors of malicious file samples without added noise and the noise feature vectors of malicious file samples with added noise.

Specifically, determining whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file includes:

Substituting the file content malicious probability when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file;

Determining whether the target file is a malicious file according to the malicious file probability of the target file.

According to another aspect of the invention, a device for identifying malicious files is provided, including:

An acquiring unit, configured to obtain a dynamic feature vector and a static feature vector of a target file;

A computing unit, configured to input the dynamic feature vector and the static feature vector of the target file into a preset classifier and calculate the file content malicious probability of the target file;

A recognition unit, configured to identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, where the file source malicious probability of the target file is determined from the source information of the target file.

Further, the device also includes:

The acquiring unit, further configured to obtain the source information of the target file;

A determining unit, configured to determine the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source database.

Specifically, the acquiring unit includes:

An execution module, configured to place the target file into a network sandbox system for execution and obtain the behavior log of the target file, where the network sandbox system consists of a virtual switched network formed by a group of virtual machines;

An acquisition module, configured to obtain the dynamic feature vector of the target file from the behavior log.

Further, the device also includes:

A training unit, configured to train the preset classifier with malicious file samples and added noise.

Specifically, the training unit includes:

An acquisition module, configured to place the malicious file sample, together with first noise added according to a preset noise knowledge base, into the network sandbox system for execution, and obtain the behavior log of the malicious file sample;

The acquisition module, further configured to obtain the dynamic feature vector of the malicious file sample from the behavior log of the malicious file sample and second noise added according to the preset noise knowledge base;

The acquisition module, further configured to obtain the static feature vector of the malicious file sample from the malicious file sample and third noise added according to the preset noise knowledge base;

The acquisition module, further configured to obtain the noise feature vector of the malicious file sample from the static feature vector and the dynamic feature vector of the malicious file sample;

A training module, configured to train the preset classifier with the feature vectors of malicious file samples without added noise and the noise feature vectors of malicious file samples with added noise.

Specifically, the determining unit includes:

A computing module, configured to substitute the file content malicious probability when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file;

A determining module, configured to determine whether the target file is a malicious file according to the malicious file probability of the target file.

Through the above technical scheme, the technical scheme provided by the embodiments of the present invention has at least the following advantages:

The method and device for identifying malicious files provided by the embodiments of the present invention first obtain the dynamic feature vector and the static feature vector of the target file, then input both vectors into a preset classifier to calculate the file content malicious probability of the target file, and finally identify whether the target file is a malicious file according to its file content malicious probability and its file source malicious probability. Compared with the current practice of identifying malicious code mainly by static detection or dynamic detection, the embodiments of the present invention use a deep learning method that combines the dynamic information, static information and environmental information of the target file to identify it. This addresses two problems: obfuscated or packed malicious files easily evade static inspection, and malicious files that lie dormant for a long time and have strict trigger conditions easily evade dynamic analysis. The embodiments of the present invention therefore improve the ability to identify malicious files.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, identical parts are denoted by the same reference numerals. In the drawings:

Fig. 1 shows a flow chart of a method for identifying malicious files provided by an embodiment of the present invention;

Fig. 2 shows a schematic diagram of noise addition provided by an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the overall malicious file identification process provided by an embodiment of the present invention;

Fig. 4 shows a structural block diagram of a device for identifying malicious files provided by an embodiment of the present invention;

Fig. 5 shows a structural block diagram of another device for identifying malicious files provided by an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.

It should be noted that the terms "first", "second" and the like in the description, claims and drawings of this application are used to distinguish similar objects and do not describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.

According to an embodiment of this application, an embodiment of a method for identifying malicious files is provided. It should be noted that the steps illustrated in the flow chart may be executed in a computer system such as one running a set of computer-executable instructions, and that although a logical order is shown in the flow chart, in some cases the steps shown or described may be executed in a different order.

In order to provide an implementation that improves the accuracy of malicious file identification, the embodiments of the present invention provide a method and device for identifying malicious files. The preferred embodiments of the present invention are described below with reference to the drawings.

An embodiment of the present invention provides a method for identifying malicious files. As shown in Fig. 1, the method includes:

101. Obtain the dynamic feature vector and the static feature vector of the target file.

In step 101, the dynamic feature vector and the static feature vector are obtained from the target file in binary form. The analysis of the target file is divided into two parts, dynamic analysis and static analysis. Dynamic analysis uses the virtual execution capability of a sandbox system to analyze the behavior of the target file at run time, and the dynamic feature vector of the target file is obtained from the analysis result. Static analysis extracts and analyzes features directly from the binary data of the target file, and the static feature vector of the target file is obtained from them.

In the embodiments of the present invention, obtaining both the dynamic feature vector and the static feature vector means that the target file is analyzed by dynamic analysis and static analysis at the same time. The main reason is that the two methods are complementary: some obfuscated or packed malicious files easily evade static inspection, while some malicious files that lie dormant for a long time and have strict trigger conditions easily evade dynamic analysis. In practice, combining the two methods gives better detection results.

It should be noted that a sandbox system is a system with virtual execution capability that collects and analyzes related behavior information. A sandbox system is usually implemented on a group of managed virtual machines. The target file is automatically imported into a virtual machine environment and executed, or opened by the corresponding program (for example, an Office program), and an information collection agent running inside the virtual machine records and outputs the run-time behavior of the target program. In the embodiments of the present invention, after the target file is input into the sandbox system, the behavior record of the target file is output, generally in the form of log information. The log information recorded and output in this embodiment includes but is not limited to: network access information, calls to other applications, access to the file system, access to the system registry, system calls, memory accesses, and other information about the virtual machine. The embodiments of the present invention impose no specific limitation.

After the log corresponding to the target file has been obtained from the sandbox system, the log is converted into a dynamic feature vector that can be used for machine learning. In the embodiments of the present invention, the conversion from log to dynamic feature vector is divided into the following stages: log normalization, feature extraction and dimensionality reduction.

The main function of log normalization is to remove special characters from the log, convert uppercase characters to lowercase, replace timestamps with a unified format, and replace numbers with a unified form. In the feature extraction stage, features are extracted from the log using different methods and combinations of these methods, including but not limited to: extraction of statistical information about timestamps; sequence tag extraction over the document (N-gram); and algorithms based on term frequency or term importance (TF, TF-IDF). The embodiments of the present invention impose no specific limitation.
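As an illustration only, the log normalization and N-gram term-frequency extraction described above might be sketched in Python as follows. The log format, the regular expressions, the placeholder tokens `<ts>` and `<num>`, and the function names are assumptions made for this sketch and are not taken from the disclosure.

```python
import re
from collections import Counter

# Hypothetical patterns: a real sandbox log format would need its own rules.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
NUMBER_RE = re.compile(r"\b\d+\b")
SPECIAL_RE = re.compile(r"[^a-z0-9_<> ]+")


def normalize_log_line(line: str) -> str:
    """Unify timestamps, lowercase, unify numbers, drop special characters."""
    line = TIMESTAMP_RE.sub("<ts>", line)   # timestamps -> one unified token
    line = line.lower()                     # uppercase -> lowercase
    line = NUMBER_RE.sub("<num>", line)     # standalone numbers -> one token
    line = SPECIAL_RE.sub(" ", line)        # remove special characters
    return " ".join(line.split())           # collapse repeated whitespace


def token_ngrams(tokens, n=2):
    """Return the list of n-grams (joined by a space) over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(lines, n=2):
    """Relative frequency (TF) of each n-gram over all normalized lines."""
    counts = Counter()
    for line in lines:
        counts.update(token_ngrams(normalize_log_line(line).split(), n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

The resulting dictionary can be mapped onto a fixed vocabulary to form a raw dynamic feature vector before dimensionality reduction; TF-IDF weighting would additionally require document frequencies across many logs.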

It should be noted that the purpose of dimensionality reduction is to reduce a higher-dimensional feature vector to a lower dimension, in order to improve the computational efficiency of the subsequent machine learning algorithm and optimize storage space. The dimensionality reduction methods that may be used in the embodiments of the present invention include but are not limited to: the PCA (Principal Component Analysis) algorithm; the LDA (Latent Dirichlet Allocation, a topic model) algorithm; and the LLE (Locally Linear Embedding) algorithm. The embodiments of the present invention impose no specific limitation.

In the embodiments of the present invention, static features are features extracted directly from the binary target file and output in the form of a feature vector. Static feature extraction on the target file yields two classes of features: binary features and disassembly features. Binary feature extraction includes but is not limited to: sequence feature extraction (N-gram), extraction of file metadata, extraction of the information entropy of the file, an image representation of the file, and the length distribution of character strings in the file. Disassembly-based feature extraction includes but is not limited to: metadata information, symbol information, operator information, register information, API usage information, section structure information, and data definition information. The embodiments of the present invention impose no specific limitation.
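As a minimal sketch of three of the binary features listed above (byte N-grams, information entropy, and string length distribution), assuming the file is available as a Python `bytes` object. The function names and the minimum string length of 4 are illustrative assumptions, not part of the disclosure.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte sequence in the file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def string_length_histogram(data: bytes, min_len: int = 4) -> Counter:
    """Histogram of the lengths of printable ASCII strings in the file."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return Counter(len(s) for s in re.findall(pattern, data))
```

High entropy is often associated with packed or encrypted content, which is one reason such a feature may complement N-gram counts; how the features are combined into one static feature vector is left open here.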

102. Input the dynamic feature vector and the static feature vector of the target file into the preset classifier, and calculate the file content malicious probability of the target file.

In the embodiments of the present invention, the preset classifier first combines the dynamic feature vector and the static feature vector of the target file into one higher-dimensional feature vector, and then inputs it into a deep neural network for classification to obtain the file content malicious probability of the target file. The file content malicious probability represents the probability that the target file contains malicious content. The preset classifier is built and trained as an SDA (Stacked Denoising Autoencoder, in which every layer is a denoising auto-encoder). An SDA is a kind of deep neural network. Its network structure is a multi-layer auto-encoder neural network followed by a multi-layer fully connected network with dropout, and the final output layer outputs the file content malicious probability of the target file through a sigmoid function.
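The inference path described above (concatenate the two feature vectors, pass them through the stacked layers, and emit a probability through a sigmoid output) might look roughly like the following sketch. It assumes the layer weights have already been trained, uses sigmoid activations in the hidden layers as an arbitrary choice, and omits dropout because dropout is normally disabled at inference time. All names and the weight layout are hypothetical.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activation.

    weights is a list of rows, one row of input weights per output unit.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def content_malicious_probability(dynamic_vec, static_vec, layers):
    """Concatenate both feature vectors and run them through all layers.

    layers is a list of (weights, biases) pairs; the last pair must define
    a single output unit whose sigmoid value is read as the probability.
    """
    activations = list(dynamic_vec) + list(static_vec)
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations[0]
```

For example, with one hidden unit and an all-zero output layer the function returns exactly 0.5, the value of the sigmoid at zero.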

The training process of the SDA is divided into two stages: the pre-training stage and the fine-tuning stage. The pre-training stage is an unsupervised learning process whose purpose is to train the initial parameters of the auto-encoder layers one layer at a time, that is, the auto-encoder of the n-th layer can be pre-trained only after the first n-1 layers have been determined. When each auto-encoder layer is pre-trained, its parameters are determined by training a three-layer encoder-decoder neural network. Noisy data is fed into the three-layer encoder-decoder network, and the target for comparison is the original data without noise. Using backpropagation, the error between the network output and the original data is minimized iteratively, which finally yields the parameters of the encoder. The fine-tuning stage is carried out after pre-training has been completed for all layers. It is a supervised learning process, and its method is exactly the same as the backpropagation process of a classical BP neural network. It is used for the final fine-tuning of the parameters of each layer.

103. Identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file.

The file source malicious probability of the target file is determined from the source information of the target file, and represents the probability that the source of the target file is a malicious source. The source information of the target file includes the URL (Uniform Resource Locator) it came from, the IP (Internet Protocol) address, the e-mail sender and so on. The embodiments of the present invention impose no specific limitation.

It should be noted that the file content malicious probability output by the preset classifier is obtained by considering only the content of the target file. In theory, whether a file is malicious is determined entirely by its content, but in practice it is difficult to achieve high-accuracy identification based on content alone. Environmental factors are therefore needed as a supplement, for example the credibility of the source website or the credibility of the sending e-mail address, and fusing these environmental factors works very well in practice. The embodiments of the present invention therefore identify the target file according to both its file content malicious probability and its file source malicious probability, which can improve the accuracy of malicious file identification.

In the embodiments of the present invention, whether the target file is a malicious file may specifically be identified by Bayesian inference. That is, the file content malicious probability is fused with the file source malicious probability of the target file according to the Bayesian formula to conclude whether the target file is a malicious file. The basic logic is based on Bayes' theorem:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)}$$

Here, $P(m \mid s)$ is the probability that, in a specific environment, the file is a malicious file when the file content malicious probability output by the preset classifier is $s$; $P(s \mid m)$ is the probability that the preset classifier outputs $s$ when the target file is a malicious file; $P(m)$ is the probability that, in the specific environment, the source of the target file is a malicious source, that is, the file source malicious probability of the target file; and $P(s)$ is the probability that, in the specific environment, the file content malicious probability output by the preset classifier for a target file has the value $s$.
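A minimal sketch of this Bayesian fusion is given below. The text does not say how P(s) is obtained, so the sketch assumes it is expanded with the law of total probability, which requires an additional quantity not named in the disclosure: the probability that the classifier outputs s for a non-malicious file. That parameter and the function name are assumptions.

```python
def malicious_file_probability(p_s_given_m: float, p_m: float,
                               p_s_given_benign: float) -> float:
    """Posterior P(m|s) = P(s|m) * P(m) / P(s).

    p_s_given_m:      P(s|m), probability of classifier output s for malware
    p_m:              P(m), file source malicious probability (the prior)
    p_s_given_benign: assumed P(s|not m), used only to expand P(s)
    """
    # Law of total probability: P(s) = P(s|m)P(m) + P(s|not m)(1 - P(m))
    p_s = p_s_given_m * p_m + p_s_given_benign * (1.0 - p_m)
    if p_s == 0.0:
        raise ValueError("P(s) is zero; the posterior is undefined")
    return p_s_given_m * p_m / p_s
```

With P(s|m) = 0.8, P(m) = 0.1 and an assumed P(s|not m) = 0.2, the posterior is 0.08 / 0.26, roughly 0.31. The same classifier output from a source with P(m) = 0.5 gives 0.8, which illustrates how the source prior shifts the final decision.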

The embodiments of the present invention provide a method for identifying malicious files. The method first obtains the dynamic feature vector and the static feature vector of the target file, then inputs both vectors into the preset classifier to obtain the file content malicious probability of the target file, and finally identifies whether the target file is a malicious file according to its file content malicious probability and its file source malicious probability. Because the file content malicious probability output by the preset classifier is obtained without considering the source of the target file, the information behind it is incomplete. The embodiments of the present invention therefore treat the file content malicious probability only as an intermediate result, and identify whether the target file is a malicious file from this intermediate result together with the file source malicious probability of the target file, which improves the accuracy of malicious file identification.

To better explain the method for identifying malicious files provided by the embodiments of the present invention, the following embodiments refine and extend the above steps.

In the embodiments of the present invention, the file source malicious probability of the target file is obtained as follows: obtain the source information of the target file, and determine the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source database.

The preset malicious source database stores the source-related environmental factors of target files, that is, the probability that a file is a malicious file under certain specific conditions. These specific conditions include but are not limited to: IP information, URL information, and e-mail sender information. The preset malicious source database is generated by an independent external system. It may be built in-house, purchased as a commercial malicious source database, or obtained through sharing within a security alliance. The embodiments of the present invention impose no specific limitation. When the environmental information of a target file matches entries of several types in the preset malicious source database, the matched probabilities need to be combined. For example, if the source IP and the sender of a target file both match the preset malicious source database with output probabilities a and b respectively, the database combines the two probabilities for output, that is, the file source malicious probability of the target file is 1 - (1 - a)(1 - b).
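The combination rule 1 - (1 - a)(1 - b) generalizes naturally to any number of matched entries. The following sketch shows that generalization; treating the matches as independent and returning 0 when nothing matches are assumptions of the sketch, and the function name is hypothetical.

```python
def combine_source_probabilities(probabilities) -> float:
    """Combine matched per-entry probabilities as 1 - prod(1 - p_i).

    Each p_i is the malicious probability stored for one matched entry
    (for example a source IP, a URL, or an e-mail sender).
    """
    not_malicious = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        not_malicious *= 1.0 - p
    return 1.0 - not_malicious
```

For a = 0.3 and b = 0.5 this gives 1 - 0.7 * 0.5 = 0.65. The combined value is never lower than the largest individual probability, so each additional match can only raise the file source malicious probability.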

It should be noted that a considerable share of malicious files use some hosts only as a springboard to penetrate other hosts, and show their real malicious behavior only on the latter. This springboard attack pattern is very common in modern advanced persistent threats (APT, Advanced Persistent Threat). Existing sandbox systems emphasize only the emulation of the host environment of the target file and ignore the emulation of its network environment, so they lack the ability to recognize this kind of malware.

To solve this problem, the embodiments of the present invention obtain the dynamic feature vector of the target file through a network sandbox system. That is, the target file is placed into the network sandbox system for execution and the behavior log of the target file is obtained, where the network sandbox system consists of a virtual switched network formed by a group of virtual machines, and the dynamic feature vector of the target file is then obtained from the behavior log.

The network sandbox system in the embodiments of the present invention closely simulates a real network environment. It consists of a virtual switched network formed by a group of virtual machines. Common enterprise-level services and systems (such as a Windows update server, Oracle, Exchange and so on) are deployed on different virtual machines in this network, and an information collection agent runs on every virtual machine in the network sandbox system. When the target file, running on its host virtual machine, penetrates another virtual machine in the sandbox (for example, through a remote flooding attack), the information collection agent running on the latter records the abnormal behavior.

Due to the network sandbox system in the embodiment of the present invention with n Imaginary Mechanism into network instead of original single void The sandbox system of plan machine, while recognition capability is improve, has but paid n times of Resources Consumption.Therefore in order to solve this Problem, the embodiment of the present invention training period employ each sandbox by n virtual robot arm into pure environment be trained, to carry The accuracy of identification of malicious file high；And this n virtual machine then processes n file, such as each virtual machine operation in the identification phase simultaneously One file, realizes improving the recognition efficiency of malicious file with this, and when finally malicious file is identified, system is needed this again N file destination is processed one time in pure environment again, to determine which is only malicious file actually.Need explanation It is that because most file destination is not malicious file in reality, therefore the embodiment of the present invention is virtual by n in the identification phase Machine then processes n file destination simultaneously, can improve the recognition efficiency of malicious file, if there is malicious file in n virtual machine, This n file destination is processed one time in pure environment again again, to determine which is only malicious file actually；If n Do not exist malicious file in virtual machine, then continue through the file destination that n virtual machine processes next group.

For example, suppose 20 target files need to be checked for malicious content and the network sandbox system contains 10 virtual machines. The first 10 target files are distributed evenly across the 10 virtual machines and executed. If no problem is found, the remaining 10 target files are distributed across the 10 virtual machines and executed in the same way. If a problematic target file is found at this point, those remaining 10 target files are processed once more in a clean environment to determine which one is actually the malicious file.
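The batching strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `screen_in_batches`, `batch_is_suspicious` and `file_is_malicious` are hypothetical names, the two callbacks stand in for the shared sandbox run and the clean-environment re-run, and re-checking a flagged batch file by file is one possible reading of "processed once more in a clean environment".

```python
from typing import Callable, List, Sequence


def screen_in_batches(
    files: Sequence[str],
    n_vms: int,
    batch_is_suspicious: Callable[[List[str]], bool],
    file_is_malicious: Callable[[str], bool],
) -> List[str]:
    """Return the files confirmed malicious, n_vms files per shared run."""
    if n_vms < 1:
        raise ValueError("n_vms must be at least 1")
    confirmed: List[str] = []
    for start in range(0, len(files), n_vms):
        batch = list(files[start:start + n_vms])
        # Shared run: one file per virtual machine, the whole batch at once.
        if batch_is_suspicious(batch):
            # Only a flagged batch pays for the clean-environment re-run.
            confirmed.extend(f for f in batch if file_is_malicious(f))
    return confirmed


# Hypothetical stand-ins for the sandbox verdicts (20 files, 10 virtual machines).
KNOWN_BAD = {"file_13"}
files = [f"file_{i}" for i in range(20)]
result = screen_in_batches(
    files,
    n_vms=10,
    batch_is_suspicious=lambda batch: any(f in KNOWN_BAD for f in batch),
    file_is_malicious=lambda f: f in KNOWN_BAD,
)
print(result)
```

In this toy run the first batch passes without a re-run and only the second batch of 10 files is re-examined, which is where the efficiency gain comes from when malicious files are rare.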

Because existing abnormal-behavior patterns are trained from real malicious file samples, and that training lacks any generalization step for malicious files, the ability to recognize variants of malicious files is limited. To address this, the embodiment adds specific noise when training the preset classifier so that it recognizes variants well; that is, the preset classifier is trained on malicious file samples together with the added noise.

Specifically, the embodiment trains the preset classifier with the SDA algorithm. An important reason for choosing SDA is that noise can be added artificially during its pre-training stage to improve the system's robustness to noise, and this ability is very effective for recognizing malware variants and evasion. The denoising process is as follows: from an input vector x, a noisy version $\tilde{x}$ is generated by adding noise; $\tilde{x}$ is passed through the encoder and the decoder to generate z; and the error function is defined as the gap between x and z. Minimizing the error function iteratively gives the encoder its ability to resist noise.
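A minimal single-layer sketch of this denoising step, in plain Python, may make the roles of x, its noisy copy and z concrete. Everything specific here is an assumption for illustration rather than the patent's implementation: the masking noise, the sigmoid units, the squared-error loss, the layer sizes, the learning rate and the toy data. A full SDA would stack several such layers.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def corrupt(x, drop_prob, rng):
    """Masking noise (an assumed noise type): zero each component with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


class DenoisingAutoencoder:
    def __init__(self, n_in, n_hidden, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden
        self.V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.c = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(w * v for w, v in zip(row, h)) + ci)
                for row, ci in zip(self.V, self.c)]

    def loss(self, x):
        """Squared-error gap between a clean input x and its reconstruction z."""
        z = self.decode(self.encode(x))
        return 0.5 * sum((zi - xi) ** 2 for zi, xi in zip(z, x))

    def train_step(self, x, x_noisy, lr):
        h = self.encode(x_noisy)          # the encoder sees the noisy copy
        z = self.decode(h)
        # The error is measured against the clean x, not the noisy input.
        dz = [(zi - xi) * zi * (1.0 - zi) for zi, xi in zip(z, x)]
        dh = [sum(dz[i] * self.V[i][j] for i in range(len(z))) for j in range(len(h))]
        da = [dh[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for i in range(len(z)):
            for j in range(len(h)):
                self.V[i][j] -= lr * dz[i] * h[j]
            self.c[i] -= lr * dz[i]
        for j in range(len(h)):
            for i in range(len(x_noisy)):
                self.W[j][i] -= lr * da[j] * x_noisy[i]
            self.b[j] -= lr * da[j]


rng = random.Random(0)
data = [  # toy feature vectors, made up for the example
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
]
dae = DenoisingAutoencoder(n_in=6, n_hidden=3, rng=rng)
loss_before = sum(dae.loss(x) for x in data)
for _ in range(500):
    for x in data:
        dae.train_step(x, corrupt(x, 0.3, rng), lr=0.5)
loss_after = sum(dae.loss(x) for x in data)
print(loss_before, loss_after)
```

The point of the design is visible in `train_step`: the network is fed the corrupted vector but penalized for its distance from the clean one, so it has to learn features that survive the corruption.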

In this embodiment, x is the direct input of the preset classifier, that is, the feature vector extracted from the static and dynamic features of the target file, and $\tilde{x}$ is the noisy feature vector produced in a specific way. The noise is generated by a noise factor and is added at three points: before the malicious file sample enters the network sandbox system, before the static features are extracted, and after the network sandbox system.

As shown in Fig. 2, the noise feature vector $\tilde{x}$ of a malicious file sample is obtained as follows: the malicious file sample, together with first noise added according to the preset noise knowledge base, is put into the network sandbox system and executed to obtain the behavior log of the malicious file sample; the dynamic feature vector of the malicious file sample is obtained from that behavior log together with second noise added according to the preset noise knowledge base; the static feature vector of the malicious file sample is obtained from the malicious file sample together with third noise added according to the preset noise knowledge base; and the noise feature vector $\tilde{x}$ of the malicious file sample is obtained from its static feature vector and dynamic feature vector.
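The three noise injection points can be laid out as a small pipeline. The sketch below is hypothetical throughout: the function and stage names are invented for the example, the sandbox and feature extractors are toy stand-ins, and concatenating the static vector before the dynamic one is an arbitrary choice.

```python
def noisy_feature_vector(sample, add_noise, run_sandbox, dynamic_features, static_features):
    """Build one noisy feature vector from one malicious file sample."""
    # First noise: applied to the sample before it enters the network sandbox.
    log = run_sandbox(add_noise("before_sandbox", sample))
    # Second noise: applied to the behavior log that the sandbox produced.
    dynamic = dynamic_features(add_noise("behavior_log", log))
    # Third noise: applied to the sample before static features are extracted.
    static = static_features(add_noise("before_static", sample))
    return static + dynamic


# Toy stand-ins, made up so that the example runs end to end.
def toy_noise(stage, data):
    if stage == "behavior_log":
        return data + ["benign_event"]      # extra log entry
    return data + b"\x00"                   # one byte of padding


def toy_sandbox(sample):
    return ["exec", f"size:{len(sample)}"]


vector = noisy_feature_vector(
    b"abc",
    add_noise=toy_noise,
    run_sandbox=toy_sandbox,
    dynamic_features=lambda log: [float(len(log))],
    static_features=lambda sample: [float(len(sample))],
)
print(vector)  # [4.0, 3.0]
```

Passing an identity function as `add_noise` gives the noise-free feature vector of the same sample through the same path, which matches the statement that both kinds of vector are obtained by the same process.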

It should be noted that the noise factor adds noise according to predefined rules in the preset noise knowledge base. The rules are a set of actions formulated from the experience of malicious file analysts; they are actions that change the resulting features but do not affect the malicious nature of the software. Simple examples of such rules are "malware is still malware after being compressed or packed" and "a malware sandbox log with normal-software log entries inserted into it is still a malware log"; the embodiment places no specific limit on them. The rules need not be absolutely correct. They only need to be correct with high probability, and the small amount of error they introduce can be eliminated by the subsequent neural network. The function of the noise factor is to select one or more rules from the knowledge base and apply them to the target data. The noise system works only during the training period of the system and plays no role while the classifier is in operation.
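One way to picture the noise factor is as a function that draws one or more rules from a rule table and applies them in turn. The sketch below is an assumption-laden illustration: the knowledge base is a plain dictionary, `insert_benign_entries` follows the log-insertion rule quoted above, `repeat_one_entry` is a hypothetical extra rule added only so that there is something to choose between, and all log entries are made up.

```python
import random


def insert_benign_entries(log, rng):
    """Rule from the text: normal-software entries inserted into a malware log."""
    benign = ["read_file:readme.txt", "query_registry:locale"]  # made-up entries
    out = list(log)
    pos = rng.randrange(len(out) + 1)
    out[pos:pos] = benign
    return out


def repeat_one_entry(log, rng):
    """Hypothetical extra rule: repeating an existing entry keeps the log malicious."""
    if not log:
        return list(log)
    out = list(log)
    pos = rng.randrange(len(out))
    out.insert(pos, out[pos])
    return out


# Hypothetical knowledge base: rules grouped by the kind of data they act on.
KNOWLEDGE_BASE = {"behavior_log": [insert_benign_entries, repeat_one_entry]}


def noise_factor(stage, data, rng, knowledge_base=KNOWLEDGE_BASE):
    """Select one or more rules for this stage and apply them to the data."""
    rules = knowledge_base[stage]
    chosen = rng.sample(rules, rng.randint(1, len(rules)))
    for rule in chosen:
        data = rule(data, rng)
    return data


rng = random.Random(42)
malware_log = ["write_file:C:/tmp/a.dll", "connect:198.51.100.9:443", "create_service:upd"]
variants = [noise_factor("behavior_log", malware_log, rng) for _ in range(3)]
for variant in variants:
    print(variant)
```

Each call yields a different log that still contains every original entry in its original order, so one malicious sample can give rise to several noisy training examples.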

In this embodiment, after the noise feature vectors corresponding to the malicious file samples are obtained, the preset classifier is trained on both the feature vectors of the malicious file samples without added noise and the noise feature vectors of the malicious file samples with added noise. The feature vector of a sample without added noise is obtained in the same way as the noise feature vector of a sample with added noise, so the process is not repeated here. Once the training stage is complete, the parameters of each layer of the classifier are fixed and the preset classifier can enter the operation stage, in which it performs classification. The operation stage is a typical forward-propagation pass through a neural network, and the file content malicious probability of the target file is finally output through a sigmoid function.

It should be explained that adding noise to a malicious file sample can generate multiple noisy feature vectors from a single sample. The reasons for doing so are twofold. First, malicious file samples are harder to obtain than normal file samples, so this approach helps alleviate the imbalance of the training set. Second, unlike malicious files, normal files usually exhibit no evasion behavior, so adding noise to them does little to improve the final recognition result. By introducing the noise system, the embodiment therefore considerably improves how well the preset classifier generalizes to malicious files, and markedly improves the recognition of both variants and evasion.

It should be explained in detail that determining whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file includes: substituting the file content malicious probability of the target file when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file; and determining whether the target file is a malicious file according to the malicious file probability of the target file. That is, the basic logic of the embodiment is based on Bayes' theorem:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s)}$$

This formula calculates the malicious file probability of the target file, where P(m|s) is the probability that, in a specific environment, the file is malicious given that the preset classifier outputs a file content malicious probability of s; P(s|m) is the probability that the preset classifier outputs s when the target file is a malicious file; P(m) is the probability that, in the specific environment, the source of the target file is a malicious source, that is, the file source malicious probability of the target file; and P(s) is the probability that, in the specific environment, the file content malicious probability output by the preset classifier for the target file equals s. Expanding P(s) converts this into the following formula:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s \mid m)\,P(m) + P(s \mid b)\,P(b)}$$

Further:

$$P(m \mid s) = \frac{P(s \mid m)\,P(m)}{P(s \mid m)\,P(m) + P(s \mid b)\,\bigl(1 - P(m)\bigr)}$$

where b denotes a non-malicious file. In the formula above, P(m) is the file source malicious probability of the target file output according to the preset malicious source library. To obtain P(m|s), P(s|m) and P(s|b) must also be known; these two probabilities are estimated by probability density estimation, for which techniques such as the histogram method and the kernel method may be used. P(m|s) is thus obtained as the malicious file probability of the target file. In this way the embodiment fuses the file content probability obtained by the preset classifier with the probability arising from environmental factors, which improves the accuracy of malicious file identification.
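The fusion step can be sketched as Bayes' theorem with P(s) expanded over malicious and non-malicious files, using a kernel method for the two densities. The details below are assumptions made for the example: a Gaussian kernel with a hand-picked bandwidth, invented classifier scores for the two classes, and a fallback to the prior when both densities vanish.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Kernel density estimate built from classifier scores seen for one class."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(s):
        return sum(math.exp(-0.5 * ((s - v) / bandwidth) ** 2) for v in samples) / norm

    return density


def malicious_file_probability(s, p_m, density_m, density_b):
    """P(m|s) = P(s|m)P(m) / (P(s|m)P(m) + P(s|b)(1 - P(m)))."""
    numerator = density_m(s) * p_m
    denominator = numerator + density_b(s) * (1.0 - p_m)
    # Assumed fallback: with no density evidence either way, keep the prior.
    return numerator / denominator if denominator > 0.0 else p_m


# Invented classifier outputs for known malicious and known benign files.
scores_malicious = [0.80, 0.90, 0.95, 0.70]
scores_benign = [0.10, 0.20, 0.05, 0.30]
p_s_given_m = gaussian_kde(scores_malicious, bandwidth=0.1)
p_s_given_b = gaussian_kde(scores_benign, bandwidth=0.1)

s = 0.6  # same classifier output, two different sources
from_trusted_source = malicious_file_probability(s, 0.01, p_s_given_m, p_s_given_b)
from_doubtful_source = malicious_file_probability(s, 0.50, p_s_given_m, p_s_given_b)
print(from_trusted_source, from_doubtful_source)
```

With these made-up numbers the same classifier output of 0.6 ends up below one half for the low-prior source and well above it for the doubtful one, which is the effect the source prior P(m) is meant to have.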

The embodiment is applicable to, but not limited to, the scenario shown in Fig. 3. In this application scenario the input information is divided into two parts: one is the target file itself, in binary form, and the other is the source information of the target file, including its source URL, IP address, e-mail sender, and so on. Analysis of the target file is in turn divided into dynamic analysis and static analysis. Dynamic analysis uses the virtual execution capability of the network sandbox system to analyze the behavior of the target file at run time and obtains the dynamic feature vector of the target file from the analysis result, whereas static analysis extracts the static feature vector directly from the binary data of the target file. The feature vectors extracted by dynamic analysis and static analysis are input to the preset classifier for classification, which yields the file content malicious probability of the target file. The file source malicious probability of the target file is determined by matching the source information of the target file against the malicious source data in the preset malicious source library. Finally, the file content malicious probability and the file source malicious probability of the target file are combined through the Bayesian calculation to give the final malicious probability of the target file.
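The source-matching step can be illustrated with a simple lookup. The text does not say how the preset malicious source library is organized or how several matches are combined, so the dictionary layout, the probabilities, the default value and the "take the highest match" policy below are all assumptions made for the example; the addresses and domains are reserved documentation values.

```python
from urllib.parse import urlparse

# Hypothetical preset malicious source library: (kind, value) -> probability.
MALICIOUS_SOURCES = {
    ("domain", "bad.example"): 0.9,
    ("ip", "203.0.113.7"): 0.8,
    ("sender", "spam@bad.example"): 0.7,
}
DEFAULT_SOURCE_PROBABILITY = 0.01  # assumed value for sources with no match


def file_source_malicious_probability(url=None, ip=None, sender=None,
                                      library=MALICIOUS_SOURCES,
                                      default=DEFAULT_SOURCE_PROBABILITY):
    """Match the source information of a target file against the library."""
    keys = []
    if url:
        host = urlparse(url).hostname  # lower-cased host name, or None
        if host:
            keys.append(("domain", host))
    if ip:
        keys.append(("ip", ip))
    if sender:
        keys.append(("sender", sender.lower()))
    matches = [library[key] for key in keys if key in library]
    # Assumed policy: the most suspicious matching entry decides.
    return max(matches) if matches else default


print(file_source_malicious_probability(url="http://bad.example/setup.exe"))
print(file_source_malicious_probability(url="https://intranet.example/tool.exe"))
```

The value returned here plays the role of P(m) in the Bayesian calculation, so a file from an unlisted source starts from a low prior and a file from a listed one starts from a high prior.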

In this application scenario, the target file is analyzed with both dynamic analysis and static analysis, mainly because the two methods are complementary to some extent: malicious programs that are obfuscated or packed escape static inspection more easily, while malicious programs that lie dormant for a long time or have strict trigger conditions escape dynamic analysis more easily, and the combination of the two methods gives better detection results in practice. In addition, the probability output by the preset classifier is derived without considering the source of the file, so the information it carries is incomplete, and in this application scenario it serves only as an intermediate result. Whether the target file is a malicious file is determined from this intermediate result together with the probability provided by the preset malicious source library, which improves the accuracy of malicious file identification.

Further, the embodiment provides an apparatus for identifying a malicious file. As shown in Fig. 4, the apparatus includes: an acquiring unit 21, a computing unit 22, and a recognition unit 23.

The acquiring unit 21 is configured to obtain the dynamic feature vector and the static feature vector of a target file.

The acquiring unit 21 obtains the dynamic feature vector and the static feature vector from the target file in binary form. Analysis of the target file is divided into dynamic analysis and static analysis. Dynamic analysis uses the virtual execution capability of the sandbox system to analyze the behavior of the target file at run time and obtains the dynamic feature vector of the target file from the analysis result, whereas static analysis extracts features directly from the binary data of the target file, analyzes them, and obtains the static feature vector of the target file from them.

In this embodiment, the acquiring unit 21 obtains the dynamic feature vector and the static feature vector of the target file. The target file is analyzed with both dynamic analysis and static analysis, mainly because the two methods are complementary to some extent: malicious files that are obfuscated or packed escape static inspection more easily, while malicious files that lie dormant for a long time or have strict trigger conditions escape dynamic analysis more easily, and the combination of the two methods gives better detection results in practice.

The computing unit 22 is configured to input the dynamic feature vector and the static feature vector of the target file into a preset classifier and calculate the file content malicious probability of the target file.

Through the preset classifier, the computing unit 22 first combines the dynamic feature vector and the static feature vector of the target file into a higher-dimensional feature vector, which is then input into a deep neural network for classification to obtain the file content malicious probability of the target file; this probability represents the probability that the target file contains malicious content. The preset classifier is trained with SDA (Stacked Denoising Autoencoder, in which every layer is a denoising autoencoder). SDA is a kind of deep neural network; its structure here is a multilayer auto-encoder neural network followed by a multilayer fully connected network with dropout, and the final output layer outputs the file content malicious probability of the target file through a sigmoid function.
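The forward pass of such a classifier can be sketched as follows. This is an illustration only: the weights are random rather than trained, the layer sizes and the use of sigmoid activations in the hidden layers are assumptions, and the helper names are hypothetical. Dropout does not appear because, in its usual formulation, it is applied only while training.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dense(x, layer):
    """One fully connected layer with a sigmoid activation (an assumed choice)."""
    weights, biases = layer
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Untrained stand-in for a layer whose parameters training would fix."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def file_content_malicious_probability(dynamic_vec, static_vec,
                                       encoder_layers, fc_layers, output_layer):
    # Combine the two vectors into one higher-dimensional feature vector.
    x = list(dynamic_vec) + list(static_vec)
    for layer in encoder_layers:   # stacked auto-encoder (encoder halves only)
        x = dense(x, layer)
    for layer in fc_layers:        # fully connected part; no units dropped here
        x = dense(x, layer)
    return dense(x, output_layer)[0]  # single sigmoid output unit


rng = random.Random(7)
dynamic_vec = [0.0, 1.0, 1.0, 0.0]
static_vec = [1.0, 0.0, 0.5]
encoder_layers = [random_layer(7, 5, rng), random_layer(5, 3, rng)]
fc_layers = [random_layer(3, 3, rng)]
output_layer = random_layer(3, 1, rng)
p = file_content_malicious_probability(dynamic_vec, static_vec,
                                       encoder_layers, fc_layers, output_layer)
print(p)
```

Because the last unit is a sigmoid, the output always lies strictly between 0 and 1 and can be read as the file content malicious probability that is later fused with the source probability.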

The recognition unit 23 is configured to identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the file source malicious probability of the target file being determined according to the source information of the target file.

The embodiment provides an apparatus for identifying a malicious file. The apparatus first obtains the dynamic feature vector and the static feature vector of a target file, then inputs them into a preset classifier to obtain the file content malicious probability of the target file, and finally identifies whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file. Because the file content malicious probability output by the preset classifier is derived without considering the source of the target file, the information it carries is incomplete. The embodiment therefore treats the file content malicious probability only as an intermediate result and identifies whether the target file is a malicious file from this intermediate result together with the file source malicious probability of the target file, which improves the accuracy of malicious file identification.

Further, as shown in Fig. 5, the apparatus also includes the following.

The acquiring unit 21 is further configured to obtain the source information of the target file.

A determining unit 24 is configured to determine the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source library.

To overcome the limited recognition capability of a single-virtual-machine sandbox, the embodiment obtains the dynamic feature vector of the target file through the network sandbox system. As shown in Fig. 5, the acquiring unit 21 includes: an execution module 211, configured to put the target file into the network sandbox system for execution and obtain the behavior log of the target file, the network sandbox system consisting of a virtual switched network formed by a group of virtual machines; and an acquisition module 212, configured to obtain the dynamic feature vector of the target file from the behavior log.

Because existing abnormal-behavior patterns are trained from real malicious file samples, and that training lacks any generalization step for malicious files, the ability to recognize variants of malicious files is limited. To address this, the embodiment adds specific noise when training the preset classifier so that it recognizes variants well; that is, the preset classifier is trained by a training unit 25. The training unit 25 is configured to train the preset classifier with malicious file samples and added noise.

Specifically, as shown in Fig. 5, the training unit 25 includes:

An acquisition module 251, configured to put the malicious file sample, together with first noise added according to the preset noise knowledge base, into the network sandbox system for execution and obtain the behavior log of the malicious file sample;

The acquisition module 251 is further configured to obtain the dynamic feature vector of the malicious file sample from the behavior log of the malicious file sample together with second noise added according to the preset noise knowledge base;

The acquisition module 251 is further configured to obtain the static feature vector of the malicious file sample from the malicious file sample together with third noise added according to the preset noise knowledge base;

The acquisition module 251 is further configured to obtain the noise feature vector of the malicious file sample from the static feature vector and the dynamic feature vector of the malicious file sample;

A training module 252, configured to train the preset classifier with the feature vectors of the malicious file samples without added noise and the noise feature vectors of the malicious file samples with added noise.

Specifically, as shown in Fig. 5, the determining unit 24 includes:

A computing module 241, configured to substitute the file content malicious probability of the target file when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file;

A determining module 242, configured to determine whether the target file is a malicious file according to the malicious file probability of the target file.

Further：

In the embodiments above, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, refer to the related description of the other embodiments.

It can be understood that related features of the method and the apparatus above may be referred to each other. In addition, "first", "second", and the like in the embodiments above are used to distinguish the embodiments and do not indicate that one embodiment is better than another.

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, apparatus, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

Numerous specific details are set forth in the specification provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment may be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components of the malicious file identification method and apparatus according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the embodiments above illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims

1. A method for identifying a malicious file, characterized by comprising:

inputting the dynamic feature vector and the static feature vector of the target file into a preset classifier, and calculating the file content malicious probability of the target file;

identifying whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the file source malicious probability of the target file being determined according to the source information of the target file.

2. The method according to claim 1, characterized in that, before determining whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the method further comprises:

obtaining the source information of the target file;

determining the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source library.

3. The method according to claim 1, characterized in that obtaining the dynamic feature vector of the target file comprises:

putting the target file into a network sandbox system for execution, and obtaining the behavior log of the target file, the network sandbox system consisting of a virtual switched network formed by a group of virtual machines;

4. The method according to claim 1, characterized in that the method further comprises:

5. The method according to claim 4, characterized in that training the preset classifier with malicious file samples and added noise comprises:

putting the malicious file sample, together with first noise added according to a preset noise knowledge base, into the network sandbox system for execution, and obtaining the behavior log of the malicious file sample;

obtaining the dynamic feature vector of the malicious file sample from the behavior log of the malicious file sample together with second noise added according to the preset noise knowledge base;

obtaining the static feature vector of the malicious file sample from the malicious file sample together with third noise added according to the preset noise knowledge base;

obtaining the noise feature vector of the malicious file sample from the static feature vector and the dynamic feature vector of the malicious file sample;

training the preset classifier with the feature vectors of the malicious file samples without added noise and the noise feature vectors of the malicious file samples with added noise.

6. The method according to claim 1, characterized in that determining whether the target file is a malicious file according to the file content malicious probability and the file source malicious probability of the target file comprises:

substituting the file content malicious probability of the target file when the target file is a malicious file, the file content malicious probability of the target file, and the file source malicious probability of the target file into the Bayesian formula to calculate the malicious file probability of the target file;

7. An apparatus for identifying a malicious file, characterized by comprising:

a computing unit, configured to input the dynamic feature vector and the static feature vector of the target file into a preset classifier and obtain the file content malicious probability of the target file;

a recognition unit, configured to identify whether the target file is a malicious file according to the file content malicious probability of the target file and the file source malicious probability of the target file, the file source malicious probability of the target file being determined according to the source information of the target file.

8. The apparatus according to claim 7, characterized in that the apparatus further comprises:

a determining unit, configured to determine the file source malicious probability of the target file by matching the source information of the target file against the malicious source data in a preset malicious source library.

9. The apparatus according to claim 7, characterized in that the acquiring unit comprises:

an execution module, configured to put the target file into a network sandbox system for execution and obtain the behavior log of the target file, the network sandbox system consisting of a virtual switched network formed by a group of virtual machines;

10. The apparatus according to claim 7, characterized in that the apparatus further comprises: