CN110427755A

CN110427755A - Method and device for identifying script files

Info

Publication number: CN110427755A
Application number: CN201811202216.0A
Authority: CN
Inventors: 顾成杰
Original assignee: New H3C Security Technologies Co Ltd
Current assignee: New H3C Security Technologies Co Ltd
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2019-11-08

Abstract

This application provides a method and device for identifying script files and relates to the technical field of network security. The method includes: obtaining a plurality of labeled sample script files; converting each sample script file into a machine instruction sequence; extracting feature word combinations from the machine instruction sequence of each sample script file to obtain a first feature set of the sample script file; for the first feature set, calculating, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set to obtain a feature vector of the sample script file; training a script identification model according to the feature vector and the label of each sample script file; and, when a script file to be identified is obtained, identifying it with the script identification model to determine whether it is a malicious script file. The application can improve the accuracy of identifying webshell files.

Description

A method and device for identifying script files

Technical field

This application relates to the technical field of network security, and in particular to a method and device for identifying script files.

Background

A webshell is a command execution environment that exists in the form of a web page script file, such as ASP (Active Server Pages), PHP (Hypertext Preprocessor), JSP (Java Server Pages), Python, or CGI (Common Gateway Interface); it may also be called a web page backdoor. Intruders typically use a webshell to obtain operating permissions on a website server. After invading the server of a website, a hacker usually mixes the webshell file with the normal web page files under the web directory of the website server and then accesses the webshell file with a browser to obtain a webshell command execution environment. The hacker thereby obtains a certain degree of operating permission on the server and achieves the goal of controlling it. Therefore, to maintain the security of the website server, webshell files need to be detected and removed in time.

In the prior art, webshell files are detected as follows. A technician presets a feature library containing multiple features for identifying malicious script files (i.e., webshell files), such as dangerous functions, dangerous file suffixes, sensitive file names, and content keywords. The server matches the file content of the script file to be identified against the feature items in the feature library and then determines from the matching result whether the script file is a malicious script. For example, if the number of feature items that match the file content is greater than a preset threshold, the script file is determined to be a malicious script; alternatively, if the number of times a particular feature item is matched exceeds a preset threshold, the script file is determined to be a malicious script.
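The prior-art matching procedure described above can be sketched as follows. This is only an illustration: the feature library entries, the function names, and both threshold values are hypothetical and do not come from any particular product.

```python
import re

# Hypothetical feature library: one regular expression per feature item.
FEATURE_LIBRARY = {
    "dangerous_function": re.compile(r"\b(?:eval|assert|system|shell_exec)\s*\("),
    "sensitive_keyword": re.compile(r"passthru|base64_decode"),
}


def count_matches(content):
    """Return how many times each feature item matches the file content."""
    return {name: len(pattern.findall(content))
            for name, pattern in FEATURE_LIBRARY.items()}


def is_malicious(content, item_threshold=1, hit_threshold=3):
    """Flag the file if more than item_threshold distinct items match,
    or if any single item matches more than hit_threshold times."""
    hits = count_matches(content)
    matched_items = sum(1 for n in hits.values() if n > 0)
    return matched_items > item_threshold or any(
        n > hit_threshold for n in hits.values())
```

A plain `eval(base64_decode(...))` script is flagged by this sketch, whereas a script that spells a dangerous function as an escaped string literal such as `"ass\x65rt"` matches no feature item, which illustrates the weakness of fixed feature libraries.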

However, the feature library in the prior art is set in advance by technicians in a uniform manner; the features it contains are neither comprehensive nor very effective, so the accuracy of existing webshell file detection is low.

Summary of the invention

The purpose of the embodiments of this application is to provide a method and device for identifying script files that can improve the accuracy of webshell file detection. The specific technical solutions are as follows:

In a first aspect, a method for identifying script files is provided, the method comprising:

obtaining a plurality of labeled sample script files, where each label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;

converting each sample script file into a machine instruction sequence;

for each sample script file, extracting feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;

for the first feature set, calculating, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set, and determining a feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set;

training a script identification model, based on a machine learning algorithm, according to the feature vector of each sample script file and the label of each sample script file;

when a script file to be identified is obtained, identifying the script file to be identified using the script identification model, and determining whether the script file to be identified is a malicious script file.

Optionally, calculating, according to the preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set comprises:

for each feature word combination in the first feature set, determining a first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations contained in the sample script file;

determining, in a preset corpus, a first number of script files that contain the feature word combination, and determining a second ratio of the total number of script files contained in the corpus to the first number, where the corpus contains the machine instruction sequences of a plurality of non-malicious script files and the machine instruction sequences of a plurality of malicious script files;

calculating the word frequency feature of the feature word combination according to the first ratio and the second ratio.

Optionally, calculating the word frequency feature of the feature word combination according to the first ratio and the second ratio comprises:

determining the first ratio using the following formula:

TF_w = n_w / N;

where n_w is the number of occurrences of feature word combination w in the sample script file, and N is the total number of feature word combinations contained in the sample script file;

determining, in the preset corpus, the first number of script files that contain feature word combination w, and determining the second ratio of the total number of script files contained in the corpus to the first number;

determining the word frequency feature of feature word combination w using the following formula:

(TF-IDF)_w = TF_w * IDF_w;

where (TF-IDF)_w is the word frequency feature of feature word combination w, and TF_w indicates the probability of occurrence of feature word combination w in the sample script file.

Optionally, the method further comprises:

for each sample script file, extracting basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

determining the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set comprises:

forming the feature vector of the sample script file from the basic features of the sample script file and the word frequency feature of each feature word combination in the first feature set.
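The text names the four basic features but does not define how they are computed. The sketch below assumes common definitions: byte-level Shannon entropy, the longest run of alphanumeric or underscore characters, the byte-level index of coincidence, and the ratio of zlib-compressed size to original size. All function names are hypothetical.

```python
import math
import re
import zlib
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (assumed definition)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


def longest_word_length(text: str) -> int:
    """Length of the longest run of letters, digits, or underscores (assumed definition)."""
    return max((len(w) for w in re.findall(r"[A-Za-z0-9_]+", text)), default=0)


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal (assumed definition)."""
    n = len(data)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (n * (n - 1))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib (assumed definition)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def basic_features(text: str) -> list:
    """Collect the four basic features of a script's text into a list."""
    data = text.encode("utf-8")
    return [information_entropy(data), longest_word_length(text),
            index_of_coincidence(data), compression_ratio(data)]
```

These values could then be placed alongside the word frequency features when the feature vector is formed.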

Optionally, determining the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set comprises:

forming the feature vector of the sample script file from the word frequency feature of each feature word combination in the first feature set.

In a second aspect, a method for identifying script files is provided, the method comprising:

obtaining a first script file;

converting the first script file into a machine instruction sequence;

extracting feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;

calculating, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file, and determining a feature vector of the first script file according to those word frequency features;

inputting the feature vector of the first script file into a script identification model to obtain an identification result for the first script file, where the script identification model is trained from the feature vectors of sample script files using a machine learning algorithm, and the feature vector of each sample script file is determined according to the word frequency features of that sample script file.

Optionally, calculating, according to the preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file comprises:

for each feature word combination in the first feature set of the first script file, determining a first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations contained in the first script file;

Optionally, the method further comprises:

extracting basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

determining the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file comprises:

forming the feature vector of the first script file from the basic features of the first script file and the word frequency feature of each feature word combination in the first feature set of the first script file.

Optionally, determining the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file comprises:

forming the feature vector of the first script file from the word frequency feature of each feature word combination in the first feature set of the first script file.

In a third aspect, a device for identifying script files is provided, the device including:

an obtaining module, configured to obtain a plurality of labeled sample script files, where each label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;

a conversion module, configured to convert each sample script file into a machine instruction sequence;

a first extraction module, configured to extract, for each sample script file, feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;

a first determining module, configured to calculate, for the first feature set and according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set, and to determine a feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set;

a training module, configured to train a script identification model, based on a machine learning algorithm, according to the feature vector of each sample script file and the label of each sample script file;

a second determining module, configured to identify, when a script file to be identified is obtained, the script file to be identified using the script identification model, and to determine whether the script file to be identified is a malicious script file.

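Optionally, the first determining module is specifically configured to:

determine the first ratio using the following formula:

TF_w = n_w / N;

determine the word frequency feature of feature word combination w using the following formula:

(TF-IDF)_w = TF_w * IDF_w;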

Optionally, the device further includes:

a second extraction module, configured to extract, for each sample script file, basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

the first determining module is specifically configured to:

Optionally, the first determining module is specifically configured to:

In a fourth aspect, a device for identifying script files is provided, the device including:

an obtaining module, configured to obtain a first script file;

a conversion module, configured to convert the first script file into a machine instruction sequence;

a first extraction module, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;

a determining module, configured to calculate, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file, and to determine a feature vector of the first script file according to those word frequency features;

an input module, configured to input the feature vector of the first script file into a script identification model to obtain an identification result for the first script file, where the script identification model is trained from the feature vectors of sample script files using a machine learning algorithm, and the feature vector of each sample script file is determined according to the word frequency features of that sample script file.

Optionally, the determining module is specifically configured to:

Optionally, the device further includes:

a second extraction module, configured to extract basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

the determining module is specifically configured to:

Optionally, the determining module is specifically configured to:

In a fifth aspect, a network device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the steps of the method for identifying script files described in the first aspect or the second aspect when executing the program stored in the memory.

In a sixth aspect, a machine-readable storage medium is provided that stores machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the steps of the method for identifying script files described in the first aspect or the second aspect.

In a seventh aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to execute the method for identifying script files described in the first aspect or the second aspect.

In the embodiments of this application, a sample script file is converted into a machine instruction sequence that a machine can understand, the word frequency features of that machine instruction sequence are analyzed to generate the feature vector of the sample script file, and a script identification model is then trained from the feature vectors of the sample script files using a machine learning algorithm. After a script file has been encrypted or has had its calls hidden, its source code may change considerably and show no regularity, but the word frequency features of the machine instruction sequence it converts to still exhibit a certain regularity. On this basis, malicious and non-malicious script files can be distinguished by their word frequency features. In this solution, feature vectors are generated from the word frequency features of machine instruction sequences and the script identification model is then trained on them, so the trained model can identify webshells that use encryption and hidden calls. This compensates for the shortcomings of feature library detection and offers strong adaptability and a high recognition rate. In addition, identifying script files with the script identification model removes the traditional dependence on a feature library (technicians do not need to set one), has low computational complexity and higher identification accuracy, and meets the engineering needs of current webshell detection. Of course, a product or method implementing this application does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for identifying script files provided by an embodiment of this application;

Fig. 2 is an exemplary flow chart of a method for training a script identification model provided by an embodiment of this application;

Fig. 3 is a flow chart of a method for identifying script files provided by an embodiment of this application;

Fig. 4 is an exemplary flow chart of a method for identifying script files provided by an embodiment of this application;

Fig. 5 is a schematic structural diagram of a device for identifying script files provided by an embodiment of this application;

Fig. 6 is a schematic structural diagram of a device for identifying script files provided by an embodiment of this application;

Fig. 7 is a schematic structural diagram of a device for identifying script files provided by an embodiment of this application;

Fig. 8 is a schematic structural diagram of a device for identifying script files provided by an embodiment of this application;

Fig. 9 is a schematic structural diagram of a network device provided by an embodiment of this application.

Detailed description of the embodiments

The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.

An embodiment of this application provides a method for identifying script files. The method can be applied to a network device, which may be the background server of a website or a security device of a website. The network device can be used to identify webshell files, that is, malicious script files.

As shown in Fig. 1, the processing flow of the method may be as follows.

Step 101: obtain a plurality of labeled sample script files.

Each label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request.

In this embodiment, the network device can obtain a plurality of sample script files entered by a technician or collected by a collection device. The sample script files may include non-malicious script files and malicious script files. The malicious script file samples may include script files of large trojans, script files of small trojans, script files of one-sentence trojans, and so on. For each sample script file, the technician can mark its label, which may be a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request, so that non-malicious script files and malicious script files can be distinguished.

Step 102: convert each sample script file into a machine instruction sequence.

In this embodiment, considering that many webshell files use encryption and hidden function calls, the network device can compile each sample script file into operation codes that a machine can understand (i.e., a machine instruction sequence). A file compilation algorithm can be stored in the network device in advance; this may be any file compilation algorithm from the prior art. Because different types of script files must be converted with different file compilation algorithms, the network device can also store a correspondence between file types and file compilation algorithms.

For each sample script file, after obtaining the file, the network device can identify its file type, determine the file compilation algorithm corresponding to that file type from the pre-stored correspondence between file types and file compilation algorithms, and then convert the sample script file into a machine instruction sequence using the determined algorithm. For example, if the sample script file is a PHP file, VLD (Vulcan Logic Dumper, an extension that hooks into the Zend engine to output the intermediate code (execution units) generated from a PHP script) is used to convert the PHP file into a machine instruction sequence of the Opcode type. As another example, if the sample script file is a JSP file, JSPC (the Java Server Pages compiler) is used to convert the JSP file into a machine instruction sequence of the Bytecode type. In this way, the code written by a technician is translated into instructions that a machine can understand (i.e., a machine instruction sequence).
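The stored correspondence between file types and file compilation algorithms can be pictured as a simple lookup table. The sketch below is an assumption-laden illustration: it identifies the file type by file extension, and the table keys, the tool labels, and the function names are all hypothetical. It only selects a converter and does not invoke VLD or JSPC.

```python
from pathlib import Path

# Hypothetical correspondence between file type and conversion tool.
COMPILER_BY_TYPE = {
    ".php": "vld",   # PHP script -> Opcode sequence
    ".jsp": "jspc",  # JSP page -> Bytecode sequence
}


def identify_file_type(path):
    """Identify the file type from the file extension (an assumption of this sketch)."""
    return Path(path).suffix.lower()


def select_compiler(path):
    """Look up the conversion tool registered for the file's type."""
    file_type = identify_file_type(path)
    try:
        return COMPILER_BY_TYPE[file_type]
    except KeyError:
        raise ValueError(f"no conversion tool registered for {file_type!r}")
```

Keeping the correspondence in a table means that support for a new script type only requires a new entry, not a change to the lookup logic.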

For example, the code in the original file of a certain webshell is:

<?php

$new_array = array_map("ass\x65rt", (array)$_REQUEST['op']);

?>

After conversion, its Opcode sequence is: SEND_VAL FETCH_R FETCH_DIM_R CAST SEND_VAL DO_FCALL ASSIGN RETURN.

Step 103: for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain the first feature set of the sample script file.

In this embodiment, the feature extraction rule can be stored in the network device in advance. For example, an N-gram phrase extraction algorithm can be used to obtain an N-gram set (i.e., a feature set). Based on this feature extraction rule, word combinations each consisting of a preset number (N) of consecutive words can be extracted from the machine instruction sequence.

The value of N can be set arbitrarily, for example 2, 4, 6, or 8; this embodiment places no limit on it. In addition, feature extraction can be performed using one or more feature extraction rules, and accordingly there can be one or more first feature sets.

Take the machine instruction sequence SEND_VAL FETCH_R FETCH_DIM_R as an example, where the character "_" and the space character are the separators between two words. Extraction with 2-grams yields the feature word combinations SEND VAL, VAL FETCH, FETCH R, R FETCH, FETCH DIM, and DIM R; extraction with 3-grams yields the feature word combinations SEND VAL FETCH, VAL FETCH R, FETCH R FETCH, R FETCH DIM, and FETCH DIM R.
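The N-gram extraction in this example can be reproduced with a few lines of Python. The function names are hypothetical; the splitting rule (underscore and whitespace as separators) follows the example above.

```python
import re


def tokenize(sequence):
    """Split a machine instruction sequence into words on '_' and whitespace."""
    return [word for word in re.split(r"[_\s]+", sequence) if word]


def ngrams(words, n):
    """Return every run of n consecutive words, joined by a single space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = tokenize("SEND_VAL FETCH_R FETCH_DIM_R")
two_grams = ngrams(words, 2)    # six feature word combinations
three_grams = ngrams(words, 3)  # five feature word combinations
```

Because the sequence contains seven words, sliding a window of size N over it yields 7 - N + 1 combinations.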

For each sample script file, the network device can extract feature word combinations from the machine instruction sequence of the sample script file using the preset feature extraction rule and then deduplicate the extracted feature word combinations to obtain the first feature set of the sample script file.

When there are multiple feature extraction rules, after converting the sample script file into a machine instruction sequence, the network device can extract feature word combinations from the machine instruction sequence according to each of the feature extraction rules, obtaining multiple first feature sets of the sample script file.
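Deduplication and the use of several extraction rules can be sketched together as follows. The function name and the choice of N values (2 and 3) are assumptions made for illustration; each N is treated as one feature extraction rule.

```python
import re


def extract_feature_sets(sequence, sizes=(2, 3)):
    """Build one deduplicated first feature set per N-gram size."""
    words = [word for word in re.split(r"[_\s]+", sequence) if word]
    feature_sets = {}
    for n in sizes:
        combos = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        feature_sets[n] = list(dict.fromkeys(combos))
    return feature_sets
```

For a repetitive sequence such as `SEND_VAL SEND_VAL SEND_VAL`, the 2-gram rule produces five combinations, of which only two distinct ones remain after deduplication.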

Step 104: for the first feature set of the sample script file, calculate, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set, and determine the feature vector of the sample script file according to those word frequency features.

In this embodiment, a word frequency feature algorithm can be stored in the network device in advance. It can be used to calculate the word frequency feature of a word or of a feature word combination. This embodiment uses TF-IDF (term frequency-inverse document frequency) to calculate word frequency features; the specific calculation is described in detail later.

For each sample script file, after obtaining the first feature set of the sample script file, the network device can calculate, for each feature word combination in that set, the corresponding word frequency feature according to the preset word frequency feature algorithm. In this way, the network device determines the word frequency feature corresponding to each feature word combination in the first feature set and obtains the word frequency feature set of the sample script file. The network device can then determine the feature vector of the sample script file from its word frequency feature set. When a sample script file has multiple first feature sets, the network device can correspondingly determine multiple word frequency feature sets from them and then determine the feature vector of the sample script file from the multiple word frequency feature sets; the specific processing is described in detail later.

In this way, extracting multiple feature sets with multiple feature extraction rules to form the feature vector of a script file can make script file identification more robust and improve identification accuracy.
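One way to turn several word frequency feature sets into a single fixed-length feature vector is sketched below. The text does not say how vector positions are assigned, so this sketch assumes a fixed, ordered vocabulary per feature set (for instance, collected from the training samples) and fills in 0.0 for combinations that a file does not contain. The function name and parameters are hypothetical.

```python
def build_feature_vector(word_freq_sets, vocabularies):
    """Concatenate word frequency features in vocabulary order.

    word_freq_sets: list of dicts mapping feature word combination -> word
                    frequency feature, one dict per feature extraction rule.
    vocabularies:   list of ordered lists of feature word combinations, one
                    per rule (an assumption of this sketch).
    """
    vector = []
    for features, vocabulary in zip(word_freq_sets, vocabularies):
        vector.extend(features.get(combo, 0.0) for combo in vocabulary)
    return vector
```

With a shared vocabulary, every script file maps to a vector of the same length, which is what most machine learning algorithms expect as input.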

Optionally, the specific process of calculating word frequency features using TF-IDF is as follows.

Step 1: for each feature word combination in the first feature set of the sample script file, determine a first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations contained in the sample script file.

In this embodiment, for each sample script file, after obtaining the first feature set of the sample script file, the network device can count, for each feature word combination in that set, the number of occurrences of the feature word combination in the sample script file. In addition, the network device can count the total number of feature word combinations contained in the sample script file and then divide the number of occurrences of the feature word combination by that total to obtain the first ratio. The specific calculation formula is as follows:

TF_w = n_w / N

where n_w is the number of occurrences of feature word combination w in the sample script file, and N is the total number of feature word combinations contained in the sample script file.

Take the machine instruction sequence SEND_VAL FETCH_R FETCH_DIM_R as an example. Extraction with 2-grams yields the feature word combinations SEND VAL, VAL FETCH, FETCH R, R FETCH, FETCH DIM, and DIM R. The feature word combination SEND VAL occurs once, and the sample script file contains 6 feature word combinations in total, so the first ratio is 1/6.
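The first ratio in this example can be checked with a short sketch. The function name is hypothetical, and the sketch assumes that the total N counts every extracted combination, including repeats; exact fractions are used so that the 1/6 result is not blurred by floating-point rounding.

```python
import re
from collections import Counter
from fractions import Fraction


def first_ratios(sequence, n=2):
    """Return the first ratio (occurrences / total) of each feature word combination."""
    words = [word for word in re.split(r"[_\s]+", sequence) if word]
    combos = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    total = len(combos)  # total number of extracted combinations
    return {combo: Fraction(count, total)
            for combo, count in Counter(combos).items()}


ratios = first_ratios("SEND_VAL FETCH_R FETCH_DIM_R")
```

Under this counting convention the first ratios of all combinations in a file sum to 1.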

Step 2: in a preset corpus, determine the first number of script files that contain the feature word combination, and determine the second ratio of the total number of script files contained in the corpus to the first number.

The corpus contains the machine instruction sequences of a plurality of non-malicious script files and the machine instruction sequences of a plurality of malicious script files.

In this embodiment, the machine instruction sequences of non-malicious script files and those of malicious script files can differ in some obvious ways, so this embodiment builds its feature engineering on a large number of positive and negative sample machine instruction sequences. A corpus can be set up in the network device in advance; it may contain the machine instruction sequences of a plurality of non-malicious script files and the machine instruction sequences of a plurality of malicious script files. Considering that malicious script files generally cannot be identified from a single kind of machine instruction sequence, this application takes several kinds into account together: script files of large trojans (generally multi-function trojans with a large amount of code, complete functionality, and an interactive page), script files of small trojans (generally small trojans with a moderate amount of code, a single function, and a simple interactive page, which perform a single operation such as uploading files, packing data, or dumping a database to achieve a specific purpose), script files of one-sentence trojans (usually trojans consisting of one or a few lines of code, which can be combined with other tools to take control of a target host), and so on. That is, the machine instruction sequences of malicious script files in the corpus include the machine instruction sequences of large trojan script files, of small trojan script files, of one-sentence trojan script files, and so on.

For each feature word combination in the first feature set of the sample script file, the network device can search the corpus for matches of the feature word combination to determine the script files whose machine instruction sequences contain it and then count the number of such script files (i.e., the first number). In addition, the network device can count, in real time, the total number of script files currently contained in the corpus and then calculate the ratio of that total to the first number (i.e., the second ratio).

Step 3 calculates the words-frequency feature of the specific word combination according to the first ratio and the second ratio.

It in the embodiment of the present application, can be according to the first ratio after the network equipment calculates the first ratio and the second ratio With the second ratio, calculates the specific word and combine corresponding words-frequency feature.Specific calculation formula can be such that

(TF-IDF)_w=TF_w*IDF_w

Here, (TF-IDF)_w is the word frequency feature of feature word combination w, and TF_w is the probability of occurrence of feature word combination w in the sample script file (the first ratio); IDF_w = log(N₁/(N₂+1)), where N₁ is the total number of script files in the corpus and N₂ is the first number, that is, the number of script files that contain feature word combination w.

IDF_w indicates how well feature word combination w distinguishes between files: the fewer script files contain feature word combination w, the larger IDF_w is; conversely, the more script files contain it, the smaller IDF_w is.

Within a given script file (such as the sample script file), if a feature word combination has a high probability of occurrence in that script file (for example, above a preset threshold, or the highest among all feature word combinations) and a low probability of occurrence across the whole corpus (for example, below a preset threshold), a large TF-IDF value is calculated for that feature word combination. TF-IDF therefore tends to filter out common words and retain important ones. Note that adding 1 to the denominator in the IDF_w formula above is optional; its purpose is to keep the denominator from being 0, and another positive number, such as 2 or 1/3, can be added instead.
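As an illustration only, the corpus lookup of step 2 and the calculation of step 3 can be sketched in Python as follows. The function names, the use of the natural logarithm, and the placeholder instruction names in the toy corpus are assumptions made for this sketch; the embodiment does not specify them.

```python
import math


def count_files_containing(combo, corpus):
    """First number: how many instruction sequences in the corpus contain the n-gram."""
    n = len(combo)
    return sum(
        any(tuple(seq[i:i + n]) == combo for i in range(len(seq) - n + 1))
        for seq in corpus
    )


def inverse_document_frequency(total_files, files_with_combo, smoothing=1.0):
    """IDF_w = log(N1 / (N2 + smoothing)); the log base is an assumption."""
    return math.log(total_files / (files_with_combo + smoothing))


def tf_idf(tf, total_files, files_with_combo, smoothing=1.0):
    """(TF-IDF)_w = TF_w * IDF_w, with TF_w supplied as the first ratio."""
    return tf * inverse_document_frequency(total_files, files_with_combo, smoothing)


# Hypothetical corpus: each entry is the instruction sequence of one script file.
corpus = [["A", "B", "C"], ["B", "A"], ["C", "A", "B"], ["C"]]
first_number = count_files_containing(("A", "B"), corpus)
score = tf_idf(0.5, len(corpus), first_number)
```

With the smoothing term in the denominator, a feature word combination that appears in every file of the corpus gets a slightly negative IDF, which is consistent with treating it as uninformative.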

In this embodiment, the network device can determine the feature vector of the sample script file from the word frequency features of the feature word combinations in the first feature set of the sample script file. This can be implemented in various ways; the embodiment provides two feasible implementations, as follows.

Mode one: the network device forms the feature vector of the sample script file from the word frequency features of the feature word combinations in the first feature set of the sample script file.

In this embodiment, the network device can form the feature vector of the sample script file from the word frequency features of the feature word combinations in its first feature set. For example, if the first feature set of the sample script file includes feature word combination 1, feature word combination 2, and feature word combination 3, with word frequency features (TF-IDF)₁, (TF-IDF)₂, and (TF-IDF)₃, the feature vector of the sample script file is ((TF-IDF)₁, (TF-IDF)₂, (TF-IDF)₃). When the sample script file has multiple first feature sets, the network device obtains multiple word frequency feature sets and can use each word frequency feature set as one dimension of the feature vector. For example, if feature set 1 is extracted with 2-grams and has word frequency features (TF-IDF)₁₁, (TF-IDF)₁₂, and (TF-IDF)₁₃, and feature set 2 is extracted with 4-grams and has word frequency features (TF-IDF)₂₁, (TF-IDF)₂₂, and (TF-IDF)₂₃, then the feature vector is A=(a₁,a₂), where a₁=((TF-IDF)₁₁, (TF-IDF)₁₂, (TF-IDF)₁₃) and a₂=((TF-IDF)₂₁, (TF-IDF)₂₂, (TF-IDF)₂₃).
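A minimal sketch of mode one follows, assuming that a fixed vocabulary of feature word combinations per extraction rule is available so that every script file maps to a vector of the same length. The names `build_feature_vector`, `vocab_by_rule`, and `tfidf_by_rule` are hypothetical, and flattening the per-rule blocks a₁, a₂ into one list is a choice of this sketch rather than something the embodiment prescribes.

```python
def build_feature_vector(tfidf_by_rule, vocab_by_rule):
    """Concatenate per-rule TF-IDF blocks (e.g. 2-gram, 4-gram) in a fixed order.

    tfidf_by_rule: {n: {ngram_tuple: tf_idf_value}} for one script file.
    vocab_by_rule: {n: [ngram_tuple, ...]} giving the column order (assumed fixed).
    """
    vector = []
    for rule in sorted(vocab_by_rule):
        scores = tfidf_by_rule.get(rule, {})
        # Combinations absent from this file contribute 0.0.
        vector.extend(scores.get(combo, 0.0) for combo in vocab_by_rule[rule])
    return vector


vocab_by_rule = {2: [("A", "B"), ("B", "C")], 4: [("A", "B", "C", "D")]}
tfidf_by_rule = {2: {("A", "B"): 0.3}, 4: {("A", "B", "C", "D"): 0.1}}
vector = build_feature_vector(tfidf_by_rule, vocab_by_rule)
```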

Mode two: the network device can also calculate the basic features of the sample script file and determine the feature vector of the sample script file from the word frequency features and the basic features. Specifically, for each sample script file, the network device extracts the basic features of the sample script file with a preset basic feature extraction algorithm, and forms the feature vector of the sample script file from those basic features and the word frequency features of the feature word combinations in the first feature set of the sample script file.

The basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio.

In this embodiment, the basic feature extraction algorithm can also be stored in the network device in advance. The basic features can include one or more of information entropy, longest word length, index of coincidence, and compression ratio, and can also include other features known in the prior art; the embodiment places no limitation on this. The network device can calculate basic features such as information entropy, longest word length, index of coincidence, and compression ratio from the code (the source code) of the sample script file, and then form the feature vector of the sample script file, which is a multi-dimensional vector, from the basic features and the word frequency features of the feature word combinations in the first feature set. For example, for the longest word length, the network device can traverse all words in the sample script file and find the word with the most characters, the longest word; the number of characters in that word is the longest word length. As another example, for the compression ratio, the network device can compress the sample script file, determine its compressed size, and divide the compressed size by the original size of the sample script file. The calculation of information entropy, longest word length, index of coincidence, and compression ratio belongs to the prior art and is not described further here.
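The four basic features can be sketched with the Python standard library as follows. The embodiment leaves the exact algorithms to the prior art, so the choices here are assumptions: byte-level Shannon entropy, whitespace-delimited words, a byte-level index of coincidence, and zlib as the compressor.

```python
import math
import re
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Information entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


def longest_word_length(text: str) -> int:
    """Length of the longest whitespace-delimited token (tokenization is assumed)."""
    return max((len(word) for word in re.findall(r"\S+", text)), default=0)


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (n * (n - 1))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, as described in the text."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def basic_features(source: str) -> list:
    raw = source.encode("utf-8")
    return [
        shannon_entropy(raw),
        float(longest_word_length(source)),
        index_of_coincidence(raw),
        compression_ratio(raw),
    ]
```

Encrypted or encoded payloads tend to show high entropy, very long tokens, and poor compressibility, which is why such features are commonly paired with content-based ones.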

Step 105: based on a machine learning algorithm, train the script identification model according to the feature vector and the label of each sample script file.

In this embodiment, the network device can train the script identification model, based on a machine learning algorithm, according to the feature vector and the label of each sample script file. For example, the script identification model can be trained with a gradient boosting decision tree (GBDT) algorithm, a support vector machine algorithm, a random forest algorithm, or a logistic regression algorithm; the embodiment places no limitation on this.
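As a rough illustration of step 105, the sketch below trains a logistic regression model, one of the algorithms named above, by plain stochastic gradient descent in pure Python. The learning rate, the epoch count, the toy feature vectors, and the label encoding (1 for malicious, 0 for non-malicious) are all assumptions made for this sketch; a production system would more likely rely on an established machine learning library.

```python
import math


def train_logistic_regression(vectors, labels, lr=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of label 1
            err = p - y                         # gradient of the log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model, x):
    """Return 1 (malicious) when the linear score is non-negative, else 0."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0


# Toy, hypothetical feature vectors with labels: 1 = malicious, 0 = non-malicious.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
train_y = [1, 1, 0, 0]
model = train_logistic_regression(train_x, train_y)
```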

Step 106: when a script file to be identified is obtained, identify it with the script identification model to determine whether it is a malicious script file.

In this embodiment, when the network device obtains a script file to be identified, it can extract the feature vector of that script file and input the feature vector into the script identification model to determine whether the script file is a malicious script file. The identification process is described in detail later.

In this embodiment, the training process (such as steps 101 to 105) and the identification process (such as step 106) can be executed on different electronic devices or on the same electronic device.

The embodiment also provides an example flow chart of a training method for the identification model, as shown in Figure 2. In this example, the feature extraction rules are 2-gram, 3-gram, and 4-gram, so for any sample script file the network device can calculate the word frequency feature set for 2-grams, for 3-grams, and for 4-grams. The network device can also calculate the basic features of the sample script file, obtaining the feature vector of the sample script file, which includes its basic features and its word frequency feature sets, and then train the script identification model with this feature vector. The processing is similar to steps 101 to 105 above and is not repeated here.

The embodiment also provides another example of a training method for the script identification model, in which the model is trained with the C4.5 decision tree algorithm. The C4.5 decision tree algorithm is used for classification problems in machine learning and data mining. Given a data set in which each sample is described by a set of features (a feature vector) and belongs to exactly one of a set of mutually exclusive classes, the goal of the C4.5 decision tree algorithm is to learn, through training, a mapping from feature vectors to classes, so that objects whose class is unknown can later be classified with this mapping.

In a C4.5 decision tree, each internal node (non-leaf node) represents a test on a feature, each branch represents an outcome of the test, and each leaf node stores a class label. Once the decision tree is built, a sample without a class label can be followed along a path from the root node to a leaf node, and the class label stored in that leaf node is the predicted class of the sample.
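The root-to-leaf traversal described above can be sketched as follows. The nested-dictionary layout of the tree, the key names, and the binary threshold test at each internal node are assumptions chosen for brevity.

```python
def classify(node, features):
    """Follow one path from the root to a leaf and return the stored class label."""
    while "label" not in node:                  # internal node: test one feature
        value = features[node["feature"]]
        node = node["le"] if value <= node["threshold"] else node["gt"]
    return node["label"]                        # leaf node: predicted class


# Hypothetical two-level tree over a two-dimensional feature vector.
tree = {
    "feature": 0, "threshold": 0.5,
    "le": {"label": "non-malicious"},
    "gt": {
        "feature": 1, "threshold": 0.2,
        "le": {"label": "non-malicious"},
        "gt": {"label": "malicious"},
    },
}
```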

Assume that the set of sample script files is D and that the set of feature vectors determined from the sample script files is A. Training outputs a decision tree T; the specific training process is as follows.

Step 1: obtain the sample feature vector file.

The set D of sample script files includes multiple labeled sample script files, where the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request. For each sample script file, after its feature vector is determined, the feature vector is marked with the label of the sample script file, giving a labeled feature vector. In this way, a sample feature vector file containing multiple labeled feature vectors is obtained.

Step 2: normalize the sample data.

Step 3: build the decision tree.

The decision tree can be built as follows.

(1) If set A is empty, generate a tree node whose information values are all 0 and return.

(2) If all feature vectors in set A belong to the same class C_k, generate a leaf node and return; the class of the leaf node is marked as C_k.

(3) In cases other than (1) and (2), calculate the gain ratio of each feature in the feature vector. The gain ratio can be calculated as follows:

IGR = IG / IV

Here, IGR (information gain ratio) is the information gain ratio of the feature, IG (information gain) is the information gain of the feature, and IV (information value) is the split information of the feature.

(4) Determine the first feature, which is the feature with the largest gain ratio, and the decision rule for the first feature, and add the first feature to the decision tree.

The decision rule includes a decision threshold and the output path for each decision result.

(5) For the features other than the first feature, repeat step (4) to build the related subtrees recursively.

(6) Output the decision tree T.

In a decision tree built by the above process, the larger the gain ratio of a feature, the better the feature discriminates between classes and the closer its node is to the root node of the decision tree; conversely, the smaller the gain ratio of a feature, the worse it discriminates and the farther its node is from the root node.
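The gain ratio IGR = IG / IV used in step (3) can be sketched for a single numeric feature and a candidate decision threshold as follows. The binary split on a threshold and the function names are assumptions of this sketch; C4.5 additionally searches over candidate thresholds and features, which is omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """IGR = IG / IV for splitting one feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:                   # degenerate split carries no information
        return 0.0
    total = len(labels)
    parts = (left, right)
    conditional = sum(len(p) / total * entropy(p) for p in parts)
    information_gain = entropy(labels) - conditional                               # IG
    split_info = -sum(len(p) / total * math.log2(len(p) / total) for p in parts)   # IV
    return information_gain / split_info


# Hypothetical values of one feature with labels: 1 = malicious, 0 = non-malicious.
values = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
```

In this toy data, the threshold 0.5 separates the two classes perfectly and therefore scores a higher gain ratio than the unbalanced threshold 0.15.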

Optionally, after the decision tree is built, a preset number of sample script files can be input into the decision tree to test its decisions, and the decision tree can be pruned with a pessimistic error rate estimation algorithm to improve its decision accuracy. The pruned decision tree is then output.

As shown in Figure 3, the embodiment also provides a flow chart of a method for identifying a script file, which specifically includes the following steps.

Step 301: obtain the first script file.

In this embodiment, the network device can obtain the first script file to be identified. For example, when the legitimacy of a script file (the first script file) needs to be checked, a technician can input the script file into the network device, and the network device receives it. As another example, the network device can receive a first script file sent by another network device. The network device can also perform script file identification automatically at regular intervals, obtaining a first script file that is currently stored locally or obtaining one from a target network device to be inspected.

Step 302: convert the first script file into a machine instruction sequence.

For the specific processing of this step, refer to the description of step 102; it is not repeated here.
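Purely as an analogy, and not as the conversion the embodiment uses for any particular script language, the sketch below shows how a Python script can be compiled, without being executed, into a sequence of instruction names using the standard `dis` module. The function name `python_opcode_sequence` is hypothetical, and the sketch covers only the module-level code object, not nested functions.

```python
import dis


def python_opcode_sequence(source):
    """Compile (but do not run) a Python script and list its instruction names."""
    code = compile(source, "<script>", "exec")
    return [instruction.opname for instruction in dis.get_instructions(code)]


sequence = python_opcode_sequence("x = 1\n")
```

The exact instruction names depend on the interpreter version, so a corpus and the scripts to be identified would need to be converted with the same version.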

Step 303: using a preset feature extraction rule, extract feature word combinations from the machine instruction sequence of the first script file to obtain the first feature set of the first script file.

For the specific processing of this step, refer to the description of step 103; it is not repeated here.

Step 304: according to a preset word frequency feature algorithm, calculate the word frequency feature of each feature word combination in the first feature set of the first script file, and determine the feature vector of the first script file from those word frequency features.

For the specific processing of this step, refer to the description of step 104; it is not repeated here.

Step 1: for each feature word combination in the first feature set of the first script file, determine the first ratio of the number of occurrences of the feature word combination in the first script file to the total number of feature word combinations that the first script file contains.

In this embodiment, for each feature word combination in the first feature set of the first script file, after extracting feature word combinations from the first script file, the network device can count the occurrences of the feature word combination, obtaining its number of occurrences in the first script file. The network device can also count the total number of extracted feature word combinations (the total number of feature word combinations that the first script file contains), and then divide the number of occurrences by this total to obtain the first ratio. The specific calculation formula is as follows:

TF_w = n_w / N

Here, n_w is the number of occurrences of feature word combination w in the first script file, and N is the total number of feature word combinations that the first script file contains.
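The first ratio can be sketched as follows, assuming the machine instruction sequence is available as a list of instruction names and that feature word combinations are contiguous N-grams over that list. The function names and the placeholder instruction names are hypothetical.

```python
from collections import Counter


def extract_ngrams(sequence, n):
    """All contiguous n-grams of the instruction sequence, in order of appearance."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def term_frequencies(sequence, n):
    """TF_w = n_w / N for every distinct n-gram w in the sequence."""
    grams = extract_ngrams(sequence, n)
    total = len(grams)                          # N: total number of feature word combinations
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


# Placeholder instruction names; a real sequence would come from the conversion step.
tf = term_frequencies(["A", "B", "A", "B", "C"], 2)
```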

In this embodiment, after calculating the first ratio and the second ratio, the network device can calculate the word frequency feature of the feature word combination from them. The specific calculation formula can be as follows:

(TF-IDF)_w=TF_w*IDF_w

Here, (TF-IDF)_w is the word frequency feature of feature word combination w, and TF_w is the probability of occurrence of feature word combination w in the first script file (the first ratio); IDF_w = log(N₁/(N₂+1)), where N₁ is the total number of script files in the corpus and N₂ is the first number, that is, the number of script files that contain feature word combination w.

Within a given file (such as the first script file), if a feature word combination has a high probability of occurrence in that file (for example, above a preset threshold, or the highest among all feature word combinations) and a low probability of occurrence across the whole corpus (for example, below a preset threshold), a large TF-IDF value is calculated for that feature word combination. TF-IDF therefore tends to filter out common words and retain important ones. Note that adding 1 to the denominator in the IDF_w formula above is optional; its purpose is to keep the denominator from being 0, and another positive number, such as 2 or 1/3, can be added instead.

In this embodiment, the network device can determine the feature vector of the first script file from the word frequency features of the feature word combinations in the first feature set of the first script file. This can be implemented in various ways; the embodiment provides two feasible implementations, as follows.

Mode one: the network device forms the feature vector of the first script file from the word frequency features of the feature word combinations in the first feature set of the first script file.

Mode two: the network device can also calculate the basic features of the first script file and determine the feature vector of the first script file from the word frequency features and the basic features. Specifically, the network device extracts the basic features of the first script file with a preset basic feature extraction algorithm, and forms the feature vector of the first script file from those basic features and the word frequency features of the feature word combinations in the first feature set of the first script file.

Step 305: input the feature vector of the first script file into the script identification model to obtain the recognition result of the first script file.

The script identification model can be the one trained through steps 101 to 105 above.

In this embodiment, the network device can input the feature vector of the first script file into the preset script identification model, and the script identification model outputs the recognition result of the first script file, which can be either non-malicious script file or malicious script file.

Previous webshell detection methods, whether traditional static matching or discrimination models based on machine learning algorithms, essentially rely on a feature library, and the features depend on manual collection and enrichment; they often fail when a webshell file uses encryption or concealment. In this embodiment, by contrast, the script file is compiled into a machine instruction sequence, and the patterns in the instruction sequences of positive and negative samples (the word frequency features) are analyzed. The composition of the script file's instruction sequence is combined with features of the original script, such as information entropy, longest word length, index of coincidence, and compression ratio, to form the features of each script file. Both kinds of features are considered together to form the feature vector of the script file to be identified, without relying on a manually built feature library, and recognition accuracy is high. In addition, this application uses a script identification model trained by machine learning to detect whether a script file is a webshell; the method has low computational complexity and good adaptivity. Moreover, although the code of a script file can change substantially and lose its regularity after encryption or concealed invocation, the word frequency features of the machine instruction sequence converted from it still show a certain regularity, so malicious and non-malicious script files can be distinguished by word frequency features. In this scheme, feature vectors are generated from the word frequency features of machine instruction sequences and then used to train the script identification model. The trained script identification model can therefore identify webshell scripts that use encryption or concealed invocation, offers good real-time performance, and meets the engineering needs of the current webshell detection field.

The embodiment also provides an example of a method for identifying a script file, as shown in Figure 4. The specific processing can be as follows.

Step 401: obtain the first script file.

For the specific processing of this step, refer to the description of step 101; it is not repeated here.

Step 402: using a preset basic feature extraction algorithm, extract the basic features of the first script file.

The basic features include information entropy, longest word length, index of coincidence, compression ratio, and so on.

Step 403: convert the first script file into a machine instruction sequence.

Step 404: using a preset N-gram phrase extraction algorithm, extract feature word combinations from the machine instruction sequence of the first script file to obtain the first feature set of the first script file.

Step 405: calculate the TF-IDF feature (the word frequency feature) of each feature word combination in the first feature set of the first script file.

Step 406: form the feature vector of the first script file from the word frequency features and the basic features.

Step 407: input the feature vector of the first script file into the script identification model to obtain the recognition result of the first script file.

For the specific processing of this step, refer to the description of step 106 and steps 301 to 304; it is not repeated here.

In this embodiment, the network device obtains the first script file, converts it into a machine instruction sequence, and extracts feature word combinations from the machine instruction sequence with a preset feature extraction rule to obtain the first feature set of the first script file. It then calculates, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set, and determines the feature vector of the first script file from those word frequency features. Finally, the network device inputs the feature vector of the first script file into the script identification model to obtain the recognition result of the first script file.
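The flow of steps 402 to 407 can be wired together as in the sketch below. Every component is passed in as a function, because the embodiment does not fix the converter, the feature extractors, or the model; the parameter names and the stub components in the usage lines are hypothetical placeholders.

```python
def identify_script(source, to_instructions, basic_features,
                    word_frequency_features, model):
    """Run one script file through the identification flow and return the result."""
    basics = basic_features(source)                       # step 402
    instructions = to_instructions(source)                # step 403
    frequencies = word_frequency_features(instructions)   # steps 404 and 405
    vector = list(basics) + list(frequencies)             # step 406
    return "malicious" if model(vector) else "non-malicious"  # step 407


# Stub components, for illustration only.
result = identify_script(
    "example source",
    to_instructions=lambda src: src.split(),
    basic_features=lambda src: [float(len(src))],
    word_frequency_features=lambda seq: [float(len(seq))],
    model=lambda vec: vec[0] > 100.0,
)
```

Keeping the components injectable lets training and identification share the same feature code, which matters because the model only works on vectors built exactly as its training vectors were.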

Based on this embodiment: previous webshell detection methods, whether traditional static matching or discrimination models based on machine learning algorithms, essentially rely on a feature library, and the features depend on manual collection and enrichment; they often fail when a webshell file uses encryption or concealment. In this embodiment, by contrast, the script file is compiled into a machine instruction sequence, and the patterns in the instruction sequences of positive and negative samples (the word frequency features) are analyzed. The composition of the script file's instruction sequence is combined with features of the original script, such as information entropy, longest word length, index of coincidence, and compression ratio, to form the features of each script file. Both kinds of features are considered together to form the feature vector of the script file to be identified, without relying on a manually built feature library, and recognition accuracy is high. In addition, this application uses a script identification model trained by machine learning to detect whether a script file is a webshell; the method has low computational complexity and good adaptivity. Moreover, although the code of a script file can change substantially and lose its regularity after encryption or concealed invocation, the word frequency features of the machine instruction sequence converted from it still show a certain regularity, so malicious and non-malicious script files can be distinguished by word frequency features. In this scheme, feature vectors are generated from the word frequency features of machine instruction sequences and then used to train the script identification model. The trained script identification model can therefore identify webshell scripts that use encryption or concealed invocation, offers good real-time performance, and meets the engineering needs of the current webshell detection field.

Based on the same technical idea, as shown in Figure 5, the embodiment also provides a device for identifying a script file. The device includes:

an obtaining module 510, configured to obtain multiple labeled sample script files, where the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;

a conversion module 520, configured to convert each sample script file into a machine instruction sequence;

a first extraction module 530, configured to, for each sample script file, extract feature word combinations from the machine instruction sequence of the sample script file with a preset feature extraction rule to obtain the first feature set of the sample script file;

a first determining module 540, configured to, for the first feature set, calculate the word frequency feature of each feature word combination in the first feature set according to a preset word frequency feature algorithm, and determine the feature vector of the sample script file from those word frequency features;

a training module 550, configured to train the script identification model, based on a machine learning algorithm, according to the feature vector and the label of each sample script file;

a second determining module 560, configured to, when a script file to be identified is obtained, identify it with the script identification model to determine whether it is a malicious script file.

Optionally, the first determining module 540 is specifically configured to:

for each feature word combination in the first feature set, determine the first ratio of the number of occurrences of the feature word combination in the sample script file to the total number of feature word combinations that the sample script file contains;

in a preset corpus, determine the first number of script files that contain the feature word combination, and determine the second ratio of the total number of script files in the corpus to the first number, where the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;

calculate the word frequency feature of the feature word combination according to the first ratio and the second ratio.

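Optionally, the first determining module 540 is specifically configured to:

determine the first ratio using the formula TF_w = n_w / N, where n_w is the number of occurrences of feature word combination w in the sample script file and N is the total number of feature word combinations that the sample script file contains, and calculate the word frequency feature using the following formula:

(TF-IDF)_w=TF_w*IDF_w;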

Optionally, as shown in Figure 6, the device further includes:

a second extraction module 570, configured to, for each sample script file, extract the basic features of the sample script file with a preset basic feature extraction algorithm, where the basic features include one or more of information entropy, longest word length, index of coincidence, and compression ratio;

the first determining module 540 is specifically configured to:

form the feature vector of the sample script file from the basic features of the sample script file and the word frequency features of the feature word combinations in the first feature set.

Optionally, the first determining module 540 is specifically configured to:

form the feature vector of the sample script file from the word frequency features of the feature word combinations in the first feature set.

Based on the same technical idea, as shown in fig. 7, the embodiment of the present application also provides a kind of dresses for identifying script file It sets, which includes:

An obtaining module 710, configured to obtain a first script file;

A conversion module 720, configured to convert the first script file into a machine instruction sequence;

A first extraction module 730, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;

A determining module 740, configured to calculate, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file, and to determine the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file;

An input module 750, configured to input the feature vector of the first script file into a script identification model to obtain the identification result of the first script file, wherein the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word frequency features of that sample script file.
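The flow through modules 730 to 750 can be sketched in Python as follows. This is only an illustration under stated assumptions: the machine instruction sequence is assumed to be already available as a list of opcode names (the conversion step is outside the sketch), the preset feature extraction rule is assumed to be consecutive n-grams, `idf` is assumed to be a precomputed per-combination weight, and `ThresholdModel` is a stand-in for a trained script identification model, not a real classifier. All names are hypothetical.

```python
from collections import Counter


def extract_combos(opcodes, n=2):
    """Assumed extraction rule: join every n consecutive opcodes."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def to_feature_vector(opcodes, vocabulary, idf):
    """Word frequency feature of each vocabulary entry, in vocabulary order."""
    combos = extract_combos(opcodes)
    counts = Counter(combos)
    total = len(combos) or 1              # avoid division by zero
    return [counts[c] / total * idf.get(c, 0.0) for c in vocabulary]


def identify(opcodes, vocabulary, idf, model):
    """Return the model's identification result for one script file."""
    vector = to_feature_vector(opcodes, vocabulary, idf)
    return model.predict([vector])[0]


class ThresholdModel:
    """Stand-in for a trained model (illustration only)."""

    def predict(self, vectors):
        return ["malicious" if sum(v) > 0.5 else "non-malicious" for v in vectors]


# Usage with illustrative opcode names and weights.
vocabulary = ["FETCH_R INCLUDE_OR_EVAL", "ECHO RETURN"]
idf = {"FETCH_R INCLUDE_OR_EVAL": 2.0, "ECHO RETURN": 1.0}
result = identify(["FETCH_R", "INCLUDE_OR_EVAL", "ECHO"], vocabulary, idf, ThresholdModel())
```

Any object exposing a `predict` method that accepts a list of vectors could replace `ThresholdModel` here; the identification side only needs the vocabulary and weights that were fixed during training.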

Optionally, the determining module 740 is specifically configured to:

For each feature word combination included in the first feature set of the first script file, determine a first ratio of the number of occurrences of that feature word combination in the first script file to the total number of feature word combinations included in the first script file;

Optionally, as shown in Fig. 8, the device further includes:

A second extraction module 760, configured to extract the basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

The determining module 740 is specifically configured to:

Optionally, the determining module 740 is specifically configured to:

An embodiment of the present application further provides a network device which, as shown in Fig. 9, includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, wherein the processor 901, the communication interface 902, and the memory 903 communicate with one another through the communication bus 904;

The memory 903 is configured to store a computer program;

The processor 901 is configured to, when executing the program stored in the memory 903, cause the network device to perform the steps of the above method for identifying a script file.

The communication bus of the above network device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above network device and other devices.

The memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In another embodiment provided by the present application, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program which, when executed by a processor, implements any of the above methods for identifying a script file.

In another embodiment provided by the present application, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to perform the steps of any of the above methods for identifying a script file.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

The foregoing describes only preferred embodiments of the present application and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application falls within the scope of protection of the present application.

Claims

1. A method for identifying a script file, characterized in that the method comprises:

obtaining multiple labeled sample script files, wherein the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;

converting each sample script file into a machine instruction sequence;

for each sample script file, extracting feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;

for the first feature set, calculating the word frequency feature of each feature word combination in the first feature set according to a preset word frequency feature algorithm, and determining the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set;

training a script identification model, based on a machine learning algorithm, according to the feature vector of each sample script file and the label of each sample script file;

when a script file to be identified is obtained, identifying the script file to be identified using the script identification model, and determining whether the script file to be identified is a malicious script file.

2. The method according to claim 1, characterized in that calculating the word frequency feature of each feature word combination in the first feature set according to the preset word frequency feature algorithm comprises:

for each feature word combination included in the first feature set, determining a first ratio of the number of occurrences of that feature word combination in the sample script file to the total number of feature word combinations included in the sample script file;

in a preset corpus, determining a first number of script files that contain the feature word combination, and determining a second ratio of the total number of script files included in the corpus to the first number, wherein the corpus includes the machine instruction sequences of multiple non-malicious script files and the machine instruction sequences of multiple malicious script files;

3. The method according to claim 2, characterized in that calculating the word frequency feature of the feature word combination according to the first ratio and the second ratio comprises:

determining the first ratio using the following formula:

TF_w = n_w / N;

where n_w is the number of occurrences of feature word combination w in the sample script file, and N is the total number of feature word combinations included in the sample script file;

in the preset corpus, determining the first number of script files that contain the feature word combination w, and determining the second ratio of the total number of script files included in the corpus to the first number;

(TF-IDF)_w = TF_w * IDF_w;

where (TF-IDF)_w is the word frequency feature of feature word combination w, TF_w denotes the probability of occurrence of feature word combination w in the sample script file, IDF_w is obtained from the second ratio of N₁ to N₂, N₁ is the total number of script files included in the corpus, and N₂ is the first number of script files that contain the feature word combination w.

4. The method according to claim 1, characterized in that the method further comprises:

for each sample script file, extracting the basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

wherein determining the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set comprises:

forming the feature vector of the sample script file from the basic features of the sample script file and the word frequency feature of each feature word combination in the first feature set.

5. The method according to claim 1, characterized in that determining the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set comprises:

forming the feature vector of the sample script file from the word frequency feature of each feature word combination in the first feature set.

6. A method for identifying a script file, characterized in that the method comprises:

obtaining a first script file;

converting the first script file into a machine instruction sequence;

extracting feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;

calculating, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file, and determining the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file;

inputting the feature vector of the first script file into a script identification model to obtain the identification result of the first script file, wherein the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word frequency features of that sample script file.

7. The method according to claim 6, characterized in that calculating, according to the preset word frequency feature algorithm, the word frequency feature of each feature word combination included in the first feature set of the first script file comprises:

for each feature word combination included in the first feature set of the first script file, determining a first ratio of the number of occurrences of that feature word combination in the first script file to the total number of feature word combinations included in the first script file;

8. The method according to claim 6, characterized in that the method further comprises:

extracting the basic features of the first script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

wherein determining the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file comprises:

forming the feature vector of the first script file from the basic features of the first script file and the word frequency feature of each feature word combination in the first feature set of the first script file.

9. The method according to claim 6, characterized in that determining the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file comprises:

forming the feature vector of the first script file from the word frequency feature of each feature word combination in the first feature set of the first script file.

10. A device for identifying a script file, characterized in that the device comprises:

an obtaining module, configured to obtain multiple labeled sample script files, wherein the label is either a label indicating that a web page request is a malicious web page request or a label indicating that a web page request is a non-malicious web page request;

a first extraction module, configured to extract, for each sample script file, feature word combinations from the machine instruction sequence of the sample script file using a preset feature extraction rule, to obtain a first feature set of the sample script file;

a first determining module, configured to calculate, for the first feature set, the word frequency feature of each feature word combination in the first feature set according to a preset word frequency feature algorithm, and to determine the feature vector of the sample script file according to the word frequency feature of each feature word combination in the first feature set;

a training module, configured to train a script identification model, based on a machine learning algorithm, according to the feature vector of each sample script file and the label of each sample script file;

a second determining module, configured to, when a script file to be identified is obtained, identify the script file to be identified using the script identification model and determine whether the script file to be identified is a malicious script file.

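11. The device according to claim 10, characterized in that the first determining module is specifically configured to:

12. The device according to claim 11, characterized in that the first determining module is specifically configured to:

determine the first ratio using the following formula:

TF_w = n_w / N;

where n_w is the number of occurrences of feature word combination w in the sample script file, and N is the total number of feature word combinations included in the sample script file;

(TF-IDF)_w = TF_w * IDF_w;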

13. The device according to claim 10, characterized in that the device further comprises:

a second extraction module, configured to extract, for each sample script file, the basic features of the sample script file using a preset basic feature extraction algorithm, the basic features including one or more of information entropy, longest word length, index of coincidence, and compression ratio;

the first determining module is specifically configured to:

form the feature vector of the sample script file from the basic features of the sample script file and the word frequency feature of each feature word combination in the first feature set;

or

the first determining module is specifically configured to:

14. A device for identifying a script file, characterized in that the device comprises:

an obtaining module, configured to obtain a first script file;

a first extraction module, configured to extract feature word combinations from the machine instruction sequence of the first script file using a preset feature extraction rule, to obtain a first feature set of the first script file;

a determining module, configured to calculate, according to a preset word frequency feature algorithm, the word frequency feature of each feature word combination in the first feature set of the first script file, and to determine the feature vector of the first script file according to the word frequency feature of each feature word combination in the first feature set of the first script file;

an input module, configured to input the feature vector of the first script file into a script identification model to obtain the identification result of the first script file, wherein the script identification model is trained according to the feature vectors of sample script files and a machine learning algorithm, and the feature vector of a sample script file is determined according to the word frequency features of that sample script file.

15. A network device, characterized in that it comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the method steps of any one of claims 1-5 or claims 6-9 when executing the program stored in the memory.

16. A machine-readable storage medium, characterized in that it stores machine-executable instructions which, when called and executed by a processor, cause the processor to implement the method steps of any one of claims 1-5 or claims 6-9.