CN107516041B

CN107516041B - WebShell detection method and system based on deep neural network

Info

Publication number: CN107516041B
Application number: CN201710705914.1A
Authority: CN
Inventors: 张涛; 齐龙晨; 宁戈
Original assignee: Beijing Anpro Information Technology Co ltd
Current assignee: Beijing Anpro Information Technology Co ltd
Priority date: 2017-08-17
Filing date: 2017-08-17
Publication date: 2020-04-03
Anticipated expiration: 2037-08-17
Also published as: CN107516041A

Abstract

The invention discloses a WebShell detection method and a system thereof based on a deep neural network, wherein a recursive cyclic neural network based on an abstract syntax tree automatically acquires lexical and syntax information of a script aiming at a script language, and completes feature extraction and WebShell detection by utilizing hierarchical structural features of the abstract syntax tree, wherein the feature extraction and WebShell detection comprise preprocessing, sample generation and WebShell detection; firstly, lexical and grammatical information of a script is automatically acquired, and then the characteristic extraction and WebShell detection are completed by using a recurrent neural network based on an abstract syntax tree. The method has the advantages of low deployment cost, good transportability and high detection accuracy.

Description

WebShell detection method and system based on deep neural network

Technical Field

The invention relates to the technical field of information security, in particular to a WebShell detection method and a WebShell detection system of a recurrent neural network based on an abstract syntax tree.

Background

WebShell is a command execution environment in the form of a web page, often used by intruders as a backdoor tool for operating web servers. An attacker obtains the management authority of the Web service through the WebShell, so that penetration and control on Web application are achieved.

Since the characteristics of the WebShell and the common Web page are almost consistent, the detection of the traditional firewall and antivirus software can be avoided. And with the application of various feature confusion hiding technologies for anti-detection to WebShell, a traditional detection mode based on feature code matching is difficult to detect new variants in time.

From the attacker perspective, WebShell is a script Trojan backdoor written by asp, aspx, php or jsp and the like. After an attacker invades a website, the script files are often uploaded to a Web server directory. By means of browser access, the Web server can be controlled while the script file is accessed, for example, data of a website database is read, files on the website server are deleted, and even system branch commands can be directly operated if the Web authority is high.

The existing WebShell detection methods are white box detection, namely detection is carried out on a source code of a WebShell script file, and the existing WebShell detection methods can be specifically divided into two types of detection based on a host and detection based on a network.

Host-based detection: among these methods, the detection method more common in the industry is to directly use known keywords as features, to search suspicious files by grep sentences and then to manually analyze the suspicious files, or to periodically check MD5 values of existing files and to check whether new files are generated. This intuitive detection method is easily circumvented by attackers using obfuscation means.

Network-based detection: the current existing method mainly focuses on configuring an intrusion detection system, namely a WAF, at a network entrance to detect WebShell, and judges whether an attacker uploads HTML or script files by analyzing whether special keywords (e.g., < form, <%, <. The method needs large expenditure and also has the possibility of false alarm; and can only detect the behavior of uploading WebShell, but cannot be detected by the existing WebShell.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a WebShell detection method and a WebShell detection system of a recurrent neural network based on an abstract syntax tree. The invention introduces a program language processing technology and a deep learning technology at the same time, automatically acquires lexical and grammatical information of the script mainly through the program language processing technology, and completes feature extraction and WebShell detection by utilizing a deep neural network at the same time, wherein the method is mainly aimed at mainstream script languages, including PHP, JavaScript, Perl, Python, Ruby and the like. The system mainly comprises three modules, namely a preprocessing module utilizing a programming language processing technology, a sample generation module completing vectorization expression and a detection module utilizing a deep learning technology. The method has the advantages of low deployment cost, good transportability and high detection accuracy.

The following are several typical neural network model-related term definitions:

the operational formula of the neural network can be defined as:

wherein o is⁽ⁱ⁾Output vector, o, representing the i-th layer of the neural network⁽ⁱ⁾The dimension of (a) is the number of neurons (number of network nodes); x is the number of⁽ⁱ⁾Is the output of the i-1 th layer of the network and is used as the input of the i-th layer; w⁽ⁱ⁾And b⁽ⁱ⁾Is a parameter of the i-th layer of the neural network; balance

For activating functions, in general

Is a non-linear function. This neural network layer is called a fully connected layer (Full connectivity layer).

A Recurrent Neural Network (RNN) is used to process sequence inputs. The recurrent neural network processes one input sequence element at a time while maintaining the historical state of all past time sequence elements with one hidden unit.

The calculation formula of the recurrent neural network layer is as follows:

wherein x is_tIs an input vector, S_tIs a hidden unit vector, o_tIs the output vector; w, U, V is a parameter that is,

is an activation function.

The first appearance of the Pooling Layer (Pooling Layer) in convolutional neural networks was a downsampling window that slides through the input matrix. And the pooling layer performs down-sampling in the corresponding matrix sub-area according to the sampling function each time, slides to the next position according to the specified step until the sampling in the whole input matrix is finished, and finally outputs the sampling result matrix to the next layer. The most common sampling methods are maximum sampling, minimum sampling and mean sampling.

The splicing Layer (splicing Layer) is responsible for merging the input k vectors into one output vector, namely:

o＝i₁&i₂&,…,&i_kwherein&Are hyphens.

The technical scheme provided by the invention is as follows:

a WebShell detection method based on deep Neural Network, based on the Recurrent Neural Network (AST _ RRNN, Recurrent Neural Network based on abstract syntax tree), utilize the hierarchical structural feature of the abstract syntax tree, carry on WebShell to the script language of the mainstream; the method comprises a pretreatment process, a sample generation process and a detection process; the method specifically comprises the following steps (the scheme flow diagram is specifically shown in figure 1):

A. the script file is firstly preprocessed: the preprocessing module comprises a lexical analyzer, a syntax analyzer and a simplification module, wherein the input is script source codes, and the output is an abstract syntax tree AST (abstract syntax tree), and the specific steps are as follows:

A1. performing lexical analysis on the program codes to generate a lexical unit stream;

A2. the lexical analyzer analyzes the lexical unit flow to construct an abstract syntax tree;

A3. and simplifying the steps, namely filtering out semantic irrelevant information such as comments and the like after the step of grammar analysis.

B. And (4) generating a sample. The sample generation module of the WebShell detection method of AST _ RRNN comprises two parts of input contents: simplified AST and AST leaf nodes. However, since the difference in the size of the abstract syntax tree (the number of nodes of the tree) adversely affects the training and prediction of the detection module, the abstract syntax tree needs to be compressed before vectorization. And the sample generation module is responsible for converting the abstract syntax tree into a vectorized representation that facilitates training and prediction by the detection module. The compression steps of the abstract syntax tree are as follows:

B1. the compression of abstract syntax tree mainly utilizes the concepts and methods of n-node sampling sub-tree and m-branch tree transformation to limit the size of abstract syntax tree, and in addition, simple characteristic engineering method is utilized to complete the vectorization representation of leaf nodes.

B2. Vectorized representation of tree nodes. Adopting a One Hot Encoding (One Hot Encoding) method as the most intuitive vectorization Encoding method, adopting a node v of a One Hot Encoding vectorization abstract syntax tree, and marking as One _ Hot (v) to represent the node type of the v; and (3) adopting a Bag of words model (Bag of Word) vectorized abstract syntax tree T, and recording as BoW (T) to represent the number of each type of node in the T.

B3. Vectorized representation of leaf nodes, the leaf nodes are all Scalar (Node _ Scalar) types, including integers, floating point numbers, character strings, and the like. The method only focuses on the Scalar node (Scalar _ String) of the character String, and extracts danger function characteristics and character String statistical characteristics from the Scalar node of the character String.

C. And (3) detection process: and a deep neural network is adopted as a detection module. And aiming at the tree structure of the abstract syntax tree, a recursive cyclic neural network is adopted. The method comprises the following steps:

C1. aiming at the tree structure, the scheme defines a neural network layer: recursive Long Short Term Memory Layer (Recursive _ LSTM, Recursive Long Short Term Memory Layer). The recursion _ LSTM layer exploits the Recursive nature of trees whose vector representation is generated by some non-linear operation from the vector representations of their root nodes and subtree sets.

C2. The vectorized representation of the root node in the tree structure is the same as the vectorized representation of the tree node in B2; the vector representation of the subtree set is computed by inputting the subtrees sequentially into the recursive long-short term memory layer. Let the root node of the tree T ═ (V, E) be r, and the set of child nodes in r be C ═ C₁,c₂,…,c_i,…,c_|c|Is a set of corresponding subtrees

Wherein c is_iIs composed of

The root node of (2). The calculation formula of the T vectorization expression is formula 1:

wherein,

representing an activation function, W_root、W_pickupAnd W_subtreeIs a parameter. Encode (F) is the final output result of sequentially inputting the recurved _ LSTM layer represented by the vectorization representation of each m-ary tree in F, and is represented by formula 2:

C3. a Recursive Recurrent Neural Network (RRNN) is designed as a detection module by utilizing a recursion _ LSTM layer. The input to the RRNN includes two parts: 1) k vectorized m-ary trees representing intermediate nodes of the abstract syntax tree; 2) and the fixed-length vectors represent leaf nodes of the abstract syntax tree. The operation of RRNN is described as follows:

c31.RRNN bottom by k sharing weight's recurve _ LSTM layer, corresponding processing k m-tree, through the operation output k x d dimension characteristic, is marked as Feature^R＝[f¹,f²,…,f^k]^T。

C32. The Pooling Layer (Pooling Layer) uses three down-sampling functions of maximum value, minimum value and mean value to process Feature^RDownsampling (pooling) operations were performed in columns. Thus, the pooling layer outputs 3 d-dimensional vectors, denoted Feature^p＝[f^max,f^min,f^mean]^T。

C33. Splicing Layer (ligation Layer) will Feature^pAnd the vector f corresponding to the leaf feature^sSpliced into a vector, Feature^A＝f^max&f^min&f^mean&f^s，(&A representation tile); feature^AIs for the entropy of information, longestThe words,

Representing a coincidence index, a compression ratio and a danger function;

C34. subsequent full connectivity layer utilization Feature^AAnd performing WebShell judgment.

Specifically, given a decision threshold, when Feature^AAnd when the judgment threshold limit is exceeded, the file is identified as the WebShell file.

The judgment threshold value needs to be obtained through training, and is adjusted according to the accuracy and the recall rate and is not a fixed value. In the process of training the judgment threshold, the accuracy Precision is set as U, the Recall rate is set as V, and the accuracy U is the number of correct extracted information pieces/the number of extracted information pieces, which is also called Precision; the recall ratio V is the number of correct extracted information pieces/number of information pieces in a sample, and is also called recall ratio. The precision ratio and the recall ratio both take values between 0 and 1, and the closer the value is to 1, the higher the precision ratio or the recall ratio is. The decision threshold may be adjusted based on accuracy and recall. In the implementation of the invention, a Precision-Recall curve is drawn to help the adjustment analysis and the parameters are selected.

The invention discloses a recursive recurrent neural network detection method based on an abstract syntax tree. First, the script code is converted into an abstract syntax tree using a lexical analyzer. Then, a compression algorithm of the abstract syntax tree is invented. And finally, aiming at the structural characteristics of the abstract syntax tree, the invention provides a recurrent neural network model as a detection module.

In another aspect, the present invention further provides a WebShell detection system of a recurrent neural network based on an abstract syntax tree, where the system includes:

1. the preprocessing module is used for outputting an abstract syntax tree by using a syntax analyzer as a preprocessing module and using script source codes as input through syntax analysis;

2. the sample generation module, the sample generation module of the AST _ RRNN method, is responsible for converting the abstract syntax tree into a vector representation that facilitates the training and prediction of the detection module. The vectorized representation of the abstract syntax tree is divided into two parts: 1) and performing vectorization representation on the leaf nodes by adopting feature engineering and utilizing a simple matching rule and a statistic calculation method. 2) Designing a sampling algorithm to limit the scale of an abstract syntax tree part consisting of intermediate nodes, wherein the basic idea is to replace an original abstract syntax tree by utilizing a group of sampling subtrees with smaller scale;

3. the detection module is a deep neural network model, and constructs a recurrent neural network (RRNN). The user-defined Recursive long-short term memory layer recursion _ LSTM provides a bottom-up operation mode for the tree structure. Because the input of the RRNN is k tree structures, the bottom end of the RRNN is k recurrent _ LSTM layers sharing parameters, the operation result is spliced with the vector expression of the leaf nodes after being processed by the pooling layer, and finally the operation result is input into a subsequent full-connection layer.

The invention has the beneficial effects that:

the invention provides a WebShell detection method and a WebShell detection system of a recurrent neural network based on an abstract syntax tree. The invention introduces a program language processing technology and a deep learning technology at the same time, automatically acquires lexical and grammatical information of a script by the program language processing technology aiming at mainstream script languages, including PHP, JavaScript, Perl, Python, Ruby and the like, and completes feature extraction and WebShell detection by utilizing a deep neural network. The WebShell detection by using the technical scheme provided by the invention has the following advantages:

1) the features are automatically extracted, and the dependence on feature engineering is avoided;

2) the portability is good, and the thought and the flow are suitable for any scripting language;

3) the static detection method can be deployed at a Web server side in a light weight mode, and is low in deployment and detection cost;

4) the detection accuracy is high: compared with various test modes, the WebShell detection method based on the recurrent neural network of the abstract syntax tree can effectively deal with some relatively new WebShell type files (such as 0dayWebShell), and has good searching and killing effects on some deformed, encrypted and existing WebShell files.

Drawings

Fig. 1 is a flow chart of the WebShell detection method provided by the present invention.

Fig. 2 is a block diagram of a flow of a preprocessing module in the WebShell file detection process in the embodiment of the present invention.

Fig. 3 is a block flow diagram of a sample generation module in the WebShell file detection process in the embodiment of the present invention.

Fig. 4 is a block diagram of a flow of a detection module in the WebShell file detection process in the embodiment of the present invention.

Fig. 5 is a block diagram of the system provided by the present invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a method and a system for detecting a WebShell (WebShell) based on a recurrent neural network of an abstract syntax tree, wherein the system comprises a preprocessing module, a sample generating module and a detecting module; the detection of the WebShell file of the website is realized through the above process, and the specific implementation mode of the invention is as follows (in this example, PHP script is taken as an example for explanation, and other types of script languages are the same):

A. the preprocessing module, this part includes lexical analyzer, syntax analyzer, simplification step, specifically as follows (as fig. 2):

A1. a lexical analyzer, a PHP type file F containing program codes (script source codes), and generating a lexical unit stream WS after the lexical analysis;

A2. and (4) utilizing a syntax analyzer PHP-parser to perform syntax analysis on the WS to construct an abstract syntax tree AST.

The parsing process typically filters out semantically irrelevant information, such as annotations. The syntactic analysis is based on lexical analysis, which is more stringent than the lexical analysis rules. Meanwhile, compared with the lexical unit stream, the abstract syntax tree can reflect the code structure information more accurately.

A3. The abstract syntax tree is simplified, the structure of the abstract syntax tree after PHP-pharse analysis is clear, but the abstract syntax tree is slightly redundant, the abstract syntax tree needs to be simplified, and the simplification steps are as follows:

A31. deleting all leaf nodes of the abstract syntax tree, and simultaneously, in order to not lose leaf node characteristics, carrying out vectorization processing on the leaf nodes by adopting a simple characteristic engineering method through a sample generation module;

A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.

B. Sample generation module (as detailed in fig. 3):

B1. in the compression of the abstract syntax tree, since the difference in the size of the abstract syntax tree (the nodes of the tree) adversely affects the training and prediction of the detection module, it is necessary to compress the abstract syntax tree before vectorizing the abstract syntax tree. The abstract syntax tree is limited in size mainly by the concept and method of n-node sampling sub-tree and m-tree transformation. The specific compression steps are as follows:

B11. and repeatedly calling the n-node sampling subtree algorithm for K times for any abstract syntax tree T (V, E). The result returned by the n-node sampling subtree algorithm is called a sampling subtree. This step therefore ultimately produces a set of sampled subtrees of size K, denoted as

Wherein for any

Satisfy the requirement of

Does not exceed n.

B12. At F_sampleIn which a subset F of size k is determined_selectI.e. F_select∈F_sampleAnd | F_selectI | ═ k, such that

Satisfies the following conditions:

wherein the T () function is a value evaluation function for evaluating the argument F_subThe expressed sampling subtree set judges the 'value' of the conclusion to WebShell or understands the evaluation F_subThe amount of information that can contribute to the WebShell detection conclusion. The form and meaning of the T () function need to be customized, and the value evaluation function is defined as:

wherein, ω is₁、ω₂、ω₃The values are all set to be 1 in the scheme; sigma (),

pi () is three value ranges of [0,1 ]]Respectively, for measuring F_selectCoverage, suspicion, and variability. In the scheme, T () is the linear sum of three types of index values. Wherein:

coverage σ () function, in this scheme F is expected_selectContains as many nodes as possible in T to obtain more information of T. Thus, the coverage is defined as F_selectAt the ratio of the size of the set of nodes to | V |, the formula:

wherein,

is composed of

A set of points.

Suspicion of

A function. Suppose that

Are two n-node sampled sub-trees of the abstract syntax tree T. If it is

Just corresponding to the WebShell functional part in the source code,

the corresponding malicious-free obfuscated code portion, in the WebShell detection problem,

obvious ratio

More has "suspicious degree". Similarly, for node v_iRatio v_jMore "suspicion degree", therefore, the suspicion degree of the node v is defined as:

wherein, c_vWebShell represents the number of times v appears in all WebShell scripts in the training set, c_vAll indicates the number of times v occurs in all scripts in the training set. Defining n-node sampling subtree T_sampleThe suspicion degree of (2) is the mean value of the suspicion degrees of all nodes:

accordingly, define F_selectIs F_selectAverage value of suspicion degree of all n-node sampling subtrees:

a diversity pi () function.If it is

And

node types and structures are almost the same, then set

Is unlikely to provide a ratio

More useful information. Therefore, F is desired_selectThe sampled subtrees in (1) are as dissimilar as possible. Definition F_selectThe diversity of (A) is:

wherein, the Tree _ Diversity () is a Tree distance algorithm, and the distance between two trees is calculated.

The Tree distance algorithm Tree _ Diversity () is specifically as follows:

B13. to pair

All the sampled subtrees in (1), perform an m-ary tree transformation algorithm, denoted as

The m-ary tree transformation algorithm limits the size of the child nodes of any tree to m, and ensures that the size of the tree with any size of n is smaller than 2n after the m-ary tree transformation is carried out on the tree with any size of n. F is to be_transferInstead of the abstract syntax tree T, as input to the detection module.

And (5) m-ary tree transformation algorithm. The idea is as follows: if the child node set C of the node v exceeds the size limit, namely | C | > m, a layer of filling nodes is added between v and C until the child node size of v meets the limit. The filling node only reduces the size of the child node, does not contain any syntactic semantic information, and is defined as a 0 vector when the vector is represented. The specific algorithm is as follows:

let T be_sampleThe sub-tree is sampled for n nodes, and the tree T is obtained after m-way tree transformation_transferObviously, the size of the child node is not larger than m. The filling nodes introduced in the m-ary tree transformation process are all T_transferSo that the number of padding nodes is necessarily less than | V_sampleL, i.e. T_transferIs less than 2| V_sample|。

At this point, the compression process for the abstract syntax tree is complete.

B2. Vectorized representation of tree nodes. The vectorization representation of the tree nodes adopts a One hot encoding (One hot encoding) method, the method is the most intuitive vectorization encoding method, and the nodes v of the vectorization abstract syntax tree adopting the One hot encoding are marked as One _ hot (v) and represent the node types of the v; and (3) adopting a Bag of words model (Bag of Word) vectorized abstract syntax tree T, and recording as BoW (T) to represent the number of each type of node in the T.

Let T be_transfer＝{V_transfer,E_transferAnd generating an n-node sampling sub-tree of T ═ V, E through m-ary tree transformation. To V_transferOf arbitrary non-filler nodes v, T and T_transferSubtrees T with v as root node^(v)And

in the vectorization process, the vectorized representation of v consists of two parts: the first part is used for representing the type of the node v, and adopts a one-hot coding mode, and is marked as:

Encode(v)＝one_hot(v)

the second part being intended to represent T^(v)Is not sampled, and

the calculation formula of the node set of nodes that have not been "picked back" is:

for filler nodes, both partial representations are specified as 0 vectors.

And establishing a danger function list aiming at the script language, and searching whether the character string scalar node contains a danger function field or not by comparing the danger function list. And the risk function characteristics are vectorized and expressed by a bag-of-words model. The length of the feature vector is equal to the length of the list of the danger functions; the statistical characteristics of the character strings are interpreted from the mathematical point of view, and after the character strings are subjected to fuzzy processing such as confusion, encoding, encryption and the like, certain mathematical statistics of the character strings are usually deviated from the probability distribution of the character strings in a normal script. This is also the rationale for the NeoPi (an open source tool published by Neohapsis on github) method. The NeoPi is a script tool written by Python, detects malicious codes existing in texts and script files by using various statistical methods, and mainly detects the malicious codes by extracting the information entropy, the longest word, the coincidence index, the characteristics and the compression ratio of the files. The method selects 4 important indexes of the character string length, the coincidence index, the information entropy and the file compression ratio in the NeoPi method, and inspects each character string constant in the script.

String Length (Length of String). The string constants in the normal code are concise, and the code fragments are embedded into the string constants by part of WebShell, so that long strings are more likely to appear in the WebShell script compared with normal scripts.

Coincidence Index (Index of Coincidence). The coincidence index is one way to determine whether a file is encrypted or encoded. The calculation formula is as follows:

IC(s)＝∑(f_i*f_i-1)/N*(N-1)

wherein f is_iRepresenting the frequency of occurrence of the character i in the string s in the sample, and N is the length of the string. Statistics show that the coincidence index of a meaningful english text is 0.0667, while the index of a completely random english string is 0.0385. That is, when the coincidence index of an english string is close to 0.0385, we just tend to consider it encrypted or encoded, thereby further inferring that the script is likely to be WebShell.

Entropy of Information (Entropy of Information). Information entropy is a basic concept in information theory and is a measure of the degree of system ordering. The calculation formula is as follows:

H(s)＝-∑p_i*logp_i

wherein p is_iThe proportion of the string i that appears in the string s. Therefore, when the character string is pseudo-randomized by encryption or coding, the information entropy increases, and therefore the larger the information entropy value is, the higher the possibility of WebShell is.

The file compression ratio, which is defined as the ratio of the uncompressed file size to the compressed file size. The essence of data compression is to eliminate the imbalance in the distribution of specific characters, and to achieve length optimization by assigning short codes to high frequency characters, while low frequency characters use long codes. A web page file encoded by base64, with non-ASCII characters removed, will appear as a smaller distribution imbalance, with the larger compression ratio, calculated as follows:

wherein zip () represents compressing data and length () represents calculating data length.

C. A detection module (see fig. 4). The method adopts a deep neural network as a detection module, adopts a recursion cycle neural network aiming at a tree structure of an abstract syntax tree, and comprises the following concrete steps:

C1. aiming at the tree structure, the scheme defines a new neural network layer: recursive Long Short Term Memory Layer (Recursive _ LSTM, Recursive Long Short Term Memory Layer). The basic idea of the recurve _ LSTM layer is: using the recursive nature of the tree, a vector representation of the tree is generated by some non-linear operation from vector representations of its root node and set of subtrees.

C2. The vectorized representation of the root node is identical to the vectorized representation of the tree nodes in B2; the vector representation of the subtree set is computed by inputting the subtrees sequentially into the long-short term memory layer. Formally, let the root node of the tree T ═ (V, E) be r, and the set of child nodes in r be C ═ C₁,c₂,…,c_i,…,c_|c|Is a set of corresponding subtrees

Wherein c is_iIs composed of

The root node of (2). The calculation formula of the T vectorization representation is:

wherein,

representing an activation function, W_root、W_pickupAnd W_subtreeIs a parameter. Encode (F) is the final output result of sequentially inputting the vectorized representation of each m-ary tree in F into the LSTM layer, and is represented as:

C3. a Recursive Recurrent Neural Network (RRNN) is designed as a detection module by utilizing a recursion _ LSTM layer. The input to the RRNN includes two parts: 1) k vectorized m-ary trees represent intermediate nodes of the abstract syntax tree; 2) the fixed-length vectors represent leaf nodes of the abstract syntax tree. The operation of RRNN is described as follows:

C32. The Pooling Layer (Pooling Layer) uses three down-sampling functions of maximum value, minimum value and mean value to F at the same time^RDownsampling (pooling) operations were performed in columns. Thus, the pooling layer outputs 3 d-dimensional vectors, denoted Feature^p＝[f^max,f^min,f^mean]^T。

C32. Splicing Layer (splicing Layer) F^pAnd the vector f corresponding to the leaf feature^sSpliced into a vector, Feature^A＝f^max&f^min&f^mean&f^s。(&Express splice mark)

The judgment threshold value needs to be obtained through training, and is adjusted according to the accuracy and the recall rate and is not a fixed value. In the process of training the judgment threshold, setting the accuracy as U, the recall rate as V, and the accuracy U being the number of correct extracted information pieces/the number of extracted information pieces, which is also called precision rate; the recall ratio V is the number of correct extracted information pieces/number of information pieces in a sample, and is also called recall ratio. The precision ratio and the recall ratio both take values between 0 and 1, and the closer the value is to 1, the higher the precision ratio or the recall ratio is. The decision threshold may be adjusted based on accuracy and recall. Generally, the accuracy Precision is how many retrieved items are accurate, and the Recall is how many all accurate items are retrieved. In practice, it is of course desirable that the search result Precision is as high as possible, and that the Recall is as high as possible, but in fact, the two are in some cases contradictory. For example, in an extreme case, only one result is searched in the experiment and is accurate, then Precision is 100%, but Recall is very low; if all results are returned, then for example Recall is 100%, but Precision is low. Therefore, in different situations, it is necessary to judge whether a higher Precision or a higher Recall is desired.

The details of RRNN are shown in Table 1-1. In the training process, a binary cross information entropy function is adopted as a loss function, a method SGD (stochastic gradient descent) is adopted as a training method, the number of samples in each batch of training is 32, and the training iteration number is 1000.

TABLE 1 AST _ RRNN method detailed description of detection module parameters

The invention is further illustrated by the following examples.

Example (b):

the scheme adopts supervised training, and the mainstream method for training the deep neural network is a random gradient descent (SGD) method and a deformation form thereof. The method inputs a group of training samples into the neural network every time, and updates parameters of the neural network by using the value of the objective function until the value of the objective function is converged. The specific updating method is to move all parameters in the neural network by a small step along the direction of gradient decrease of the objective function (the opposite direction of the derivative).

The sample set of the example is selected, and the sample set contains a large number of normal scripts and 6669 WebShell scripts. 100000 scripts are drawn from the normal sample set for training the token's word vector. The remaining normal scripts are randomly extracted 6669, and form a training set of classification problems together with all WebShell scripts.

Tables 1-2 example training set and test set partitioning of data sets

	Training set	Test set	Total of
				WebShell script	5187	1482	6669
Normal script	5187	1482	6669

1) Firstly, using the lexical analysis results of 100000 PHP scripts as input;

2) generating an abstract syntax tree by using the PHP-parser;

3) the determination of 4 key parameters in the sample generation module is ① n, namely the scale of a limiting tree, ② m, namely the scale of sub nodes of the limiting tree, ③ K, namely the size of a sampling subtree set, ④ K, namely the number of final input m-ary trees, in the RRNN training process, when any abstract syntax tree T is (V, E), when a sample is constructed, n is fixed to 1000, m is fixed to 10, K is min (50,

) And K is min (K, 10). And when the training of the RRNN model is finished, fixing 3 parameter values in each training process, respectively taking different values for the residual variables, and recording the detection result.

4) In the test process, it is found that the detection effect of the AST _ RRNN method is generally improved when the values of n, m, K, and K are increased. Therefore, in the detection process, the values of the 4 parameters can be properly increased according to the size of the abstract syntax tree, so that the detection accuracy is improved.

The AST _ RRNN method utilizes two types of features, ① features extracted from leaf nodes, ② features extracted from abstract syntax trees, and retrains the parameters that adjust the RRNN based on the trained RRNN using the leaf node features and the abstract syntax tree features, respectively.

1) Accuracy 0.9886 using leaf node features and abstract syntax tree features as input;

2) accuracy 0.7649 when only leaf node features are used;

3) accuracy 0.8659 when only abstract syntax tree features are used;

the detection effect of the abstract syntax tree as the characteristic is obviously higher than that of the detection result of the leaf node as the characteristic, which shows that the structural information in the abstract syntax tree is important for WebShell detection. Moreover, only by using a single feature, both the abstract syntax tree and the leaf node can cause the accuracy rate to be reduced by at least 10%, the phenomenon is that the leaf node features describe key information of a data transmission part, the abstract syntax tree is an accurate description of a data execution part, and the two functions together guarantee the detection result of the AST _ RRNN method.

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A WebShell detection method based on a deep neural network is characterized in that a recursive cyclic neural network based on an abstract syntax tree automatically acquires lexical and syntactic information of a script aiming at a script language, and completes feature extraction and WebShell detection by utilizing hierarchical structural features of the abstract syntax tree, wherein the WebShell detection method comprises a preprocessing process, a sample generation process and a detection process; the method specifically comprises the following steps:

A. the script file preprocessing process comprises the following steps:

the input is script source code, the preprocessing comprises lexical analysis, syntactic analysis and simplification, and the output is abstract syntax tree T ═ (V, E), wherein V is a set of leaf nodes in T, and E is a set of edges in T;

B. and (3) a sample generation process:

inputting leaf nodes comprising the simplified abstract syntax tree and the abstract syntax tree; the method comprises the following steps: compressing the abstract syntax tree and vectorizing the abstract syntax tree, wherein the vectorized abstract syntax tree comprises vectorized expressions of tree nodes and leaf nodes;

C. adopting a deep neural network to carry out WebShell detection: aiming at the tree structure of the abstract syntax tree, the deep neural network adopts a recursion cycle neural network; the method comprises the following steps:

C1. defining a neural network layer as a recursive long-term and short-term memory layer aiming at a tree structure, wherein the recursive long-term and short-term memory layer utilizes the recursive characteristic of the tree and is represented by vectors of a root node and a subtree set of the tree, and the vectors of the tree are represented by nonlinear operation;

C2. the vectorization representation method of the root node in the tree structure adopts the same method as the vectorization representation of the tree node in the step B; vector representation of a sub-tree set in the tree structure is generated by inputting sub-trees into a recursive long-short term memory layer in sequence;

C3. designing a recurrent neural network RRNN as a detection module by utilizing the recurrent long and short term memory layer;

the inputs to the RRNN include: k vectorized m-order multi-way trees representing intermediate nodes of the abstract syntax tree; a fixed-length vector representing a leaf node of the abstract syntax tree; the operation process of the RRNN comprises the following steps:

the bottom of RRNN comprises k recursive long and short term memory layers sharing weight, corresponding to k m-trees, and outputting k x d dimensional characteristics through calculation, which is recorded as Feature^R＝[f¹,f²,…,f^k]^T，f^kIs k d-dimension feature vector;

c32.RRNN pooling layerSimultaneously using three down-sampling functions of maximum value, minimum value and mean value to Feature^RThe down sampling operation is carried out according to the columns, and the pooling layer outputs three d-dimensional vectors which are marked as Feature^p＝[f^max,f^min,f^mean]^TWherein f is^maxD-dimensional vector, f, output using maximum sampling function for pooling layers^minD-dimensional vector, f, output using minimum sampling function for pooling layers^meanD-dimensional vectors output by using a mean sampling function for the pooling layer;

RRNN splice layer Feature^pAnd vector f corresponding to leaf feature^sSpliced into a vector to obtain a spliced Feature vector Feature^A＝f^max&f^min&f^mean&f^s，&Representing a splice;

c34.RRNN full connectivity layer utilizes Feature vector Feature^AAnd performing WebShell judgment.

2. The WebShell detection method as recited in claim 1, wherein the step of preprocessing the script file specifically comprises:

A2. carrying out lexical analysis on the lexical unit flow to construct an abstract syntax tree;

A3. and filtering the lexical unit streams subjected to the syntactic analysis to remove semantic irrelevant information so as to achieve the purpose of simplifying the abstract syntax tree.

3. The WebShell detection method of claim 2, wherein the step a3 of simplifying the abstract syntax tree comprises the steps of:

A31. deleting all leaf nodes of the abstract syntax tree, simultaneously, carrying out vectorization processing on the leaf nodes by adopting a simple characteristic engineering method when generating a sample, wherein the leaf node characteristics are not lost;

4. The WebShell detection method of claim 1, wherein the step B sample generation process comprises:

B1. compression of abstract syntax trees: limiting the size of the abstract syntax tree by using an n-node sampling sub-tree and an m-ary tree transformation method; vectorization representation of the leaf nodes can be completed by utilizing a characteristic engineering method;

B2. vectorization represents tree nodes: the vectorization coding method adopts a one-hot coding method, and adopts a node v of a one-hot coding vectorization abstract syntax tree, which is marked as one _ hot (v) and represents the node type of v; adopting a bag-of-words model vectorization abstract syntax tree T, recording as BoW (T), and representing the number of each type of nodes in the T;

B3. vectorization represents a leaf node: and extracting the leaf nodes of the character string scalar type to obtain a danger function characteristic and a character string statistical characteristic.

5. The WebShell detection method of claim 4, wherein step B1 includes the following steps:

B11. repeatedly calling the n-node sampling subtree algorithm for K times for any abstract syntax tree T ═ V, E, generating a sampling subtree set with the size of K, and recording the sampling subtree set as K

Wherein for any

Satisfy the requirement of

The scale of (a) does not exceed n;

sampling a sub-tree generated by a sub-tree algorithm for the nth node at the Kth time;

B12. at F_sampleIn order to obtain a sizeSubset F of k_selectI.e. F_select∈F_sampleAnd | F_selectI | ═ k, such that

Satisfies the following formula:

wherein the T () function is a value evaluation function for evaluating the argument F_subThe expressed sampling subtree set judges the 'value' of the conclusion to WebShell or understands the evaluation F_subThe amount of information which can contribute to the WebShell detection conclusion is obtained; the value evaluation function T () is defined as follows:

wherein, ω is₁、ω₂、ω₃Is a constant; coverage function σ (), suspicion function

The diversity function pi () has three value ranges as the interval [0,1 ]]Respectively, for measuring F_selectCoverage, suspicion, and variability of; the cost evaluation function T () is a linear sum of three types of metric values.

6. The WebShell detection method of claim 5, wherein, in the definition of the value evaluation function T (),

wherein, ω is₁、ω₂、ω₃Is a constant; preferably, ω is₁、ω₂、ω₃Are all set to 1.

7. The WebShell detection method of claim 5, wherein the coverage function σ () is F_selectRatio of size of node set to | V |:

wherein,

is composed of

A set of points;

function of doubtability

Is defined as: suppose that

Is two n-node sampling subtrees of the abstract syntax tree T; if it is

Just corresponding to the WebShell functional part in the source code,

and the corresponding part of the obfuscated code without malice is detected by the WebShell,

ratio of

More has 'doubtful degree'; for node v_iRatio v_jMore has 'doubtful degree'; the suspiciousness of the node v is defined as:

wherein, c_vThe WebShell represents the number of times that v appears in all WebShell scripts in the training set; c. C_vAll means v is trainingTraining the occurrence frequency of all scripts; defining n-node sampling subtree T_sampleThe suspicion degree of (2) is the mean value of the suspicion degrees of all nodes:

the definition of the diversity function pi () is specifically: definition F_selectThe diversity of (a) is as follows:

wherein, the distance between two trees is calculated through Tree _ Diversity ().

8. The WebShell detection method of claim 1, wherein in step C2, assuming that the root node of the tree T ═ (V, E) is r, the set of child nodes in r is C ═ C₁,c₂,…,c_i,…,c_|c|Is the corresponding subtree set

Wherein c is_iIs composed of

A root node of; c. C_|c|The last child node of the root node r; the tree T is represented vectorized by equation 1:

wherein,

representing an activation function; w_root、W_pickupAnd W_subtreeIs a parameter; encode (F) is the final output result of the recursive long and short term memory layer which is sequentially input by the vectorization representation of each m-ary tree in F, and is expressed as formula 2.

9. The WebShell detection method of claim 1, wherein in step C34, the fully-connected layer utilizes Feature vectors Feature^AMaking a WebShell decision, specifically, giving a decision threshold when Feature^AAnd when the judgment threshold is exceeded, identifying the file as the WebShell file.

10. The WebShell detection system realized by the WebShell detection method of claims 1-9 comprises a preprocessing module, a sample generation module and a detection module;

the preprocessing module takes script source codes as input by using a syntax analyzer and outputs an abstract syntax tree through syntax analysis;

the sample generation module is configured to translate an abstract syntax tree into vector expressions that facilitate training and prediction by a detection module, and includes: performing vectorization representation on leaf nodes by adopting feature engineering and utilizing a simple matching rule and a statistic calculation method; limiting the scale of an abstract syntax tree part consisting of intermediate nodes through a sampling algorithm, and replacing an original abstract syntax tree by using a group of sampling subtrees with smaller scale;

the detection module is a deep neural network model, constructs a recurrent neural network, self-defines a recurrent long-term and short-term memory layer and provides bottom-up operation on the tree structure; the bottom end of the recurrent neural network is a recurrent _ LSTM layer with k shared parameters, the input of the recurrent _ LSTM layer is k tree structures, the operation result is spliced with the vector expression of the leaf nodes after being processed by a pooling layer, and finally the operation result is input into a full connection layer for WebShell detection.