CN107516041A - WebShell detection method and system based on deep neural network - Google Patents

WebShell detection method and system based on deep neural network

Info

Publication number
CN107516041A
CN107516041A (application CN201710705914.1A / CN201710705914A)
Authority
CN
China
Prior art keywords
tree
abstract syntax
webshell
syntax tree
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710705914.1A
Other languages
Chinese (zh)
Other versions
CN107516041B (en)
Inventor
张涛
齐龙晨
宁戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing An Punuo Information Technology Co Ltd
Original Assignee
Beijing An Punuo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing An Punuo Information Technology Co Ltd filed Critical Beijing An Punuo Information Technology Co Ltd
Priority to CN201710705914.1A priority Critical patent/CN107516041B/en
Publication of CN107516041A publication Critical patent/CN107516041A/en
Application granted granted Critical
Publication of CN107516041B publication Critical patent/CN107516041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a WebShell detection method and system based on a deep neural network. A recursive recurrent neural network based on the abstract syntax tree automatically obtains the lexical and syntactic information of a script, and uses the hierarchical structure of the abstract syntax tree to complete feature extraction and WebShell detection; the method comprises preprocessing, sample generation and WebShell detection. The lexical and syntactic information of the script is obtained first, and the recursive recurrent neural network based on the abstract syntax tree then completes feature extraction and WebShell detection. The method has low deployment cost, good portability and high detection accuracy.

Description

WebShell detection method and system based on deep neural network
Technical Field
The invention relates to the technical field of information security, in particular to a WebShell detection method and detection system using a recursive recurrent neural network based on an abstract syntax tree.
Background
WebShell is a command execution environment in the form of a web page, often used by intruders as a backdoor tool for operating web servers. An attacker who obtains a WebShell gains management privileges over the Web service, thereby achieving penetration and control of the Web application.
Since the characteristics of a WebShell and of an ordinary Web page are almost identical, WebShells can evade detection by traditional firewalls and antivirus software. Moreover, as various feature-obfuscation and hiding technologies for anti-detection are applied to WebShells, the traditional detection mode based on signature matching has difficulty detecting new variants in time.
From the attacker's perspective, a WebShell is a script Trojan backdoor written in ASP, ASPX, PHP, JSP or the like. After invading a website, an attacker usually uploads such script files to a directory of the Web server. By accessing the script file through a browser, the attacker can control the Web server, for example reading data from the website database or deleting files on the website server; if the Web service runs with high privileges, system shell commands can even be executed directly.
Existing WebShell detection methods are mainly white-box detection, i.e. detection performed on the source code of the WebShell script file; they can be divided into host-based detection and network-based detection.
Host-based detection: among these methods, the detection approach most common in industry is to use known keywords directly as features, search for suspicious files with grep statements and then analyze the suspicious files manually, or to periodically check the MD5 values of existing files and use a program to check whether new files have been generated. Such intuitive detection is easily circumvented by attackers using obfuscation.
Network-based detection: existing methods mainly focus on configuring an intrusion detection system, i.e. a WAF, at the network entrance to detect WebShells, and judge whether an attacker is uploading HTML or script files by analyzing whether special keywords (e.g. <form, <% and similar markers) appear in the traffic. This approach requires a large expenditure and may also produce false alarms; furthermore, it can only detect the act of uploading a WebShell, and cannot detect WebShells that already exist on the server.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a WebShell detection method and detection system using a recursive recurrent neural network based on an abstract syntax tree. The invention introduces a programming language processing technology and a deep learning technology at the same time: the programming language processing technology automatically acquires the lexical and syntactic information of the script, and a deep neural network completes feature extraction and WebShell detection. The method is mainly aimed at mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby and the like. The system mainly comprises three modules: a preprocessing module using the programming language processing technology, a sample generation module completing the vectorized representation, and a detection module using the deep learning technology. The method has the advantages of low deployment cost, good portability and high detection accuracy.
The following are several typical neural network model-related term definitions:
The operation of a neural network layer can be defined as:

o^(i) = φ(W^(i) · x^(i) + b^(i))

where o^(i) is the output vector of the i-th layer of the neural network, and the dimension of o^(i) is the number of neurons (network nodes) of that layer; x^(i) is the output of the (i-1)-th layer of the network and serves as the input of the i-th layer; W^(i) and b^(i) are the parameters of the i-th layer; φ is the activation function, which is generally a non-linear function. A neural network layer of this form is called a fully connected layer (Full Connection Layer).
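A minimal numpy sketch of the fully connected layer defined above; the function name and the choice of tanh as the default activation are illustrative assumptions:

```python
import numpy as np

def fully_connected(x, W, b, phi=np.tanh):
    """One fully connected layer: o^(i) = phi(W^(i) x^(i) + b^(i))."""
    return phi(W @ x + b)

# Example: a layer with 4 inputs and 3 neurons.
x = np.ones(4)
W, b = np.random.rand(3, 4), np.zeros(3)
print(fully_connected(x, W, b))
```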
A Recurrent Neural Network (RNN) is used to process sequence inputs. The recurrent neural network processes one input sequence element at a time while maintaining the historical state of all past time sequence elements with one hidden unit.
The calculation of a recurrent neural network layer is:

s_t = φ(U · x_t + W · s_{t-1})
o_t = V · s_t

where x_t is the input vector at time step t, s_t is the hidden unit vector and o_t is the output vector; W, U and V are parameters, and φ is an activation function.
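A minimal numpy sketch of the recurrent layer described above, processing one sequence element at a time while carrying the hidden state; the names and the tanh activation are illustrative assumptions:

```python
import numpy as np

def rnn(xs, U, W, V, phi=np.tanh):
    """s_t = phi(U x_t + W s_{t-1}); o_t = V s_t, for each element x_t of the sequence."""
    s = np.zeros(W.shape[0])
    outputs = []
    for x in xs:
        s = phi(U @ x + W @ s)     # hidden state carries the history of the sequence
        outputs.append(V @ s)
    return outputs, s
```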
The Pooling Layer first appeared in convolutional neural networks as a down-sampling window that slides over the input matrix. Each time, the pooling layer down-samples the corresponding matrix sub-region according to the sampling function, then slides to the next position by the specified stride until the whole input matrix has been sampled, and finally outputs the sampled result matrix to the next layer. The most common sampling methods are maximum, minimum and mean sampling.
The Concatenation Layer merges its k input vectors into one output vector, namely:

o = i_1 & i_2 & … & i_k

where & denotes the concatenation operator.
The technical scheme provided by the invention is as follows:
A WebShell detection method based on a deep neural network: built on a recursive recurrent neural network over the abstract syntax tree (AST_RRNN, Recurrent Neural Network based on the abstract syntax tree), it uses the hierarchical structure of the abstract syntax tree to perform WebShell detection on mainstream scripting languages. The method comprises a preprocessing process, a sample generation process and a detection process, and specifically comprises the following steps (the overall flow is shown in Fig. 1):
A. First, the script file is preprocessed. The preprocessing module comprises a lexical analyzer, a syntax analyzer and a simplification step; its input is the script source code and its output is an abstract syntax tree (AST). The specific steps are as follows (a code sketch follows this list):
A1. Perform lexical analysis on the program code to generate a token (lexical unit) stream;
A2. The syntax analyzer parses the token stream to construct an abstract syntax tree;
A3. Simplification: after syntactic analysis, semantically irrelevant information such as comments is filtered out.
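A minimal, illustrative sketch of the preprocessing step. Python's built-in ast module stands in here for a PHP lexer/parser such as PHP-Parser, and the preprocess function name is an assumption for illustration, not part of the patent:

```python
import ast

def preprocess(source: str) -> ast.AST:
    """Parse script source into an abstract syntax tree and drop semantically
    irrelevant information (comments are discarded by the tokenizer; docstrings
    are removed here)."""
    tree = ast.parse(source)  # lexical + syntactic analysis in one call
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                node.body = body[1:]   # strip the docstring
    return tree

simplified = preprocess("print(eval(input()))")
print(ast.dump(simplified, indent=2))
```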
B. Sample generation. The sample generation module of the AST_RRNN WebShell detection method takes two kinds of input: the simplified AST and the leaf nodes of the AST. Because differences in the size of the abstract syntax tree (the number of nodes in the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before vectorization. The sample generation module is responsible for converting the abstract syntax tree into a vectorized representation that facilitates training and prediction by the detection module. The steps are as follows:
B1. Compression of the abstract syntax tree mainly uses the concepts and methods of the n-node sampling subtree and the m-ary tree transformation to limit the size of the abstract syntax tree; in addition, a simple feature engineering method is used to complete the vectorized representation of the leaf nodes.
B2. Vectorized representation of tree nodes. One-hot encoding, the most intuitive vectorization method, is adopted: the one-hot encoding of a node v of the abstract syntax tree, denoted one_hot(v), represents the node type of v; a bag-of-words model is used to vectorize the abstract syntax tree T, denoted BoW(T), representing the number of nodes of each type in T.
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, character strings and the like. The method only focuses on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
C. Detection process: a deep neural network is adopted as the detection module. For the tree structure of the abstract syntax tree, a recursive recurrent neural network is adopted. The steps are as follows:
C1. For the tree structure, the scheme defines a neural network layer: the Recursive Long Short-Term Memory Layer (Recursive_LSTM). The Recursive_LSTM layer exploits the recursive nature of a tree: the vector representation of a tree is generated, by a non-linear operation, from the vector representations of its root node and of its set of subtrees.
C2. The vectorized representation of the root node in the tree structure is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the recursive long short-term memory layer. Let the root node of the tree T = (V, E) be r, let the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and let the set of corresponding subtrees be F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed as formula 1:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))   (formula 1)

where φ is an activation function; W_root, W_pickup and W_subtree are parameters; one_hot(r) is the one-hot type encoding of r, and Pickup(r) is the bag-of-words vector of the nodes of the original subtree rooted at r that were removed during compression (see the sample generation module). Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary tree in F sequentially into the Recursive_LSTM layer, as formula 2:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))   (formula 2)
C3. A Recursive Recurrent Neural Network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN includes two parts: 1) k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:
C31. The bottom of the RRNN consists of k Recursive_LSTM layers with shared weights, which process the k m-ary trees and output a k×d-dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The Pooling Layer applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column. The pooling layer therefore outputs 3 d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The Concatenation Layer concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s (& denotes concatenation); f_s encodes the leaf-node features such as information entropy, longest word, coincidence index, compression ratio and danger functions.
C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision.
Specifically, given a decision threshold, when the output computed by the fully connected layer from Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
The decision threshold is obtained through training; it is adjusted according to precision and recall and is not a fixed value. When training the decision threshold, let the precision be U and the recall be V: precision U = number of correctly extracted information items / number of extracted information items; recall V = number of correctly extracted information items / number of information items in the sample. Both precision and recall take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall. In the implementation of the invention, a Precision-Recall curve is drawn to assist the analysis, and the parameters are selected accordingly.
The invention discloses a recursive recurrent neural network detection method based on the abstract syntax tree. First, the script code is converted into an abstract syntax tree using a lexical analyzer and a syntax analyzer. Then, a compression algorithm for the abstract syntax tree is designed. Finally, for the structural characteristics of the abstract syntax tree, the invention provides a recursive recurrent neural network model as the detection module.
In another aspect, the present invention further provides a WebShell detection system based on a recursive recurrent neural network over the abstract syntax tree, where the system includes:
1. The preprocessing module, which uses a syntax analyzer, takes the script source code as input and outputs an abstract syntax tree through syntactic analysis;
2. The sample generation module of the AST_RRNN method, which is responsible for converting the abstract syntax tree into a vector representation that facilitates training and prediction by the detection module. The vectorized representation of the abstract syntax tree is divided into two parts: 1) the leaf nodes are vectorized by feature engineering, using simple matching rules and statistical calculations; 2) a sampling algorithm is designed to limit the scale of the part of the abstract syntax tree formed by the intermediate nodes; the basic idea is to replace the original abstract syntax tree with a group of smaller sampled subtrees;
3. The detection module, which is a deep neural network model that constructs a Recursive Recurrent Neural Network (RRNN). The user-defined recursive long short-term memory layer Recursive_LSTM provides a bottom-up operation over the tree structure. Because the input of the RRNN is k tree structures, the bottom of the RRNN consists of k Recursive_LSTM layers sharing parameters; their outputs are processed by a pooling layer, concatenated with the vector representation of the leaf nodes, and finally fed into a subsequent fully connected layer.
The invention has the beneficial effects that:
The invention provides a WebShell detection method and system based on a recursive recurrent neural network over the abstract syntax tree. The method introduces a programming language processing technology and a deep learning technology at the same time: for mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby and the like, the programming language processing technology automatically acquires the lexical and syntactic information of the script, and a deep neural network completes feature extraction and WebShell detection. Using the technical scheme provided by the invention for WebShell detection has the following advantages:
1) The features are automatically extracted, and the dependence on feature engineering is avoided;
2) The portability is good, and the thought and the flow are suitable for any scripting language;
3) The static detection method can be deployed at a Web server side in a light weight mode, and is low in deployment and detection cost;
4) The detection accuracy is high: compared with various existing detection approaches, the WebShell detection method based on a recursive recurrent neural network over the abstract syntax tree can effectively deal with relatively new WebShell files (such as 0-day WebShells), and also has a good detection effect on deformed, encrypted and already deployed WebShell files.
Drawings
Fig. 1 is a flow chart of the WebShell detection method provided by the present invention.
Fig. 2 is a block diagram of a flow of a preprocessing module in the WebShell file detection process in the embodiment of the present invention.
Fig. 3 is a block flow diagram of a sample generation module in the WebShell file detection process in the embodiment of the present invention.
Fig. 4 is a block diagram of a flow of a detection module in the WebShell file detection process in the embodiment of the present invention.
Fig. 5 is a block diagram of the system provided by the present invention.
Detailed Description
The invention will be further described below by way of examples with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The invention provides a WebShell detection method and system based on a recursive recurrent neural network over the abstract syntax tree. The system comprises a preprocessing module, a sample generation module and a detection module; the detection of WebShell files on a website is realized through the following process. The specific implementation is as follows (PHP scripts are taken as an example; other scripting languages are handled in the same way):
A. The preprocessing module. This part comprises a lexical analyzer, a syntax analyzer and a simplification step, specifically as follows (see Fig. 2):
A1. The lexical analyzer takes a PHP file F containing program code (script source code) and generates a token stream WS after lexical analysis;
A2. The syntax analyzer PHP-Parser parses WS to construct an abstract syntax tree AST.
The parsing process typically filters out semantically irrelevant information such as comments. Syntactic analysis builds on lexical analysis, and its rules are stricter than the lexical rules. Compared with the token stream, the abstract syntax tree reflects the structural information of the code more accurately.
A3. Simplifying the abstract syntax tree. The structure of the abstract syntax tree produced by PHP-Parser is clear but slightly redundant, so the abstract syntax tree needs to be simplified. The simplification steps are as follows (a code sketch follows this list):
A31. Delete all leaf nodes of the abstract syntax tree; at the same time, in order not to lose the leaf-node features, the sample generation module vectorizes the leaf nodes using a simple feature engineering method;
A32. The intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
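A minimal illustration of this simplification, assuming the parser output has been exported as nested dictionaries with a 'nodeType' field; the dictionary layout, the retained type prefixes and the simplify function are assumptions for illustration, not the patent's implementation:

```python
KEEP_PREFIXES = ("Stmt_", "Expr_", "Scalar_")   # declaration, expression, scalar types

def simplify(node):
    """Drop leaf nodes and keep only declaration/expression/scalar intermediate nodes."""
    kept = []
    for child in node.get("children", []):
        if not child.get("children"):            # leaf node: dropped here; it is
            continue                             # vectorized later by the sample module
        if child["nodeType"].startswith(KEEP_PREFIXES):
            kept.append(simplify(child))
        else:                                    # auxiliary node: splice its children upward
            kept.extend(simplify(child).get("children", []))
    return {"nodeType": node["nodeType"], "children": kept}
```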
B. The sample generation module (see Fig. 3 for details):
B1. Compression of the abstract syntax tree. Since differences in the size of the abstract syntax tree (the number of nodes in the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before it is vectorized. The size of the abstract syntax tree is limited mainly by the concepts and methods of the n-node sampling subtree and the m-ary tree transformation. The specific compression steps are as follows:
B11. For any abstract syntax tree T = (V, E), the n-node sampling subtree algorithm is called K times. The result returned by each call is called a sampling subtree, so this step finally produces a set of sampling subtrees of size K, denoted F_sample = {T_sample^1, T_sample^2, …, T_sample^K}, where for any T_sample^i ∈ F_sample, the size of T_sample^i does not exceed n.
B12. In F_sample, a subset F_select of size k is determined, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)
where the T() function is a value evaluation function, used to evaluate the 'value' of the set of sampling subtrees represented by the argument F_sub for reaching a WebShell conclusion, in other words, to evaluate the amount of information that F_sub can contribute to the WebShell detection conclusion. The form and meaning of the T() function need to be customized; the value evaluation function is defined as:

T(F_select) = ω_1·σ(F_select) + ω_2·δ(F_select) + ω_3·π(F_select)

where ω_1, ω_2 and ω_3 are weights, all set to 1 in this scheme; the coverage function σ(), the suspicion function δ() and the diversity function π() all take values in [0, 1] and are used to measure, respectively, the coverage, suspicion and diversity of F_select. T() is thus the linear sum of the three kinds of index values. Specifically:
The coverage function σ(). In this scheme, F_select is expected to contain as many nodes of T as possible so as to obtain more information about T. Coverage is therefore defined as the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |∪_{T_sample^i ∈ F_select} V_sample^i| / |V|

where V_sample^i is the node set of T_sample^i.
The suspicion function δ(). Suppose T_sample^i and T_sample^j are two n-node sampling subtrees of the abstract syntax tree T. If T_sample^i corresponds exactly to the WebShell functional part of the source code while T_sample^j corresponds to a non-malicious obfuscated code part, then in the WebShell detection problem T_sample^i is obviously more 'suspicious' than T_sample^j. Similarly, a node v_i can be more 'suspicious' than a node v_j. The suspicion of a node v is therefore defined as:

δ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v occurs in all WebShell scripts in the training set, and c_v^All is the number of times v occurs in all scripts in the training set. The suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicion of all its nodes:

δ(T_sample) = (1 / |V_sample|) · Σ_{v ∈ V_sample} δ(v)

Accordingly, the suspicion of F_select is defined as the mean of the suspicion of all its n-node sampling subtrees:

δ(F_select) = (1 / |F_select|) · Σ_{T_sample ∈ F_select} δ(T_sample)
The diversity function π(). If T_sample^i and T_sample^j have almost the same node types and structure, then the set {T_sample^i, T_sample^j} is unlikely to provide more useful information than {T_sample^i} alone. The sampling subtrees in F_select are therefore expected to be as dissimilar as possible. The diversity of F_select is defined as the (normalized) average of the pairwise tree distances Tree_Diversity(T_i, T_j) over all pairs of sampling subtrees T_i, T_j in F_select, where Tree_Diversity() is a tree distance algorithm that computes the distance between two trees. (A sketch of the value evaluation follows.)
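A sketch of the value evaluation under stated assumptions: each sampling subtree is represented as a set of node identifiers, node suspicion scores (c_v^WebShell / c_v^All) are precomputed, and a Jaccard-style distance stands in for the patent's Tree_Diversity() algorithm, which is not reproduced here:

```python
from itertools import combinations

def coverage(subtrees, all_nodes):
    covered = set().union(*subtrees) if subtrees else set()
    return len(covered) / max(len(all_nodes), 1)

def suspicion(subtrees, node_suspicion):
    per_tree = [sum(node_suspicion[v] for v in t) / len(t) for t in subtrees]
    return sum(per_tree) / max(len(per_tree), 1)

def diversity(subtrees):
    pairs = list(combinations(subtrees, 2))
    if not pairs:
        return 0.0
    dist = lambda a, b: 1.0 - len(a & b) / len(a | b)   # assumed stand-in distance
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def subtree_value(subtrees, all_nodes, node_suspicion, w=(1.0, 1.0, 1.0)):
    """T(F_sub) = w1*coverage + w2*suspicion + w3*diversity."""
    return (w[0] * coverage(subtrees, all_nodes)
            + w[1] * suspicion(subtrees, node_suspicion)
            + w[2] * diversity(subtrees))
```

Step B12 can then be approximated greedily or by enumerating the size-k subsets of F_sample and keeping the one with the largest subtree_value.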
B13. An m-ary tree transformation algorithm is executed on all the sampling subtrees in F_select; the result is denoted F_transfer. The m-ary tree transformation algorithm limits the number of child nodes of any node to m, and guarantees that a tree of any size n has size smaller than 2n after the transformation. F_transfer, instead of the abstract syntax tree T, is used as the input of the detection module.
The m-ary tree transformation algorithm. The idea is as follows: if the child node set C of a node v exceeds the size limit, i.e. |C| > m, a layer of padding nodes is inserted between v and C until the number of children of v satisfies the limit. A padding node only reduces the number of children; it carries no syntactic or semantic information, and its vector representation is defined as the 0 vector (a sketch of the transformation follows below).
Let T_sample be an n-node sampling subtree and let T_transfer be the tree obtained after the m-ary tree transformation; obviously, the number of children of every node of T_transfer is at most m. The padding nodes introduced during the m-ary tree transformation are all intermediate nodes of T_transfer, so the number of padding nodes is necessarily smaller than |V_sample|, i.e. the size of T_transfer is smaller than 2|V_sample|.
At this point, the compression process for the abstract syntax tree is complete.
B2. Vectorized representation of tree nodes. One-hot encoding, the most intuitive vectorization method, is adopted for the tree nodes: the one-hot encoding of a node v of the abstract syntax tree, denoted one_hot(v), represents the node type of v; a bag-of-words model is used to vectorize the abstract syntax tree T, denoted BoW(T), representing the number of nodes of each type in T.
Let T_transfer = (V_transfer, E_transfer) be generated from an n-node sampling subtree of T = (V, E) by the m-ary tree transformation. For an arbitrary non-padding node v in V_transfer, let T^(v) and T_transfer^(v) denote the subtrees of T and T_transfer rooted at v. In the vectorization process, the vectorized representation of v consists of two parts. The first part represents the type of node v and uses one-hot encoding:

Encode(v) = one_hot(v)

The second part represents the set of nodes of T^(v) that were not sampled into T_transfer^(v), i.e. the nodes that were not 'picked back'; it is computed as:

Pickup(v) = BoW(T^(v)) − BoW(T_transfer^(v))

For padding nodes, both parts are specified as 0 vectors. (A sketch of this node encoding follows.)
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, character strings and the like. The method only focuses on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
A danger-function list is established for the scripting language, and each string scalar node is checked against the list to see whether it contains a danger-function field. The danger-function feature is represented as a bag-of-words vector whose length equals the length of the danger-function list. The string statistical features are interpreted from a mathematical point of view: after a string has been obfuscated, encoded or encrypted, some of its mathematical statistics usually deviate from the probability distribution of strings in a normal script. This is also the rationale of NeoPi (an open-source tool published by Neohapsis on GitHub). NeoPi is a script tool written in Python that detects malicious code in text and script files using various statistical methods, mainly by extracting the information entropy, the longest word, the coincidence index and the compression ratio of a file. This method selects four important indexes from NeoPi, namely string length, coincidence index, information entropy and file compression ratio, and examines every string constant in the script.
String Length (Length of String). The string constants in the normal code are concise, and the code fragments are embedded into the string constants by part of WebShell, so that long strings are more likely to appear in WebShell scripts compared with normal scripts.
Coincidence Index (Index of Coincidence). The coincidence index is one way to determine whether a file is encrypted or encoded. The calculation formula is as follows:
IC(s) = Σ_i f_i·(f_i − 1) / (N·(N − 1))

where f_i is the number of occurrences of character i in the string s, and N is the length of the string. Statistics show that the coincidence index of meaningful English text is about 0.0667, while that of a completely random English string is about 0.0385. That is, when the coincidence index of an English string is close to 0.0385, it tends to be considered encrypted or encoded, from which it can further be inferred that the script is likely to be a WebShell.
Entropy of Information (Entropy of Information). Information entropy is a basic concept in information theory and is a measure of the degree of system ordering. The calculation formula is as follows:
H(s) = −Σ_i p_i·log p_i

where p_i is the proportion of character i in the string s. When a string is pseudo-randomized by encryption or encoding, its information entropy increases; therefore, the larger the entropy value, the higher the possibility of a WebShell.
The file compression ratio, here taken as the ratio of the compressed file size to the uncompressed file size. The essence of data compression is to eliminate the imbalance in the distribution of characters, achieving a shorter total length by assigning short codes to high-frequency characters and long codes to low-frequency ones. A web page document encoded in base64, with non-ASCII characters removed, shows a smaller distribution imbalance, compresses less well, and therefore has a larger compression ratio. It is calculated as follows:

CompressionRatio(s) = length(zip(s)) / length(s)

where zip() compresses the data and length() computes the data length. (A sketch computing these leaf statistics follows.)
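A sketch computing the string-scalar leaf features described above (length, index of coincidence, information entropy, compression ratio, and danger-function hits); the DANGER_FUNCTIONS list is an illustrative assumption, not the patent's actual list:

```python
import math
import zlib
from collections import Counter

DANGER_FUNCTIONS = ["eval", "assert", "system", "exec", "base64_decode", "shell_exec"]

def index_of_coincidence(s):
    n = len(s)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(s).values()) / (n * (n - 1))

def entropy(s):
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def compression_ratio(s):
    data = s.encode("utf-8", errors="ignore")
    return len(zlib.compress(data)) / max(len(data), 1)

def leaf_features(s):
    danger_hits = [float(fn in s) for fn in DANGER_FUNCTIONS]   # bag-of-words over the list
    return [len(s), index_of_coincidence(s), entropy(s), compression_ratio(s)] + danger_hits

print(leaf_features("system(base64_decode($_POST['x']));"))
```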
C. The detection module (see Fig. 4). A deep neural network is adopted as the detection module; for the tree structure of the abstract syntax tree, a recursive recurrent neural network is adopted, specifically as follows:
C1. For the tree structure, the scheme defines a new neural network layer: the Recursive Long Short-Term Memory Layer (Recursive_LSTM). The basic idea of the Recursive_LSTM layer is to use the recursive nature of a tree: the vector representation of a tree is generated, by a non-linear operation, from the vector representations of its root node and of its set of subtrees.
C2. The vectorized representation of the root node is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the long short-term memory layer. Formally, let the root node of the tree T = (V, E) be r, let the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and let the set of corresponding subtrees be F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed as:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))

where φ denotes an activation function, and W_root, W_pickup and W_subtree are parameters. Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary tree in F sequentially into the LSTM layer:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))

(A simplified sketch of this recursive encoding follows.)
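A simplified numpy sketch of the recursive encoding, assuming each node of the transformed tree is a dictionary carrying its precomputed 'one_hot' and 'pickup' vectors and its 'children'; the dimensions, the plain recurrent update standing in for the LSTM, and all names are illustrative assumptions:

```python
import numpy as np

D, T_TYPES = 32, 5                      # illustrative encoding dimension and type-vocabulary size
rng = np.random.default_rng(0)
W_root, W_pickup = rng.normal(size=(D, T_TYPES)), rng.normal(size=(D, T_TYPES))
W_subtree = rng.normal(size=(D, D))
U, W_rec = rng.normal(size=(D, D)), rng.normal(size=(D, D))   # stand-in for the LSTM weights

def encode_subtree_set(child_encodings):
    """Encode(F): sequential pass over the child-subtree encodings."""
    h = np.zeros(D)
    for c in child_encodings:
        h = np.tanh(U @ c + W_rec @ h)
    return h

def encode_tree(node):
    """Encode(T) = tanh(W_root*one_hot(r) + W_pickup*Pickup(r) + W_subtree*Encode(F))."""
    enc_f = encode_subtree_set([encode_tree(c) for c in node.get("children", [])])
    return np.tanh(W_root @ node["one_hot"] + W_pickup @ node["pickup"] + W_subtree @ enc_f)
```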
C3. A Recursive Recurrent Neural Network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN includes two parts: 1) k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:
C31. The bottom of the RRNN consists of k Recursive_LSTM layers with shared weights, which process the k m-ary trees and output a k×d-dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The Pooling Layer applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column. The pooling layer therefore outputs 3 d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The Concatenation Layer concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s, where & denotes concatenation.
C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision. (A sketch of the RRNN forward pass follows.)
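A sketch of the full RRNN forward pass under the same assumptions, reusing encode_tree from the previous sketch; W_fc and b_fc are the fully connected layer's (assumed) weight vector and bias, and the sigmoid output is compared with the decision threshold discussed below:

```python
import numpy as np

def rrnn_forward(m_ary_trees, leaf_vector, W_fc, b_fc):
    """k shared-weight recursive encoders, column-wise max/min/mean pooling,
    concatenation with the leaf-feature vector, and a sigmoid fully connected output."""
    feature_r = np.stack([encode_tree(t) for t in m_ary_trees])     # shape (k, d)
    feature_p = np.concatenate([feature_r.max(axis=0),
                                feature_r.min(axis=0),
                                feature_r.mean(axis=0)])            # shape (3d,)
    feature_a = np.concatenate([feature_p, leaf_vector])            # Feature_A
    return 1.0 / (1.0 + np.exp(-(W_fc @ feature_a + b_fc)))         # WebShell score

# is_webshell = rrnn_forward(trees, leaf_vec, W_fc, b_fc) > threshold
```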
The decision threshold is obtained through training; it is adjusted according to precision and recall and is not a fixed value. When training the decision threshold, let the precision be U and the recall be V: precision U = number of correctly extracted information items / number of extracted information items; recall V = number of correctly extracted information items / number of information items in the sample. Both precision and recall take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall. Generally speaking, Precision measures how many of the retrieved items are accurate, while Recall measures how many of all accurate items are retrieved. In practice one would of course like both Precision and Recall to be as high as possible, but the two can be contradictory. For example, in an extreme case, if only one result is retrieved and it is accurate, Precision is 100% but Recall is very low; if, on the other hand, all results are returned, Recall is 100% but Precision will be low. Therefore, in different situations one must decide whether a higher Precision or a higher Recall is desired. (A sketch of selecting the threshold from the precision-recall curve follows.)
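A sketch of threshold selection from the precision-recall trade-off, assuming validation scores from the trained RRNN and ground-truth labels are available; the min_precision target is an illustrative assumption:

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_score, min_precision=0.98):
    """Choose the threshold that maximizes recall while keeping precision above the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    best = None
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and (best is None or r > best[1]):
            best = (t, r, p)
    return best   # (threshold, recall, precision), or None if the target is unreachable
```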
The details of the RRNN are given in Table 1-1. In the training process, a binary cross-entropy function is used as the loss function, stochastic gradient descent (SGD) is used as the training method, the batch size per training step is 32, and the number of training iterations is 1000.
Table 1-1: Detailed parameters of the detection module of the AST_RRNN method
The invention is further illustrated by the following examples.
Example (b):
The scheme adopts supervised training. The mainstream method for training deep neural networks is stochastic gradient descent (SGD) and its variants: each step, a batch of training samples is fed into the neural network and the parameters of the neural network are updated using the value of the objective function, until the value of the objective function converges. The specific update moves all parameters of the neural network a small step in the direction in which the objective function's gradient decreases (the opposite direction of the derivative). (A minimal sketch of the update step follows.)
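A minimal sketch of one such update step, assuming the parameters and their gradients are numpy arrays; the learning rate is an illustrative assumption:

```python
def sgd_step(params, grads, lr=0.01):
    """Move every parameter a small step against the gradient of the objective."""
    for p, g in zip(params, grads):
        p -= lr * g        # in-place update of a numpy array
```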
The sample set of this example contains a large number of normal scripts and 6669 WebShell scripts. 100000 scripts are drawn from the normal sample set for training the token word vectors. From the remaining normal scripts, 6669 are randomly extracted and, together with all the WebShell scripts, form the training set of the classification problem.
Table 1-2: Training-set and test-set partitioning of the example data set

                    Training set    Test set    Total
  WebShell script       5187          1482      6669
  Normal script         5187          1482      6669
1) Firstly, using the lexical analysis results of 100000 PHP scripts as input;
2) Generating an abstract syntax tree by using the PHP-parser;
3) Determination of the 4 key parameters in the sample generation module: (1) n, which limits the size of a tree; (2) m, which limits the number of child nodes of a tree node; (3) K, the size of the sampled subtree set; (4) k, the number of m-ary trees finally given as input. During RRNN training, for any abstract syntax tree T = (V, E), when constructing samples, n is fixed to 1000, m is fixed to 10, K = min(50, …) and k = min(K, 10). After the RRNN model has been trained, three of the parameter values are fixed in each training run while the remaining variable takes different values, and the detection results are recorded.
4) During testing, it is found that the detection effect of the AST_RRNN method generally improves when the values of n, m, K and k are increased. Therefore, in the detection process, the values of these 4 parameters can be increased appropriately according to the size of the abstract syntax tree, so as to improve the detection accuracy.
The AST_RRNN method uses two types of features: (1) features extracted from the leaf nodes; (2) features extracted from the abstract syntax tree. On the basis of the trained RRNN, the parameters of the RRNN are re-trained and adjusted using the leaf-node features and the abstract-syntax-tree features separately.
1) Accuracy of 0.9886 when both the leaf-node features and the abstract syntax tree features are used as input;
2) Accuracy of 0.7649 when only the leaf-node features are used;
3) Accuracy of 0.8659 when only the abstract syntax tree features are used.
The detection effect with the abstract syntax tree as the feature is clearly higher than with the leaf nodes as the feature, which shows that the structural information in the abstract syntax tree is important for WebShell detection. Moreover, using only a single feature, whether the abstract syntax tree or the leaf nodes, reduces the accuracy by at least 10%. The reason is that the leaf-node features describe the key information of the data transmission part, while the abstract syntax tree is an accurate description of the data execution part; together, the two guarantee the detection result of the AST_RRNN method.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of this disclosure and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A WebShell detection method based on a deep neural network, characterized in that a recursive recurrent neural network based on an abstract syntax tree automatically acquires the lexical and syntactic information of a script for a scripting language, and completes feature extraction and WebShell detection by using the hierarchical structural features of the abstract syntax tree; the WebShell detection method comprises a preprocessing process, a sample generation process and a detection process, and specifically comprises the following steps:
A. the script file preprocessing process comprises the following steps:
the input is script source code, the preprocessing comprises lexical analysis, syntactic analysis and simplification, and the output is abstract syntax tree T = (V, E);
B. and (3) a sample generation process:
the input comprises the simplified abstract syntax tree and the leaf nodes of the abstract syntax tree; the process comprises: compressing the abstract syntax tree and vectorizing the abstract syntax tree, wherein the vectorized abstract syntax tree comprises vectorized representations of the tree nodes and of the leaf nodes;
C. adopting a deep neural network to carry out WebShell detection: aiming at the tree structure of the abstract syntax tree, the deep neural network adopts a recursion cycle neural network; the method comprises the following steps:
C1. defining, for the tree structure, a neural network layer called the recursive long short-term memory layer; the recursive long short-term memory layer uses the recursive nature of a tree, and the vector representation of the tree is generated by a non-linear operation from the vector representations of its root node and of its subtree set;
C2. the vectorization representation method of the root node in the tree structure adopts the same method as the vectorization representation of the tree node in the step B; vector representation of a sub-tree set in the tree structure is generated by inputting sub-trees into a recursive long-short term memory layer in sequence;
C3. designing a recurrent neural network RRNN as a detection module by utilizing the recurrent long and short term memory layer;
the inputs to the RRNN include: k vectorized m-ary trees representing intermediate nodes of the abstract syntax tree; a fixed-length vector representing a leaf node of the abstract syntax tree; the operation process of the RRNN comprises the following steps:
C31. the bottom of the RRNN comprises k recursive long short-term memory layers sharing weights, corresponding to the k m-ary trees, and outputs a k×d-dimensional feature through calculation, denoted Feature_R = [f_1, f_2, …, f_k]^T;
C32. the pooling layer of the RRNN applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column, and the pooling layer outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T;
C33. the concatenation layer of the RRNN concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector, obtaining the concatenated feature vector Feature_A = f_max & f_min & f_mean & f_s, where & denotes concatenation;
C34. the fully connected layer of the RRNN uses the feature vector Feature_A to make the WebShell decision.
2. The WebShell detection method as recited in claim 1, wherein the step of preprocessing the script file specifically comprises:
A1. performing lexical analysis on the program codes to generate a lexical unit stream;
A2. performing syntactic analysis on the token stream to construct an abstract syntax tree;
A3. filtering out, after the syntactic analysis, the semantically irrelevant information, so as to simplify the abstract syntax tree.
3. The WebShell detection method of claim 2, wherein the step A3 of simplifying the abstract syntax tree comprises the steps of:
A31. deleting all leaf nodes of the abstract syntax tree, simultaneously, carrying out vectorization processing on the leaf nodes by adopting a simple characteristic engineering method when generating a sample, wherein the leaf node characteristics are not lost;
A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
4. The WebShell detection method of claim 1, wherein the step B sample generation process comprises:
B1. compression of abstract syntax trees: limiting the size of the abstract syntax tree by using an n-node sampling sub-tree and an m-ary tree transformation method; vectorization representation of the leaf nodes can be completed by utilizing a characteristic engineering method;
B2. vectorization represents tree nodes: the vectorization coding method adopts a one-hot coding method, and adopts a node v of a one-hot coding vectorization abstract syntax tree, which is marked as one _ hot (v) and represents the node type of v; adopting a bag-of-words model vectorization abstract syntax tree T, recording as BoW (T), and representing the number of each type of node in T;
B3. vectorization represents a leaf node: and extracting the leaf nodes of the character string scalar type to obtain a danger function characteristic and a character string statistical characteristic.
5. The WebShell detection method as recited in claim 4, wherein the step B1 comprises the following steps:
B11. for any abstract syntax tree T = (V, E), repeatedly calling the n-node sampling subtree algorithm K times, generating a set of sampling subtrees of size K, denoted F_sample = {T_sample^1, T_sample^2, …, T_sample^K}, where for any T_sample^i ∈ F_sample, the size of T_sample^i does not exceed n;
B12. obtaining in F_sample a subset F_select of size k, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)

where the T() function is a value evaluation function, used to evaluate the 'value' of the set of sampling subtrees represented by the argument F_sub for reaching a WebShell conclusion, that is, the amount of information F_sub can contribute to the WebShell detection conclusion; the value evaluation function T() is defined as follows:

T(F_select) = ω_1·σ(F_select) + ω_2·δ(F_select) + ω_3·π(F_select)

where ω_1, ω_2 and ω_3 are constants; the coverage function σ(), the suspicion function δ() and the diversity function π() all take values in [0, 1] and are used to measure, respectively, the coverage, suspicion and diversity of F_select; the value evaluation function T() is the linear sum of the three kinds of index values.
6. The WebShell detection method of claim 5, wherein, in the definition of the value evaluation function T(), ω_1, ω_2 and ω_3 are constants; preferably, ω_1, ω_2 and ω_3 are all set to 1.
7. The WebShell detection method of claim 5, wherein the coverage function σ() is the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |∪_{T_sample^i ∈ F_select} V_sample^i| / |V|

where V_sample^i is the node set of T_sample^i;
the suspicion function δ() is defined as follows: suppose T_sample^i and T_sample^j are two n-node sampling subtrees of the abstract syntax tree T; if T_sample^i corresponds exactly to the WebShell functional part of the source code and T_sample^j corresponds to a non-malicious obfuscated code part, then in the WebShell detection problem T_sample^i is more 'suspicious' than T_sample^j; similarly, a node v_i can be more 'suspicious' than a node v_j; the suspicion of a node v is defined as:

δ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v occurs in all WebShell scripts in the training set, and c_v^All is the number of times v occurs in all scripts in the training set; the suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicion of all its nodes:

δ(T_sample) = (1 / |V_sample|) · Σ_{v ∈ V_sample} δ(v)

accordingly, the suspicion of F_select is defined as the mean of the suspicion of all its n-node sampling subtrees:

δ(F_select) = (1 / |F_select|) · Σ_{T_sample ∈ F_select} δ(T_sample)

the diversity function π() is defined as follows: the diversity of F_select is the (normalized) average of the pairwise tree distances between the sampling subtrees in F_select, where the distance between two trees is calculated by Tree_Diversity().
8. The WebShell detection method as recited in claim 1, wherein in step C2, assuming that the root node of the tree T = (V, E) is r, the set of child nodes of r is C = {c_1, c_2, …, c_i, …, c_|C|}, and the set of corresponding subtrees is F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i); the tree T is represented in vectorized form by formula 1:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))   (formula 1)

where φ() denotes an activation function; W_root, W_pickup and W_subtree are parameters; Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary subtree in F sequentially into the recursive long short-term memory layer, expressed by formula 2:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))   (formula 2)
9. The WebShell detection method of claim 1, wherein in step C34, the fully connected layer uses the feature vector Feature_A to make the WebShell decision; specifically, given a decision threshold, when the output computed from Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
10. A WebShell detection system implemented by the WebShell detection method of any one of claims 1-9, comprising a preprocessing module, a sample generation module and a detection module;
the preprocessing module takes script source codes as input by using a syntax analyzer and outputs an abstract syntax tree through syntax analysis;
the sample generation module is configured to translate an abstract syntax tree into vector expressions that facilitate training and prediction by a detection module, and includes: performing vectorization representation on leaf nodes by adopting feature engineering and utilizing a simple matching rule and a statistic calculation method; limiting the scale of an abstract syntax tree part consisting of intermediate nodes through a sampling algorithm, and replacing an original abstract syntax tree by using a group of sampling subtrees with smaller scale;
the detection module is a deep neural network model that constructs a recursive recurrent neural network, with a user-defined recursive long short-term memory layer providing a bottom-up operation on the tree structure; the bottom of the recursive recurrent neural network consists of k Recursive_LSTM layers with shared parameters, whose inputs are the k tree structures; the operation results are processed by a pooling layer, concatenated with the vector representation of the leaf nodes, and finally fed into a fully connected layer for WebShell detection.
CN201710705914.1A 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network Active CN107516041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN107516041A true CN107516041A (en) 2017-12-26
CN107516041B CN107516041B (en) 2020-04-03

Family

ID=60723188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710705914.1A Active CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN107516041B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895420A (en) * 2010-07-12 2010-11-24 西北工业大学 Rapid detection method for network flow anomaly
CN103971054A (en) * 2014-04-25 2014-08-06 天津大学 Detection method for browser extension vulnerabilities based on behavior sequences
CN105069355A (en) * 2015-08-26 2015-11-18 厦门市美亚柏科信息股份有限公司 Static detection method and apparatus for deformed webshells

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376283A (en) * 2018-01-08 2018-08-07 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108376283B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 Method for automatically completing codes based on LSTM
GB2585616A (en) * 2018-04-16 2021-01-13 Ibm Using gradients to detect backdoors in neural networks
WO2019202436A1 (en) * 2018-04-16 2019-10-24 International Business Machines Corporation Using gradients to detect backdoors in neural networks
US11132444B2 (en) 2018-04-16 2021-09-28 International Business Machines Corporation Using gradients to detect backdoors in neural networks
CN110502897A (en) * 2018-05-16 2019-11-26 南京大学 Webpage malicious JavaScript code identification and de-obfuscation method based on hybrid analysis
CN109101235A (en) * 2018-06-05 2018-12-28 北京航空航天大学 Intelligent analysis method for software program
CN109101235B (en) * 2018-06-05 2021-03-19 北京航空航天大学 Intelligent analysis method for software program
CN108898015B (en) * 2018-06-26 2021-07-27 暨南大学 Application layer dynamic intrusion detection system and detection method based on artificial intelligence
CN108898015A (en) * 2018-06-26 2018-11-27 暨南大学 Application layer dynamic intrusion detection system and detection method based on artificial intelligence
CN108985061A (en) * 2018-07-05 2018-12-11 北京大学 A kind of webshell detection method based on Model Fusion
CN109120617A (en) * 2018-08-16 2019-01-01 辽宁大学 Polymorphic worm detection method based on frequency CNN
CN109120617B (en) * 2018-08-16 2020-11-17 辽宁大学 Polymorphic worm detection method based on frequency CNN
CN109240922A (en) * 2018-08-30 2019-01-18 北京大学 Method for webshell detection by extracting webshell software genes based on RASP
CN109462575A (en) * 2018-09-28 2019-03-12 东巽科技(北京)有限公司 A kind of webshell detection method and device
CN109462575B (en) * 2018-09-28 2021-09-07 东巽科技(北京)有限公司 Webshell detection method and device
CN109657466A (en) * 2018-11-26 2019-04-19 杭州英视信息科技有限公司 Function-level software vulnerability detection method
CN109635563A (en) * 2018-11-30 2019-04-16 北京奇虎科技有限公司 Method, apparatus, device and storage medium for identifying malicious applications
CN109684844A (en) * 2018-12-27 2019-04-26 北京神州绿盟信息安全科技股份有限公司 A kind of webshell detection method and device
CN109684844B (en) * 2018-12-27 2020-11-20 北京神州绿盟信息安全科技股份有限公司 Webshell detection method and device, computing equipment and computer-readable storage medium
CN109905385A (en) * 2019-02-19 2019-06-18 中国银行股份有限公司 A kind of webshell detection method, apparatus and system
CN109905385B (en) * 2019-02-19 2021-08-20 中国银行股份有限公司 Webshell detection method, device and system
CN111614599A (en) * 2019-02-25 2020-09-01 北京金睛云华科技有限公司 Webshell detection method and device based on artificial intelligence
CN111611150A (en) * 2019-02-25 2020-09-01 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111611150B (en) * 2019-02-25 2024-03-22 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111614599B (en) * 2019-02-25 2022-06-14 北京金睛云华科技有限公司 Webshell detection method and device based on artificial intelligence
CN109933602A (en) * 2019-02-28 2019-06-25 武汉大学 A kind of conversion method and device of natural language and structured query language
CN109933602B (en) * 2019-02-28 2021-05-04 武汉大学 Method and device for converting natural language and structured query language
CN110086788A (en) * 2019-04-17 2019-08-02 杭州安恒信息技术股份有限公司 Deep learning WebShell protection method based on cloud WAF
CN110232280A (en) * 2019-06-20 2019-09-13 北京理工大学 Software security vulnerability detection method based on tree structure convolutional neural network
CN110232280B (en) * 2019-06-20 2021-04-13 北京理工大学 Software security vulnerability detection method based on tree structure convolutional neural network
WO2020259260A1 (en) * 2019-06-28 2020-12-30 华为技术有限公司 Structured query language (sql) injection detecting method and device
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN110855661A (en) * 2019-11-11 2020-02-28 杭州安恒信息技术股份有限公司 WebShell detection method, device, equipment and medium
CN110855661B (en) * 2019-11-11 2022-05-13 杭州安恒信息技术股份有限公司 WebShell detection method, device, equipment and medium
CN111198817A (en) * 2019-12-30 2020-05-26 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN111198817B (en) * 2019-12-30 2021-06-04 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN113094706A (en) * 2020-01-08 2021-07-09 深信服科技股份有限公司 WebShell detection method, device, equipment and readable storage medium
CN111741002B (en) * 2020-06-23 2022-02-15 广东工业大学 Method and device for training network intrusion detection model
CN111741002A (en) * 2020-06-23 2020-10-02 广东工业大学 Method and device for training network intrusion detection model
CN112118225A (en) * 2020-08-13 2020-12-22 紫光云(南京)数字技术有限公司 Webshell detection method and device based on RNN
CN112035099A (en) * 2020-09-01 2020-12-04 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112035099B (en) * 2020-09-01 2024-03-15 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112132262B (en) * 2020-09-08 2022-05-20 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN112132262A (en) * 2020-09-08 2020-12-25 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN112487368B (en) * 2020-12-21 2023-05-05 中国人民解放军陆军炮兵防空兵学院 Function level confusion detection method based on graph convolution network
CN112487368A (en) * 2020-12-21 2021-03-12 中国人民解放军陆军炮兵防空兵学院 Function level confusion detection method based on graph convolution network
CN113190849A (en) * 2021-04-28 2021-07-30 重庆邮电大学 Webshell script detection method and device, electronic equipment and storage medium
CN113190849B (en) * 2021-04-28 2023-03-03 重庆邮电大学 Webshell script detection method and device, electronic equipment and storage medium
EP4105802A1 (en) * 2021-06-17 2022-12-21 Cylance Inc. Method, computer-readable medium and system to detect malicious software in hierarchically structured files
CN113810375A (en) * 2021-08-13 2021-12-17 网宿科技股份有限公司 Webshell detection method, device and equipment and readable storage medium
CN114462033A (en) * 2021-12-21 2022-05-10 天翼云科技有限公司 Method and device for constructing script file detection model and storage medium
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell

Also Published As

Publication number Publication date
CN107516041B (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN107516041B (en) WebShell detection method and system based on deep neural network
CN111639344B (en) Vulnerability detection method and device based on neural network
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
WO2020259260A1 (en) Structured query language (sql) injection detecting method and device
CN111600919B (en) Method and device for constructing intelligent network application protection system model
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN111597803B (en) Element extraction method and device, electronic equipment and storage medium
CN111737289B (en) Method and device for detecting SQL injection attack
CN108664512B (en) Text object classification method and device
CN111931935B (en) Network security knowledge extraction method and device based on One-shot learning
CN110191096A (en) Word vector webpage intrusion detection method based on semantic analysis
CN111090860A (en) Code vulnerability detection method and device based on deep learning
CN109067708B (en) Method, device, equipment and storage medium for detecting webpage backdoor
CN111758098A (en) Named entity identification and extraction using genetic programming
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
Wang et al. File fragment type identification with convolutional neural networks
CN114329474A (en) Malicious software detection method integrating machine learning and deep learning
CN112966507A (en) Method, device, equipment and storage medium for constructing recognition model and identifying attack
CN113971283A (en) Malicious application program detection method and device based on features
CN116226864A (en) Network security-oriented code vulnerability detection method and system
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN114722389A (en) Webshell file detection method and device, electronic device and readable storage medium
An et al. Deep learning based webshell detection coping with long text and lexical ambiguity
CN113259369A (en) Data set authentication method and system based on machine learning member inference attack
Miao et al. AST2Vec: A Robust Neural Code Representation for Malicious PowerShell Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant