CN107516041A - WebShell detection method and system based on deep neural network - Google Patents

WebShell detection method and system based on deep neural network

Info

Publication number
CN107516041A
CN107516041A (application CN201710705914.1A / CN201710705914A)
Authority
CN
China
Prior art keywords
tree
abstract syntax
webshell
syntax tree
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710705914.1A
Other languages
Chinese (zh)
Other versions
CN107516041B (en)
Inventor
张涛
齐龙晨
宁戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing An Punuo Information Technology Co Ltd
Original Assignee
Beijing An Punuo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing An Punuo Information Technology Co Ltd filed Critical Beijing An Punuo Information Technology Co Ltd
Priority to CN201710705914.1A priority Critical patent/CN107516041B/en
Publication of CN107516041A publication Critical patent/CN107516041A/en
Application granted granted Critical
Publication of CN107516041B publication Critical patent/CN107516041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/563Static detection by source code analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a WebShell detection method and system based on a deep neural network. A recursive recurrent neural network based on the abstract syntax tree automatically obtains the lexical and syntactic information of a script, and uses the hierarchical structure of the abstract syntax tree to complete feature extraction and WebShell detection; the method comprises preprocessing, sample generation and WebShell detection. The lexical and syntactic information of the script is obtained first, and the recursive recurrent neural network based on the abstract syntax tree then completes feature extraction and WebShell detection. The method has low deployment cost, good portability and high detection accuracy.

Description

WebShell detection method and system based on deep neural network
Technical Field
The invention relates to the technical field of information security, in particular to a WebShell detection method and detection system using a recursive recurrent neural network based on an abstract syntax tree.
Background
WebShell is a command execution environment in the form of a web page, often used by intruders as a backdoor tool for operating web servers. An attacker who obtains a WebShell gains management privileges over the Web service, thereby achieving penetration and control of the Web application.
Since the characteristics of a WebShell and of an ordinary Web page are almost identical, WebShells can evade detection by traditional firewalls and antivirus software. Moreover, as various feature-obfuscation and hiding technologies for anti-detection are applied to WebShells, the traditional detection mode based on signature matching has difficulty detecting new variants in time.
From the attacker's perspective, a WebShell is a script Trojan backdoor written in ASP, ASPX, PHP, JSP or the like. After invading a website, an attacker usually uploads such script files to a directory of the Web server. By accessing the script file through a browser, the attacker can control the Web server, for example reading data from the website database or deleting files on the website server; if the Web service runs with high privileges, system shell commands can even be executed directly.
Existing WebShell detection methods are mainly white-box detection, i.e. detection performed on the source code of the WebShell script file; they can be divided into host-based detection and network-based detection.
Host-based detection: among these methods, the detection approach most common in industry is to use known keywords directly as features, search for suspicious files with grep statements and then analyze the suspicious files manually, or to periodically check the MD5 values of existing files and use a program to check whether new files have been generated. Such intuitive detection is easily circumvented by attackers using obfuscation.
Network-based detection: existing methods mainly focus on configuring an intrusion detection system, i.e. a WAF, at the network entrance to detect WebShells, and judge whether an attacker is uploading HTML or script files by analyzing whether special keywords (e.g. <form, <% and similar markers) appear in the traffic. This approach requires a large expenditure and may also produce false alarms; furthermore, it can only detect the act of uploading a WebShell, and cannot detect WebShells that already exist on the server.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a WebShell detection method and detection system using a recursive recurrent neural network based on an abstract syntax tree. The invention introduces a programming language processing technology and a deep learning technology at the same time: the programming language processing technology automatically acquires the lexical and syntactic information of the script, and a deep neural network completes feature extraction and WebShell detection. The method is mainly aimed at mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby and the like. The system mainly comprises three modules: a preprocessing module using the programming language processing technology, a sample generation module completing the vectorized representation, and a detection module using the deep learning technology. The method has the advantages of low deployment cost, good portability and high detection accuracy.
The following are several typical neural network model-related term definitions:
The operation of a neural network layer can be defined as:

o^(i) = φ(W^(i) · x^(i) + b^(i))

where o^(i) is the output vector of the i-th layer of the neural network, and the dimension of o^(i) is the number of neurons (network nodes) of that layer; x^(i) is the output of the (i-1)-th layer of the network and serves as the input of the i-th layer; W^(i) and b^(i) are the parameters of the i-th layer; φ is the activation function, which is generally a non-linear function. A neural network layer of this form is called a fully connected layer (Full Connection Layer).
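A minimal numpy sketch of the fully connected layer defined above; the function name and the choice of tanh as the default activation are illustrative assumptions:

```python
import numpy as np

def fully_connected(x, W, b, phi=np.tanh):
    """One fully connected layer: o^(i) = phi(W^(i) x^(i) + b^(i))."""
    return phi(W @ x + b)

# Example: a layer with 4 inputs and 3 neurons.
x = np.ones(4)
W, b = np.random.rand(3, 4), np.zeros(3)
print(fully_connected(x, W, b))
```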
A Recurrent Neural Network (RNN) is used to process sequence inputs. The recurrent neural network processes one input sequence element at a time while maintaining the historical state of all past time sequence elements with one hidden unit.
The calculation of a recurrent neural network layer is:

s_t = φ(U · x_t + W · s_{t-1})
o_t = V · s_t

where x_t is the input vector at time step t, s_t is the hidden unit vector and o_t is the output vector; W, U and V are parameters, and φ is an activation function.
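A minimal numpy sketch of the recurrent layer described above, processing one sequence element at a time while carrying the hidden state; the names and the tanh activation are illustrative assumptions:

```python
import numpy as np

def rnn(xs, U, W, V, phi=np.tanh):
    """s_t = phi(U x_t + W s_{t-1}); o_t = V s_t, for each element x_t of the sequence."""
    s = np.zeros(W.shape[0])
    outputs = []
    for x in xs:
        s = phi(U @ x + W @ s)     # hidden state carries the history of the sequence
        outputs.append(V @ s)
    return outputs, s
```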
The Pooling Layer first appeared in convolutional neural networks as a down-sampling window that slides over the input matrix. Each time, the pooling layer down-samples the corresponding matrix sub-region according to the sampling function, then slides to the next position by the specified stride until the whole input matrix has been sampled, and finally outputs the sampled result matrix to the next layer. The most common sampling methods are maximum, minimum and mean sampling.
The Concatenation Layer merges its k input vectors into one output vector, namely:

o = i_1 & i_2 & … & i_k

where & denotes the concatenation operator.
The technical scheme provided by the invention is as follows:
A WebShell detection method based on a deep neural network: built on a recursive recurrent neural network over the abstract syntax tree (AST_RRNN, Recurrent Neural Network based on the abstract syntax tree), it uses the hierarchical structure of the abstract syntax tree to perform WebShell detection on mainstream scripting languages. The method comprises a preprocessing process, a sample generation process and a detection process, and specifically comprises the following steps (the overall flow is shown in Fig. 1):
A. First, the script file is preprocessed. The preprocessing module comprises a lexical analyzer, a syntax analyzer and a simplification step; its input is the script source code and its output is an abstract syntax tree (AST). The specific steps are as follows (a code sketch follows this list):
A1. Perform lexical analysis on the program code to generate a token (lexical unit) stream;
A2. The syntax analyzer parses the token stream to construct an abstract syntax tree;
A3. Simplification: after syntactic analysis, semantically irrelevant information such as comments is filtered out.
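A minimal, illustrative sketch of the preprocessing step. Python's built-in ast module stands in here for a PHP lexer/parser such as PHP-Parser, and the preprocess function name is an assumption for illustration, not part of the patent:

```python
import ast

def preprocess(source: str) -> ast.AST:
    """Parse script source into an abstract syntax tree and drop semantically
    irrelevant information (comments are discarded by the tokenizer; docstrings
    are removed here)."""
    tree = ast.parse(source)  # lexical + syntactic analysis in one call
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                node.body = body[1:]   # strip the docstring
    return tree

simplified = preprocess("print(eval(input()))")
print(ast.dump(simplified, indent=2))
```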
B. Sample generation. The sample generation module of the AST_RRNN WebShell detection method takes two kinds of input: the simplified AST and the leaf nodes of the AST. Because differences in the size of the abstract syntax tree (the number of nodes in the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before vectorization. The sample generation module is responsible for converting the abstract syntax tree into a vectorized representation that facilitates training and prediction by the detection module. The steps are as follows:
B1. Compression of the abstract syntax tree mainly uses the concepts and methods of the n-node sampling subtree and the m-ary tree transformation to limit the size of the abstract syntax tree; in addition, a simple feature engineering method is used to complete the vectorized representation of the leaf nodes.
B2. Vectorized representation of tree nodes. One-hot encoding, the most intuitive vectorization method, is adopted: the one-hot encoding of a node v of the abstract syntax tree, denoted one_hot(v), represents the node type of v; a bag-of-words model is used to vectorize the abstract syntax tree T, denoted BoW(T), representing the number of nodes of each type in T.
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, character strings and the like. The method only focuses on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
C. Detection process: a deep neural network is adopted as the detection module. For the tree structure of the abstract syntax tree, a recursive recurrent neural network is adopted. The steps are as follows:
C1. For the tree structure, the scheme defines a neural network layer: the Recursive Long Short-Term Memory Layer (Recursive_LSTM). The Recursive_LSTM layer exploits the recursive nature of a tree: the vector representation of a tree is generated, by a non-linear operation, from the vector representations of its root node and of its set of subtrees.
C2. The vectorized representation of the root node in the tree structure is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the recursive long short-term memory layer. Let the root node of the tree T = (V, E) be r, let the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and let the set of corresponding subtrees be F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed as formula 1:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))   (formula 1)

where φ is an activation function; W_root, W_pickup and W_subtree are parameters; one_hot(r) is the one-hot type encoding of r, and Pickup(r) is the bag-of-words vector of the nodes of the original subtree rooted at r that were removed during compression (see the sample generation module). Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary tree in F sequentially into the Recursive_LSTM layer, as formula 2:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))   (formula 2)
C3. A Recursive Recurrent Neural Network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN includes two parts: 1) k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:
C31. The bottom of the RRNN consists of k Recursive_LSTM layers with shared weights, which process the k m-ary trees and output a k×d-dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The Pooling Layer applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column. The pooling layer therefore outputs 3 d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The Concatenation Layer concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s (& denotes concatenation); f_s encodes the leaf-node features such as information entropy, longest word, coincidence index, compression ratio and danger functions.
C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision.
Specifically, given a decision threshold, when the output computed by the fully connected layer from Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
The decision threshold is obtained through training; it is adjusted according to precision and recall and is not a fixed value. When training the decision threshold, let the precision be U and the recall be V: precision U = number of correctly extracted information items / number of extracted information items; recall V = number of correctly extracted information items / number of information items in the sample. Both precision and recall take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall. In the implementation of the invention, a Precision-Recall curve is drawn to assist the analysis, and the parameters are selected accordingly.
The invention discloses a recursive recurrent neural network detection method based on the abstract syntax tree. First, the script code is converted into an abstract syntax tree using a lexical analyzer and a syntax analyzer. Then, a compression algorithm for the abstract syntax tree is designed. Finally, for the structural characteristics of the abstract syntax tree, the invention provides a recursive recurrent neural network model as the detection module.
In another aspect, the present invention further provides a WebShell detection system based on a recursive recurrent neural network over the abstract syntax tree, where the system includes:
1. The preprocessing module, which uses a syntax analyzer, takes the script source code as input and outputs an abstract syntax tree through syntactic analysis;
2. The sample generation module of the AST_RRNN method, which is responsible for converting the abstract syntax tree into a vector representation that facilitates training and prediction by the detection module. The vectorized representation of the abstract syntax tree is divided into two parts: 1) the leaf nodes are vectorized by feature engineering, using simple matching rules and statistical calculations; 2) a sampling algorithm is designed to limit the scale of the part of the abstract syntax tree formed by the intermediate nodes; the basic idea is to replace the original abstract syntax tree with a group of smaller sampled subtrees;
3. The detection module, which is a deep neural network model that constructs a Recursive Recurrent Neural Network (RRNN). The user-defined recursive long short-term memory layer Recursive_LSTM provides a bottom-up operation over the tree structure. Because the input of the RRNN is k tree structures, the bottom of the RRNN consists of k Recursive_LSTM layers sharing parameters; their outputs are processed by a pooling layer, concatenated with the vector representation of the leaf nodes, and finally fed into a subsequent fully connected layer.
The invention has the beneficial effects that:
The invention provides a WebShell detection method and system based on a recursive recurrent neural network over the abstract syntax tree. The method introduces a programming language processing technology and a deep learning technology at the same time: for mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby and the like, the programming language processing technology automatically acquires the lexical and syntactic information of the script, and a deep neural network completes feature extraction and WebShell detection. Using the technical scheme provided by the invention for WebShell detection has the following advantages:
1) The features are automatically extracted, and the dependence on feature engineering is avoided;
2) The portability is good, and the thought and the flow are suitable for any scripting language;
3) The static detection method can be deployed at a Web server side in a light weight mode, and is low in deployment and detection cost;
4) The detection accuracy is high: compared with various existing detection approaches, the WebShell detection method based on a recursive recurrent neural network over the abstract syntax tree can effectively deal with relatively new WebShell files (such as 0-day WebShells), and also has a good detection effect on deformed, encrypted and already deployed WebShell files.
Drawings
Fig. 1 is a flow chart of the WebShell detection method provided by the present invention.
Fig. 2 is a block diagram of a flow of a preprocessing module in the WebShell file detection process in the embodiment of the present invention.
Fig. 3 is a block flow diagram of a sample generation module in the WebShell file detection process in the embodiment of the present invention.
Fig. 4 is a block diagram of a flow of a detection module in the WebShell file detection process in the embodiment of the present invention.
Fig. 5 is a block diagram of the system provided by the present invention.
Detailed Description
The invention will be further described below by way of examples with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The invention provides a WebShell detection method and system based on a recursive recurrent neural network over the abstract syntax tree. The system comprises a preprocessing module, a sample generation module and a detection module; the detection of WebShell files on a website is realized through the following process. The specific implementation is as follows (PHP scripts are taken as an example; other scripting languages are handled in the same way):
A. The preprocessing module. This part comprises a lexical analyzer, a syntax analyzer and a simplification step, specifically as follows (see Fig. 2):
A1. The lexical analyzer takes a PHP file F containing program code (script source code) and generates a token stream WS after lexical analysis;
A2. The syntax analyzer PHP-Parser parses WS to construct an abstract syntax tree AST.
The parsing process typically filters out semantically irrelevant information such as comments. Syntactic analysis builds on lexical analysis, and its rules are stricter than the lexical rules. Compared with the token stream, the abstract syntax tree reflects the structural information of the code more accurately.
A3. Simplifying the abstract syntax tree. The structure of the abstract syntax tree produced by PHP-Parser is clear but slightly redundant, so the abstract syntax tree needs to be simplified. The simplification steps are as follows (a code sketch follows this list):
A31. Delete all leaf nodes of the abstract syntax tree; at the same time, in order not to lose the leaf-node features, the sample generation module vectorizes the leaf nodes using a simple feature engineering method;
A32. The intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
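A minimal illustration of this simplification, assuming the parser output has been exported as nested dictionaries with a 'nodeType' field; the dictionary layout, the retained type prefixes and the simplify function are assumptions for illustration, not the patent's implementation:

```python
KEEP_PREFIXES = ("Stmt_", "Expr_", "Scalar_")   # declaration, expression, scalar types

def simplify(node):
    """Drop leaf nodes and keep only declaration/expression/scalar intermediate nodes."""
    kept = []
    for child in node.get("children", []):
        if not child.get("children"):            # leaf node: dropped here; it is
            continue                             # vectorized later by the sample module
        if child["nodeType"].startswith(KEEP_PREFIXES):
            kept.append(simplify(child))
        else:                                    # auxiliary node: splice its children upward
            kept.extend(simplify(child).get("children", []))
    return {"nodeType": node["nodeType"], "children": kept}
```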
B. The sample generation module (see Fig. 3 for details):
B1. Compression of the abstract syntax tree. Since differences in the size of the abstract syntax tree (the number of nodes in the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before it is vectorized. The size of the abstract syntax tree is limited mainly by the concepts and methods of the n-node sampling subtree and the m-ary tree transformation. The specific compression steps are as follows:
B11. For any abstract syntax tree T = (V, E), the n-node sampling subtree algorithm is called K times. The result returned by each call is called a sampling subtree, so this step finally produces a set of sampling subtrees of size K, denoted F_sample = {T_sample^1, T_sample^2, …, T_sample^K}, where for any T_sample^i ∈ F_sample, the size of T_sample^i does not exceed n.
B12. In F_sample, a subset F_select of size k is determined, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)
where the T() function is a value evaluation function, used to evaluate the 'value' of the set of sampling subtrees represented by the argument F_sub for reaching a WebShell conclusion, in other words, to evaluate the amount of information that F_sub can contribute to the WebShell detection conclusion. The form and meaning of the T() function need to be customized; the value evaluation function is defined as:

T(F_select) = ω_1·σ(F_select) + ω_2·δ(F_select) + ω_3·π(F_select)

where ω_1, ω_2 and ω_3 are weights, all set to 1 in this scheme; the coverage function σ(), the suspicion function δ() and the diversity function π() all take values in [0, 1] and are used to measure, respectively, the coverage, suspicion and diversity of F_select. T() is thus the linear sum of the three kinds of index values. Specifically:
The coverage function σ(). In this scheme, F_select is expected to contain as many nodes of T as possible so as to obtain more information about T. Coverage is therefore defined as the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |∪_{T_sample^i ∈ F_select} V_sample^i| / |V|

where V_sample^i is the node set of T_sample^i.
The suspicion function δ(). Suppose T_sample^i and T_sample^j are two n-node sampling subtrees of the abstract syntax tree T. If T_sample^i corresponds exactly to the WebShell functional part of the source code while T_sample^j corresponds to a non-malicious obfuscated code part, then in the WebShell detection problem T_sample^i is obviously more 'suspicious' than T_sample^j. Similarly, a node v_i can be more 'suspicious' than a node v_j. The suspicion of a node v is therefore defined as:

δ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v occurs in all WebShell scripts in the training set, and c_v^All is the number of times v occurs in all scripts in the training set. The suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicion of all its nodes:

δ(T_sample) = (1 / |V_sample|) · Σ_{v ∈ V_sample} δ(v)

Accordingly, the suspicion of F_select is defined as the mean of the suspicion of all its n-node sampling subtrees:

δ(F_select) = (1 / |F_select|) · Σ_{T_sample ∈ F_select} δ(T_sample)
The diversity function π(). If T_sample^i and T_sample^j have almost the same node types and structure, then the set {T_sample^i, T_sample^j} is unlikely to provide more useful information than {T_sample^i} alone. The sampling subtrees in F_select are therefore expected to be as dissimilar as possible. The diversity of F_select is defined as the (normalized) average of the pairwise tree distances Tree_Diversity(T_i, T_j) over all pairs of sampling subtrees T_i, T_j in F_select, where Tree_Diversity() is a tree distance algorithm that computes the distance between two trees. (A sketch of the value evaluation follows.)
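A sketch of the value evaluation under stated assumptions: each sampling subtree is represented as a set of node identifiers, node suspicion scores (c_v^WebShell / c_v^All) are precomputed, and a Jaccard-style distance stands in for the patent's Tree_Diversity() algorithm, which is not reproduced here:

```python
from itertools import combinations

def coverage(subtrees, all_nodes):
    covered = set().union(*subtrees) if subtrees else set()
    return len(covered) / max(len(all_nodes), 1)

def suspicion(subtrees, node_suspicion):
    per_tree = [sum(node_suspicion[v] for v in t) / len(t) for t in subtrees]
    return sum(per_tree) / max(len(per_tree), 1)

def diversity(subtrees):
    pairs = list(combinations(subtrees, 2))
    if not pairs:
        return 0.0
    dist = lambda a, b: 1.0 - len(a & b) / len(a | b)   # assumed stand-in distance
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def subtree_value(subtrees, all_nodes, node_suspicion, w=(1.0, 1.0, 1.0)):
    """T(F_sub) = w1*coverage + w2*suspicion + w3*diversity."""
    return (w[0] * coverage(subtrees, all_nodes)
            + w[1] * suspicion(subtrees, node_suspicion)
            + w[2] * diversity(subtrees))
```

Step B12 can then be approximated greedily or by enumerating the size-k subsets of F_sample and keeping the one with the largest subtree_value.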
B13. An m-ary tree transformation algorithm is executed on all the sampling subtrees in F_select; the result is denoted F_transfer. The m-ary tree transformation algorithm limits the number of child nodes of any node to m, and guarantees that a tree of any size n has size smaller than 2n after the transformation. F_transfer, instead of the abstract syntax tree T, is used as the input of the detection module.
The m-ary tree transformation algorithm. The idea is as follows: if the child node set C of a node v exceeds the size limit, i.e. |C| > m, a layer of padding nodes is inserted between v and C until the number of children of v satisfies the limit. A padding node only reduces the number of children; it carries no syntactic or semantic information, and its vector representation is defined as the 0 vector (a sketch of the transformation follows below).
Let T_sample be an n-node sampling subtree and let T_transfer be the tree obtained after the m-ary tree transformation; obviously, the number of children of every node of T_transfer is at most m. The padding nodes introduced during the m-ary tree transformation are all intermediate nodes of T_transfer, so the number of padding nodes is necessarily smaller than |V_sample|, i.e. the size of T_transfer is smaller than 2|V_sample|.
At this point, the compression process for the abstract syntax tree is complete.
B2. Vectorized representation of tree nodes. One-hot encoding, the most intuitive vectorization method, is adopted for the tree nodes: the one-hot encoding of a node v of the abstract syntax tree, denoted one_hot(v), represents the node type of v; a bag-of-words model is used to vectorize the abstract syntax tree T, denoted BoW(T), representing the number of nodes of each type in T.
Let T_transfer = (V_transfer, E_transfer) be generated from an n-node sampling subtree of T = (V, E) by the m-ary tree transformation. For an arbitrary non-padding node v in V_transfer, let T^(v) and T_transfer^(v) denote the subtrees of T and T_transfer rooted at v. In the vectorization process, the vectorized representation of v consists of two parts. The first part represents the type of node v and uses one-hot encoding:

Encode(v) = one_hot(v)

The second part represents the set of nodes of T^(v) that were not sampled into T_transfer^(v), i.e. the nodes that were not 'picked back'; it is computed as:

Pickup(v) = BoW(T^(v)) − BoW(T_transfer^(v))

For padding nodes, both parts are specified as 0 vectors. (A sketch of this node encoding follows.)
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, character strings and the like. The method only focuses on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
A danger-function list is established for the scripting language, and each string scalar node is checked against the list to see whether it contains a danger-function field. The danger-function feature is represented as a bag-of-words vector whose length equals the length of the danger-function list. The string statistical features are interpreted from a mathematical point of view: after a string has been obfuscated, encoded or encrypted, some of its mathematical statistics usually deviate from the probability distribution of strings in a normal script. This is also the rationale of NeoPi (an open-source tool published by Neohapsis on GitHub). NeoPi is a script tool written in Python that detects malicious code in text and script files using various statistical methods, mainly by extracting the information entropy, the longest word, the coincidence index and the compression ratio of a file. This method selects four important indexes from NeoPi, namely string length, coincidence index, information entropy and file compression ratio, and examines every string constant in the script.
String Length (Length of String). The string constants in the normal code are concise, and the code fragments are embedded into the string constants by part of WebShell, so that long strings are more likely to appear in WebShell scripts compared with normal scripts.
Coincidence Index (Index of Coincidence). The coincidence index is one way to determine whether a file is encrypted or encoded. The calculation formula is as follows:
IC(s) = Σ_i f_i·(f_i − 1) / (N·(N − 1))

where f_i is the number of occurrences of character i in the string s, and N is the length of the string. Statistics show that the coincidence index of meaningful English text is about 0.0667, while that of a completely random English string is about 0.0385. That is, when the coincidence index of an English string is close to 0.0385, it tends to be considered encrypted or encoded, from which it can further be inferred that the script is likely to be a WebShell.
Entropy of Information (Entropy of Information). Information entropy is a basic concept in information theory and is a measure of the degree of system ordering. The calculation formula is as follows:
H(s) = −Σ_i p_i·log p_i

where p_i is the proportion of character i in the string s. When a string is pseudo-randomized by encryption or encoding, its information entropy increases; therefore, the larger the entropy value, the higher the possibility of a WebShell.
The file compression ratio, here taken as the ratio of the compressed file size to the uncompressed file size. The essence of data compression is to eliminate the imbalance in the distribution of characters, achieving a shorter total length by assigning short codes to high-frequency characters and long codes to low-frequency ones. A web page document encoded in base64, with non-ASCII characters removed, shows a smaller distribution imbalance, compresses less well, and therefore has a larger compression ratio. It is calculated as follows:

CompressionRatio(s) = length(zip(s)) / length(s)

where zip() compresses the data and length() computes the data length. (A sketch computing these leaf statistics follows.)
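A sketch computing the string-scalar leaf features described above (length, index of coincidence, information entropy, compression ratio, and danger-function hits); the DANGER_FUNCTIONS list is an illustrative assumption, not the patent's actual list:

```python
import math
import zlib
from collections import Counter

DANGER_FUNCTIONS = ["eval", "assert", "system", "exec", "base64_decode", "shell_exec"]

def index_of_coincidence(s):
    n = len(s)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(s).values()) / (n * (n - 1))

def entropy(s):
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def compression_ratio(s):
    data = s.encode("utf-8", errors="ignore")
    return len(zlib.compress(data)) / max(len(data), 1)

def leaf_features(s):
    danger_hits = [float(fn in s) for fn in DANGER_FUNCTIONS]   # bag-of-words over the list
    return [len(s), index_of_coincidence(s), entropy(s), compression_ratio(s)] + danger_hits

print(leaf_features("system(base64_decode($_POST['x']));"))
```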
C. The detection module (see Fig. 4). A deep neural network is adopted as the detection module; for the tree structure of the abstract syntax tree, a recursive recurrent neural network is adopted, specifically as follows:
C1. For the tree structure, the scheme defines a new neural network layer: the Recursive Long Short-Term Memory Layer (Recursive_LSTM). The basic idea of the Recursive_LSTM layer is to use the recursive nature of a tree: the vector representation of a tree is generated, by a non-linear operation, from the vector representations of its root node and of its set of subtrees.
C2. The vectorized representation of the root node is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the long short-term memory layer. Formally, let the root node of the tree T = (V, E) be r, let the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and let the set of corresponding subtrees be F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed as:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))

where φ denotes an activation function, and W_root, W_pickup and W_subtree are parameters. Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary tree in F sequentially into the LSTM layer:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))

(A simplified sketch of this recursive encoding follows.)
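A simplified numpy sketch of the recursive encoding, assuming each node of the transformed tree is a dictionary carrying its precomputed 'one_hot' and 'pickup' vectors and its 'children'; the dimensions, the plain recurrent update standing in for the LSTM, and all names are illustrative assumptions:

```python
import numpy as np

D, T_TYPES = 32, 5                      # illustrative encoding dimension and type-vocabulary size
rng = np.random.default_rng(0)
W_root, W_pickup = rng.normal(size=(D, T_TYPES)), rng.normal(size=(D, T_TYPES))
W_subtree = rng.normal(size=(D, D))
U, W_rec = rng.normal(size=(D, D)), rng.normal(size=(D, D))   # stand-in for the LSTM weights

def encode_subtree_set(child_encodings):
    """Encode(F): sequential pass over the child-subtree encodings."""
    h = np.zeros(D)
    for c in child_encodings:
        h = np.tanh(U @ c + W_rec @ h)
    return h

def encode_tree(node):
    """Encode(T) = tanh(W_root*one_hot(r) + W_pickup*Pickup(r) + W_subtree*Encode(F))."""
    enc_f = encode_subtree_set([encode_tree(c) for c in node.get("children", [])])
    return np.tanh(W_root @ node["one_hot"] + W_pickup @ node["pickup"] + W_subtree @ enc_f)
```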
C3. A Recursive Recurrent Neural Network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN includes two parts: 1) k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:
C31. The bottom of the RRNN consists of k Recursive_LSTM layers with shared weights, which process the k m-ary trees and output a k×d-dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The Pooling Layer applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column. The pooling layer therefore outputs 3 d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The Concatenation Layer concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s, where & denotes concatenation.
C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision. (A sketch of the RRNN forward pass follows.)
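A sketch of the full RRNN forward pass under the same assumptions, reusing encode_tree from the previous sketch; W_fc and b_fc are the fully connected layer's (assumed) weight vector and bias, and the sigmoid output is compared with the decision threshold discussed below:

```python
import numpy as np

def rrnn_forward(m_ary_trees, leaf_vector, W_fc, b_fc):
    """k shared-weight recursive encoders, column-wise max/min/mean pooling,
    concatenation with the leaf-feature vector, and a sigmoid fully connected output."""
    feature_r = np.stack([encode_tree(t) for t in m_ary_trees])     # shape (k, d)
    feature_p = np.concatenate([feature_r.max(axis=0),
                                feature_r.min(axis=0),
                                feature_r.mean(axis=0)])            # shape (3d,)
    feature_a = np.concatenate([feature_p, leaf_vector])            # Feature_A
    return 1.0 / (1.0 + np.exp(-(W_fc @ feature_a + b_fc)))         # WebShell score

# is_webshell = rrnn_forward(trees, leaf_vec, W_fc, b_fc) > threshold
```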
The decision threshold is obtained through training; it is adjusted according to precision and recall and is not a fixed value. When training the decision threshold, let the precision be U and the recall be V: precision U = number of correctly extracted information items / number of extracted information items; recall V = number of correctly extracted information items / number of information items in the sample. Both precision and recall take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall. Generally speaking, Precision measures how many of the retrieved items are accurate, while Recall measures how many of all accurate items are retrieved. In practice one would of course like both Precision and Recall to be as high as possible, but the two can be contradictory. For example, in an extreme case, if only one result is retrieved and it is accurate, Precision is 100% but Recall is very low; if, on the other hand, all results are returned, Recall is 100% but Precision will be low. Therefore, in different situations one must decide whether a higher Precision or a higher Recall is desired. (A sketch of selecting the threshold from the precision-recall curve follows.)
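A sketch of threshold selection from the precision-recall trade-off, assuming validation scores from the trained RRNN and ground-truth labels are available; the min_precision target is an illustrative assumption:

```python
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_score, min_precision=0.98):
    """Choose the threshold that maximizes recall while keeping precision above the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    best = None
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision and (best is None or r > best[1]):
            best = (t, r, p)
    return best   # (threshold, recall, precision), or None if the target is unreachable
```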
The details of the RRNN are given in Table 1-1. In the training process, a binary cross-entropy function is used as the loss function, stochastic gradient descent (SGD) is used as the training method, the batch size per training step is 32, and the number of training iterations is 1000.
Table 1-1: Detailed parameters of the detection module of the AST_RRNN method
The invention is further illustrated by the following examples.
Example (b):
The scheme adopts supervised training. The mainstream method for training deep neural networks is stochastic gradient descent (SGD) and its variants: each step, a batch of training samples is fed into the neural network and the parameters of the neural network are updated using the value of the objective function, until the value of the objective function converges. The specific update moves all parameters of the neural network a small step in the direction in which the objective function's gradient decreases (the opposite direction of the derivative). (A minimal sketch of the update step follows.)
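A minimal sketch of one such update step, assuming the parameters and their gradients are numpy arrays; the learning rate is an illustrative assumption:

```python
def sgd_step(params, grads, lr=0.01):
    """Move every parameter a small step against the gradient of the objective."""
    for p, g in zip(params, grads):
        p -= lr * g        # in-place update of a numpy array
```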
The sample set of this example contains a large number of normal scripts and 6669 WebShell scripts. 100000 scripts are drawn from the normal sample set for training the token word vectors. From the remaining normal scripts, 6669 are randomly extracted and, together with all the WebShell scripts, form the training set of the classification problem.
Table 1-2: Training-set and test-set partitioning of the example data set

                    Training set    Test set    Total
  WebShell script       5187          1482      6669
  Normal script         5187          1482      6669
1) Firstly, using the lexical analysis results of 100000 PHP scripts as input;
2) Generating an abstract syntax tree by using the PHP-parser;
3) Determination of the 4 key parameters in the sample generation module: (1) n, which limits the size of a tree; (2) m, which limits the number of child nodes of a tree node; (3) K, the size of the sampled subtree set; (4) k, the number of m-ary trees finally given as input. During RRNN training, for any abstract syntax tree T = (V, E), when constructing samples, n is fixed to 1000, m is fixed to 10, K = min(50, …) and k = min(K, 10). After the RRNN model has been trained, three of the parameter values are fixed in each training run while the remaining variable takes different values, and the detection results are recorded.
4) During testing, it is found that the detection effect of the AST_RRNN method generally improves when the values of n, m, K and k are increased. Therefore, in the detection process, the values of these 4 parameters can be increased appropriately according to the size of the abstract syntax tree, so as to improve the detection accuracy.
The AST_RRNN method uses two types of features: (1) features extracted from the leaf nodes; (2) features extracted from the abstract syntax tree. On the basis of the trained RRNN, the parameters of the RRNN are re-trained and adjusted using the leaf-node features and the abstract-syntax-tree features separately.
1) Accuracy of 0.9886 when both the leaf-node features and the abstract syntax tree features are used as input;
2) Accuracy of 0.7649 when only the leaf-node features are used;
3) Accuracy of 0.8659 when only the abstract syntax tree features are used.
The detection effect with the abstract syntax tree as the feature is clearly higher than with the leaf nodes as the feature, which shows that the structural information in the abstract syntax tree is important for WebShell detection. Moreover, using only a single feature, whether the abstract syntax tree or the leaf nodes, reduces the accuracy by at least 10%. The reason is that the leaf-node features describe the key information of the data transmission part, while the abstract syntax tree is an accurate description of the data execution part; together, the two guarantee the detection result of the AST_RRNN method.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of this disclosure and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A WebShell detection method based on a deep neural network, characterized in that a recursive recurrent neural network based on an abstract syntax tree automatically acquires the lexical and syntactic information of a script for a scripting language, and completes feature extraction and WebShell detection by using the hierarchical structural features of the abstract syntax tree; the WebShell detection method comprises a preprocessing process, a sample generation process and a detection process, and specifically comprises the following steps:
A. the script file preprocessing process comprises the following steps:
the input is script source code, the preprocessing comprises lexical analysis, syntactic analysis and simplification, and the output is abstract syntax tree T = (V, E);
B. and (3) a sample generation process:
the input comprises the simplified abstract syntax tree and the leaf nodes of the abstract syntax tree; the process comprises: compressing the abstract syntax tree and vectorizing the abstract syntax tree, wherein the vectorized abstract syntax tree comprises vectorized representations of the tree nodes and of the leaf nodes;
C. adopting a deep neural network to carry out WebShell detection: aiming at the tree structure of the abstract syntax tree, the deep neural network adopts a recursion cycle neural network; the method comprises the following steps:
C1. defining, for the tree structure, a neural network layer called the recursive long short-term memory layer; the recursive long short-term memory layer uses the recursive nature of a tree, and the vector representation of the tree is generated by a non-linear operation from the vector representations of its root node and of its subtree set;
C2. the vectorization representation method of the root node in the tree structure adopts the same method as the vectorization representation of the tree node in the step B; vector representation of a sub-tree set in the tree structure is generated by inputting sub-trees into a recursive long-short term memory layer in sequence;
C3. designing a recurrent neural network RRNN as a detection module by utilizing the recurrent long and short term memory layer;
the inputs to the RRNN include: k vectorized m-ary trees representing intermediate nodes of the abstract syntax tree; a fixed-length vector representing a leaf node of the abstract syntax tree; the operation process of the RRNN comprises the following steps:
C31. the bottom of the RRNN comprises k recursive long short-term memory layers sharing weights, corresponding to the k m-ary trees, and outputs a k×d-dimensional feature through calculation, denoted Feature_R = [f_1, f_2, …, f_k]^T;
C32. the pooling layer of the RRNN applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column, and the pooling layer outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T;
C33. the concatenation layer of the RRNN concatenates Feature_P and the vector f_s corresponding to the leaf features into one vector, obtaining the concatenated feature vector Feature_A = f_max & f_min & f_mean & f_s, where & denotes concatenation;
C34. the fully connected layer of the RRNN uses the feature vector Feature_A to make the WebShell decision.
2. The WebShell detection method as recited in claim 1, wherein the step of preprocessing the script file specifically comprises:
A1. performing lexical analysis on the program codes to generate a lexical unit stream;
A2. performing syntactic analysis on the token stream to construct an abstract syntax tree;
A3. filtering out, after the syntactic analysis, the semantically irrelevant information, so as to simplify the abstract syntax tree.
3. The WebShell detection method of claim 2, wherein the step A3 of simplifying the abstract syntax tree comprises the steps of:
A31. deleting all leaf nodes of the abstract syntax tree, simultaneously, carrying out vectorization processing on the leaf nodes by adopting a simple characteristic engineering method when generating a sample, wherein the leaf node characteristics are not lost;
A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
4. The WebShell detection method of claim 1, wherein the step B sample generation process comprises:
B1. compression of abstract syntax trees: limiting the size of the abstract syntax tree by using an n-node sampling sub-tree and an m-ary tree transformation method; vectorization representation of the leaf nodes can be completed by utilizing a characteristic engineering method;
B2. vectorization represents tree nodes: the vectorization coding method adopts a one-hot coding method, and adopts a node v of a one-hot coding vectorization abstract syntax tree, which is marked as one _ hot (v) and represents the node type of v; adopting a bag-of-words model vectorization abstract syntax tree T, recording as BoW (T), and representing the number of each type of node in T;
B3. vectorization represents a leaf node: and extracting the leaf nodes of the character string scalar type to obtain a danger function characteristic and a character string statistical characteristic.
5. The WebShell detection method as recited in claim 4, wherein the step B1 comprises the following steps:
B11. for any abstract syntax tree T = (V, E), repeatedly calling the n-node sampling subtree algorithm K times, generating a set of sampling subtrees of size K, denoted F_sample = {T_sample^1, T_sample^2, …, T_sample^K}, where for any T_sample^i ∈ F_sample, the size of T_sample^i does not exceed n;
B12. obtaining in F_sample a subset F_select of size k, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)

where the T() function is a value evaluation function, used to evaluate the 'value' of the set of sampling subtrees represented by the argument F_sub for reaching a WebShell conclusion, that is, the amount of information F_sub can contribute to the WebShell detection conclusion; the value evaluation function T() is defined as follows:

T(F_select) = ω_1·σ(F_select) + ω_2·δ(F_select) + ω_3·π(F_select)

where ω_1, ω_2 and ω_3 are constants; the coverage function σ(), the suspicion function δ() and the diversity function π() all take values in [0, 1] and are used to measure, respectively, the coverage, suspicion and diversity of F_select; the value evaluation function T() is the linear sum of the three kinds of index values.
6. The WebShell detection method of claim 5, wherein, in the definition of the value evaluation function T(), ω_1, ω_2 and ω_3 are constants; preferably, ω_1, ω_2 and ω_3 are all set to 1.
7. The WebShell detection method of claim 5, wherein the coverage function σ() is the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |∪_{T_sample^i ∈ F_select} V_sample^i| / |V|

where V_sample^i is the node set of T_sample^i;
the suspicion function δ() is defined as follows: suppose T_sample^i and T_sample^j are two n-node sampling subtrees of the abstract syntax tree T; if T_sample^i corresponds exactly to the WebShell functional part of the source code and T_sample^j corresponds to a non-malicious obfuscated code part, then in the WebShell detection problem T_sample^i is more 'suspicious' than T_sample^j; similarly, a node v_i can be more 'suspicious' than a node v_j; the suspicion of a node v is defined as:

δ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v occurs in all WebShell scripts in the training set, and c_v^All is the number of times v occurs in all scripts in the training set; the suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicion of all its nodes:

δ(T_sample) = (1 / |V_sample|) · Σ_{v ∈ V_sample} δ(v)

accordingly, the suspicion of F_select is defined as the mean of the suspicion of all its n-node sampling subtrees:

δ(F_select) = (1 / |F_select|) · Σ_{T_sample ∈ F_select} δ(T_sample)

the diversity function π() is defined as follows: the diversity of F_select is the (normalized) average of the pairwise tree distances between the sampling subtrees in F_select, where the distance between two trees is calculated by Tree_Diversity().
8. The WebShell detection method as recited in claim 1, wherein in step C2, assuming that the root node of the tree T = (V, E) is r, the set of child nodes of r is C = {c_1, c_2, …, c_i, …, c_|C|}, and the set of corresponding subtrees is F = {T^(c_1), T^(c_2), …, T^(c_i), …, T^(c_|C|)}, where c_i is the root node of T^(c_i); the tree T is represented in vectorized form by formula 1:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))   (formula 1)

where φ() denotes an activation function; W_root, W_pickup and W_subtree are parameters; Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary subtree in F sequentially into the recursive long short-term memory layer, expressed by formula 2:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))   (formula 2)
9. The WebShell detection method of claim 1, wherein in step C34, the fully connected layer uses the feature vector Feature_A to make the WebShell decision; specifically, given a decision threshold, when the output computed from Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
10. A WebShell detection system implemented by the WebShell detection method of any one of claims 1-9, comprising a preprocessing module, a sample generation module and a detection module;
the preprocessing module takes script source codes as input by using a syntax analyzer and outputs an abstract syntax tree through syntax analysis;
the sample generation module is configured to translate an abstract syntax tree into vector expressions that facilitate training and prediction by a detection module, and includes: performing vectorization representation on leaf nodes by adopting feature engineering and utilizing a simple matching rule and a statistic calculation method; limiting the scale of an abstract syntax tree part consisting of intermediate nodes through a sampling algorithm, and replacing an original abstract syntax tree by using a group of sampling subtrees with smaller scale;
the detection module is a deep neural network model that constructs a recursive recurrent neural network, with a user-defined recursive long short-term memory layer providing a bottom-up operation on the tree structure; the bottom of the recursive recurrent neural network consists of k Recursive_LSTM layers with shared parameters, whose inputs are the k tree structures; the operation results are processed by a pooling layer, concatenated with the vector representation of the leaf nodes, and finally fed into a fully connected layer for WebShell detection.
CN201710705914.1A 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network Active CN107516041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN107516041A true CN107516041A (en) 2017-12-26
CN107516041B CN107516041B (en) 2020-04-03

Family

ID=60723188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710705914.1A Active CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN107516041B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895420A (en) * 2010-07-12 2010-11-24 西北工业大学 Rapid detection method for network flow anomaly
CN103971054A (en) * 2014-04-25 2014-08-06 天津大学 Detection method for browser extension vulnerabilities based on behavior sequences
CN105069355A (en) * 2015-08-26 2015-11-18 厦门市美亚柏科信息股份有限公司 Static detection method and apparatus for deformed webshells

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376283A (en) * 2018-01-08 2018-08-07 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108376283B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 Method for automatically completing codes based on LSTM
GB2585616A (en) * 2018-04-16 2021-01-13 Ibm Using gradients to detect backdoors in neural networks
WO2019202436A1 (en) * 2018-04-16 2019-10-24 International Business Machines Corporation Using gradients to detect backdoors in neural networks
US11132444B2 (en) 2018-04-16 2021-09-28 International Business Machines Corporation Using gradients to detect backdoors in neural networks
CN110502897A (en) * 2018-05-16 2019-11-26 南京大学 Webpage malicious JavaScript code identification and de-obfuscation method based on hybrid analysis
CN109101235A (en) * 2018-06-05 2018-12-28 北京航空航天大学 Intelligent analysis method for software program
CN109101235B (en) * 2018-06-05 2021-03-19 北京航空航天大学 Intelligent analysis method for software program
CN108898015B (en) * 2018-06-26 2021-07-27 暨南大学 Application layer dynamic intrusion detection system and detection method based on artificial intelligence
CN108898015A (en) * 2018-06-26 2018-11-27 暨南大学 Application layer dynamic intrusion detection system and detection method based on artificial intelligence
CN108985061A (en) * 2018-07-05 2018-12-11 北京大学 A kind of webshell detection method based on Model Fusion
CN109120617A (en) * 2018-08-16 2019-01-01 辽宁大学 Polymorphic worm detection method based on frequency CNN
CN109120617B (en) * 2018-08-16 2020-11-17 辽宁大学 Polymorphic worm detection method based on frequency CNN
CN109240922A (en) * 2018-08-30 2019-01-18 北京大学 Method for webshell detection by extracting webshell software genes based on RASP
CN109462575A (en) * 2018-09-28 2019-03-12 东巽科技(北京)有限公司 A kind of webshell detection method and device
CN109462575B (en) * 2018-09-28 2021-09-07 东巽科技(北京)有限公司 Webshell detection method and device
CN109657466A (en) * 2018-11-26 2019-04-19 杭州英视信息科技有限公司 Function-level software vulnerability detection method
CN109635563A (en) * 2018-11-30 2019-04-16 北京奇虎科技有限公司 Method, apparatus, device and storage medium for identifying malicious applications
CN109684844A (en) * 2018-12-27 2019-04-26 北京神州绿盟信息安全科技股份有限公司 A kind of webshell detection method and device
CN109684844B (en) * 2018-12-27 2020-11-20 北京神州绿盟信息安全科技股份有限公司 Webshell detection method and device, computing equipment and computer-readable storage medium
CN109905385A (en) * 2019-02-19 2019-06-18 中国银行股份有限公司 A kind of webshell detection method, apparatus and system
CN109905385B (en) * 2019-02-19 2021-08-20 中国银行股份有限公司 Webshell detection method, device and system
CN111614599A (en) * 2019-02-25 2020-09-01 北京金睛云华科技有限公司 Webshell detection method and device based on artificial intelligence
CN111611150A (en) * 2019-02-25 2020-09-01 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111611150B (en) * 2019-02-25 2024-03-22 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111614599B (en) * 2019-02-25 2022-06-14 北京金睛云华科技有限公司 Webshell detection method and device based on artificial intelligence
CN109933602A (en) * 2019-02-28 2019-06-25 武汉大学 A kind of conversion method and device of natural language and structured query language
CN109933602B (en) * 2019-02-28 2021-05-04 武汉大学 Method and device for converting natural language and structured query language
CN110086788A (en) * 2019-04-17 2019-08-02 杭州安恒信息技术股份有限公司 Deep learning WebShell protection method based on cloud WAF
CN110232280A (en) * 2019-06-20 2019-09-13 北京理工大学 Software security vulnerability detection method based on tree structure convolutional neural network
CN110232280B (en) * 2019-06-20 2021-04-13 北京理工大学 Software security vulnerability detection method based on tree structure convolutional neural network
WO2020259260A1 (en) * 2019-06-28 2020-12-30 华为技术有限公司 Structured query language (sql) injection detecting method and device
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN110855661A (en) * 2019-11-11 2020-02-28 杭州安恒信息技术股份有限公司 WebShell detection method, device, equipment and medium
CN110855661B (en) * 2019-11-11 2022-05-13 杭州安恒信息技术股份有限公司 WebShell detection method, device, equipment and medium
CN111198817A (en) * 2019-12-30 2020-05-26 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN111198817B (en) * 2019-12-30 2021-06-04 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN113094706A (en) * 2020-01-08 2021-07-09 深信服科技股份有限公司 WebShell detection method, device, equipment and readable storage medium
CN111741002B (en) * 2020-06-23 2022-02-15 广东工业大学 Method and device for training network intrusion detection model
CN111741002A (en) * 2020-06-23 2020-10-02 广东工业大学 Method and device for training network intrusion detection model
CN112118225A (en) * 2020-08-13 2020-12-22 紫光云(南京)数字技术有限公司 Webshell detection method and device based on RNN
CN112035099A (en) * 2020-09-01 2020-12-04 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112035099B (en) * 2020-09-01 2024-03-15 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112132262B (en) * 2020-09-08 2022-05-20 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN112132262A (en) * 2020-09-08 2020-12-25 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN112487368B (en) * 2020-12-21 2023-05-05 中国人民解放军陆军炮兵防空兵学院 Function level confusion detection method based on graph convolution network
CN112487368A (en) * 2020-12-21 2021-03-12 中国人民解放军陆军炮兵防空兵学院 Function level confusion detection method based on graph convolution network
CN113190849A (en) * 2021-04-28 2021-07-30 重庆邮电大学 Webshell script detection method and device, electronic equipment and storage medium
CN113190849B (en) * 2021-04-28 2023-03-03 重庆邮电大学 Webshell script detection method and device, electronic equipment and storage medium
EP4105802A1 (en) * 2021-06-17 2022-12-21 Cylance Inc. Method, computer-readable medium and system to detect malicious software in hierarchically structured files
CN113810375A (en) * 2021-08-13 2021-12-17 网宿科技股份有限公司 Webshell detection method, device and equipment and readable storage medium
CN114462033A (en) * 2021-12-21 2022-05-10 天翼云科技有限公司 Method and device for constructing script file detection model and storage medium
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114499944A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Method, device and equipment for detecting WebShell

Also Published As

Publication number Publication date
CN107516041B (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN107516041B (en) WebShell detection method and system based on deep neural network
CN111639344B (en) Vulnerability detection method and device based on neural network
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
WO2020259260A1 (en) Structured query language (sql) injection detecting method and device
CN111600919B (en) Method and device for constructing intelligent network application protection system model
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN111597803B (en) Element extraction method and device, electronic equipment and storage medium
CN111737289B (en) Method and device for detecting SQL injection attack
CN108664512B (en) Text object classification method and device
CN111931935B (en) Network security knowledge extraction method and device based on One-shot learning
CN110191096A (en) Word vector webpage intrusion detection method based on semantic analysis
CN111090860A (en) Code vulnerability detection method and device based on deep learning
CN109067708B (en) Method, device, equipment and storage medium for detecting webpage backdoor
CN111758098A (en) Named entity identification and extraction using genetic programming
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
Wang et al. File fragment type identification with convolutional neural networks
CN114329474A (en) Malicious software detection method integrating machine learning and deep learning
CN112966507A (en) Method, device, equipment and storage medium for constructing recognition model and identifying attack
CN113971283A (en) Malicious application program detection method and device based on features
CN116226864A (en) Network security-oriented code vulnerability detection method and system
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN114722389A (en) Webshell file detection method and device, electronic device and readable storage medium
An et al. Deep learning based webshell detection coping with long text and lexical ambiguity
CN113259369A (en) Data set authentication method and system based on machine learning member inference attack
Miao et al. AST2Vec: A Robust Neural Code Representation for Malicious PowerShell Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant