CN107516041B - WebShell detection method and system based on deep neural network - Google Patents


Publication number
CN107516041B
Authority
CN
China
Prior art keywords
tree
abstract syntax
webshell
syntax tree
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710705914.1A
Other languages
Chinese (zh)
Other versions
CN107516041A (en)
Inventor
张涛 (Zhang Tao)
齐龙晨 (Qi Longchen)
宁戈 (Ning Ge)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Anpro Information Technology Co ltd
Original Assignee
Beijing Anpro Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Anpro Information Technology Co ltd filed Critical Beijing Anpro Information Technology Co ltd
Priority to CN201710705914.1A
Publication of CN107516041A
Application granted
Publication of CN107516041B
Legal status: Active (current)
Anticipated expiration: legal status pending

Classifications

    • G06F21/563 — Computer malware detection: static detection by source code analysis
    • G06F11/3688 — Software testing: test management for test execution, e.g. scheduling of test suites
    • G06F21/566 — Computer malware detection: dynamic detection performed at run-time, e.g. emulation, suspicious activities
    • G06F8/42 — Compilation: syntactic analysis
    • G06F8/425 — Compilation: lexical analysis
    • G06N3/045 — Neural networks: combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a WebShell detection method and system based on a deep neural network. For scripting languages, a recursive recurrent neural network based on the abstract syntax tree automatically acquires the lexical and syntactic information of a script and completes feature extraction and WebShell detection using the hierarchical structure of the abstract syntax tree. The process comprises preprocessing, sample generation, and WebShell detection: the lexical and syntactic information of a script is first acquired automatically, after which feature extraction and WebShell detection are completed by the recursive recurrent neural network over the abstract syntax tree. The method has low deployment cost, good portability, and high detection accuracy.

Description

WebShell detection method and system based on deep neural network
Technical Field
The invention relates to the technical field of information security, and in particular to a WebShell detection method and system using a recursive recurrent neural network based on the abstract syntax tree.
Background
WebShell is a command-execution environment in the form of a web page, often used by intruders as a backdoor tool for operating web servers. Through a WebShell, an attacker obtains administrative authority over the Web service and thereby penetrates and controls the Web application.
Because the characteristics of a WebShell are almost identical to those of an ordinary Web page, it can evade detection by traditional firewalls and antivirus software. Moreover, as various feature-obfuscation and hiding techniques are applied to WebShells to resist detection, traditional detection based on signature matching struggles to catch new variants in time.
From the attacker's perspective, a WebShell is a script Trojan backdoor written in asp, aspx, php, jsp, or the like. After invading a website, an attacker typically uploads such script files to a directory of the Web server. By accessing the script file through a browser, the attacker can control the Web server, for example reading data from the website database or deleting files on the server; with sufficiently high Web privileges the attacker can even run system commands directly.
Existing WebShell detection methods are white-box methods, i.e. they operate on the source code of the WebShell script file, and can be divided into host-based detection and network-based detection.
Host-based detection: among these methods, the one most common in industry is to use known keywords directly as features, searching for suspicious files with grep statements and then analyzing them manually, or to periodically check the MD5 values of existing files and check whether new files have been generated. This intuitive approach is easily circumvented by attackers using obfuscation.
Network-based detection: existing methods mainly configure an intrusion detection system, such as a WAF, at the network entrance to detect WebShells, judging whether an attacker is uploading HTML or script files by checking the traffic for special keywords (e.g., <form, <%, <?). This approach is expensive and prone to false alarms; moreover, it can only detect the act of uploading a WebShell and cannot detect WebShells that already exist on the server.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a WebShell detection method and system using a recursive recurrent neural network based on the abstract syntax tree. The invention combines programming-language processing with deep learning: the lexical and syntactic information of a script is acquired automatically through programming-language processing, and feature extraction and WebShell detection are completed by a deep neural network. The method targets mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby, and the like. The system comprises three modules: a preprocessing module using programming-language processing, a sample generation module that completes the vectorized representation, and a detection module using deep learning. The method has low deployment cost, good portability, and high detection accuracy.
The following definitions of typical neural-network terms are used:
The operation of a neural-network layer can be defined as:

o^(i) = φ(W^(i) x^(i) + b^(i))

where o^(i) is the output vector of the i-th layer of the network, whose dimension equals the number of neurons (network nodes) in that layer; x^(i) is the output of layer i−1, used as the input of layer i; W^(i) and b^(i) are the parameters of layer i; and φ is the activation function, generally a non-linear function. Such a layer is called a fully connected layer.
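As a minimal illustration of the layer defined above, a fully connected layer with a tanh activation can be sketched in plain Python; the weights and inputs here are arbitrary toy values, not parameters from the patent:

```python
import math

def fully_connected(x, W, b):
    """o = tanh(W·x + b): one fully connected layer with a tanh activation."""
    return [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# A layer with 2 neurons taking a 3-dimensional input.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
o = fully_connected([1.0, 2.0, 3.0], W, b)
print(len(o))  # the dimension of o equals the number of neurons
```

The output dimension is determined by the number of rows of W, matching the statement that the dimension of o^(i) is the number of neurons.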
A Recurrent Neural Network (RNN) processes sequence inputs: it consumes one element of the input sequence at a time while maintaining, in a hidden unit, the history of all past sequence elements.
The recurrent layer is computed as:

s_t = φ(U x_t + W s_{t−1})
o_t = V s_t

where x_t is the input vector, s_t the hidden-unit vector, and o_t the output vector; W, U, and V are parameters, and φ is an activation function.
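The recurrence above can be sketched in the same toy style; the parameter matrices U, W, V and the two-element input sequence are illustrative values only:

```python
import math

def rnn_step(x_t, s_prev, U, W, V):
    """One recurrent step: s_t = tanh(U·x_t + W·s_prev); o_t = V·s_t."""
    dot = lambda M, v: [sum(m * vv for m, vv in zip(row, v)) for row in M]
    s_t = [math.tanh(a + b) for a, b in zip(dot(U, x_t), dot(W, s_prev))]
    o_t = dot(V, s_t)
    return s_t, o_t

U = [[0.4, 0.1], [0.2, -0.3]]   # input  -> hidden
W = [[0.5, 0.0], [0.0, 0.5]]    # hidden -> hidden
V = [[1.0, -1.0]]               # hidden -> output

s = [0.0, 0.0]                  # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:   # a two-element input sequence
    s, o = rnn_step(x, s, U, W, V)
print(len(s), len(o))
```

The hidden state s carries the history of past sequence elements forward from one step to the next, which is the property the detection model later reuses over subtree sequences.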
The Pooling Layer first appeared in convolutional neural networks as a down-sampling window sliding over the input matrix. At each position the pooling layer down-samples the corresponding sub-matrix with a sampling function, then slides to the next position by a specified stride until the whole input matrix has been sampled, and finally passes the matrix of sampling results to the next layer. The most common sampling functions are maximum, minimum, and mean sampling.
The Concatenation Layer is responsible for merging the k input vectors into one output vector, namely:

o = i_1 & i_2 & … & i_k

where & denotes the concatenation operator.
The technical scheme provided by the invention is as follows:
A WebShell detection method based on a deep neural network which, using a Recursive Recurrent Neural Network based on the abstract syntax tree (AST_RRNN), exploits the hierarchical structure of the abstract syntax tree to perform WebShell detection on mainstream scripting languages. The method comprises a preprocessing process, a sample generation process, and a detection process, specifically the following steps (the scheme's flow is shown in Figure 1):
A. The script file is first preprocessed. The preprocessing module comprises a lexical analyzer, a syntax analyzer, and a simplification step; its input is the script source code and its output is an abstract syntax tree (AST). The specific steps are:
A1. perform lexical analysis on the program code to generate a token stream;
A2. the syntax analyzer parses the token stream to construct an abstract syntax tree;
A3. simplification: after syntax analysis, filter out semantically irrelevant information such as comments.
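The patent's pipeline uses a PHP parser, but the same three steps can be illustrated with Python's own standard-library tokenizer and parser as a stand-in; the sample source line is invented, and note that `ast.parse` already discards comments, which plays the role of the simplification step here:

```python
import ast
import io
import tokenize

source = "x = eval(input())  # a suspicious-looking line\n"

# A1. Lexical analysis: the source becomes a stream of lexical units (tokens).
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([t.string for t in tokens if t.string.strip()])

# A2. Syntax analysis: the token stream becomes an abstract syntax tree.
tree = ast.parse(source)

# A3. Simplification: the AST keeps only syntactically relevant nodes;
# the comment is already absent from the tree.
node_types = [type(n).__name__ for n in ast.walk(tree)]
print(node_types)
```

The hierarchy of `node_types` (Module, Assign, Call, …) is exactly the kind of structural information the method later feeds to the neural network.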
B. Sample generation. The sample generation module of the AST_RRNN WebShell detection method takes two kinds of input: the simplified AST and the AST's leaf nodes. The module is responsible for converting the abstract syntax tree into a vectorized representation that facilitates training and prediction by the detection module. However, because differences in the size of the abstract syntax tree (the number of its nodes) adversely affect training and prediction, the tree must be compressed before vectorization. The steps are as follows:
B1. Compression of the abstract syntax tree mainly uses the concepts and methods of n-node sampling subtrees and m-ary tree transformation to limit the tree's size; in addition, a simple feature-engineering method completes the vectorized representation of the leaf nodes.
B2. Vectorized representation of tree nodes. One-hot encoding is adopted as the most intuitive vectorization method: a node v of the abstract syntax tree vectorized by one-hot encoding is written One_Hot(v) and represents the node type of v; an abstract syntax tree T vectorized with a bag-of-words model is written BoW(T) and represents the number of nodes of each type in T.
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, strings, and so on. The method focuses only on string scalar nodes (Scalar_String), from which danger-function features and string statistical features are extracted.
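The one-hot and bag-of-words encodings named in B2 can be sketched over a toy node-type vocabulary; the type names below are illustrative, not the patent's actual node set:

```python
from collections import Counter

NODE_TYPES = ["Stmt_Expression", "Expr_FuncCall", "Scalar_String", "Expr_Variable"]

def one_hot(v):
    """One_Hot(v): a 0/1 vector marking the node type of v."""
    return [1 if t == v else 0 for t in NODE_TYPES]

def bow(tree_node_types):
    """BoW(T): counts of each node type occurring in tree T."""
    counts = Counter(tree_node_types)
    return [counts[t] for t in NODE_TYPES]

print(one_hot("Expr_FuncCall"))
print(bow(["Stmt_Expression", "Expr_FuncCall", "Expr_FuncCall", "Scalar_String"]))
```

Both encodings share the same fixed vocabulary order, so their vectors can later be compared and subtracted position by position.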
C. Detection process: a deep neural network is adopted as the detection module, specifically a recursive recurrent neural network matched to the tree structure of the abstract syntax tree. The steps are as follows:
C1. For the tree structure, the scheme defines a neural-network layer: the Recursive Long Short-Term Memory layer (Recursive_LSTM). The Recursive_LSTM layer exploits the recursive nature of trees: the vector representation of a tree is generated by a non-linear operation from the vector representations of its root node and of its set of subtrees.
C2. The vectorized representation of the root node in the tree structure is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the Recursive_LSTM layer. Let the root node of the tree T = (V, E) be r, the set of r's child nodes be C = {c_1, c_2, …, c_i, …, c_|C|}, and the corresponding set of subtrees be F = {T^(c_1), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed by Formula 1:

Encode(T) = φ(W_root · one_hot(r) + W_pickup · pickup(r) + W_subtree · Encode(F))   (Formula 1)

where φ denotes an activation function and W_root, W_pickup, and W_subtree are parameters. Encode(F) is the final output obtained by feeding the vectorized representation of each m-ary tree in F sequentially into the Recursive_LSTM layer, as in Formula 2:

Encode(F) = Recursive_LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))   (Formula 2)
C3. A Recursive Recurrent Neural Network (RRNN) built from Recursive_LSTM layers is designed as the detection module. The input of the RRNN has two parts: 1) k vectorized m-ary trees representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:
C31. The bottom of the RRNN consists of k weight-sharing Recursive_LSTM layers that process the k m-ary trees and output a k×d-dimensional feature matrix, written Feature_R = [f_1, f_2, …, f_k]^T.
C32. The Pooling Layer applies three down-sampling functions (maximum, minimum, and mean) column-wise to Feature_R. The pooling layer thus outputs three d-dimensional vectors, written Feature_P = [f_max, f_min, f_mean]^T.
C33. The Concatenation Layer joins Feature_P and the leaf-feature vector f_s into one vector, Feature_A = f_max & f_min & f_mean & f_s (& denotes concatenation); f_s carries the information entropy, longest word, index of coincidence, compression ratio, and danger-function features of the leaf nodes.
C34. A subsequent fully connected layer uses Feature_A to make the WebShell decision.
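Steps C31–C33 can be sketched as plain list operations; the sizes k and d, the matrix entries, and the leaf-feature vector f_s below are toy values standing in for real network outputs:

```python
# Feature_R: a k x d matrix, as output by k weight-sharing recursive layers.
feature_R = [[0.2, 0.9, -0.1],
             [0.5, 0.4, 0.3],
             [-0.3, 0.7, 0.6]]          # k = 3 subtrees, d = 3

cols = list(zip(*feature_R))            # pool column-wise

f_max = [max(c) for c in cols]
f_min = [min(c) for c in cols]
f_mean = [sum(c) / len(c) for c in cols]

f_s = [0.8, 12.0]                       # illustrative leaf features

# Concatenation into Feature_A, to be consumed by a fully connected layer.
feature_A = f_max + f_min + f_mean + f_s
print(len(feature_A))                   # 3*d + len(f_s)
```

Pooling over the subtree axis makes the final vector's length independent of k, so the number of sampled subtrees can vary without changing the classifier's input size.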
Specifically, given a decision threshold, a file is identified as a WebShell file when the network output computed from Feature_A exceeds that threshold.
The decision threshold must be obtained by training: it is adjusted according to precision and recall rather than being a fixed value. During threshold training, let the precision be U and the recall be V. Precision U is the number of correctly extracted items divided by the total number of extracted items; recall V is the number of correctly extracted items divided by the number of items in the sample. Both take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall; in the implementation of the invention, a Precision-Recall curve is drawn to aid the adjustment analysis and parameter selection.
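Threshold tuning against precision and recall can be sketched with toy scores and labels; all values below are illustrative, not measurements from the patent:

```python
def precision_recall(scores, labels, threshold):
    """U = correct positives / predicted positives; V = correct positives / actual positives."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    precision = tp / sum(predicted) if any(predicted) else 0.0
    recall = tp / sum(labels)
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.2]     # toy network outputs
labels = [1, 1, 0, 1, 0]               # 1 = WebShell

for th in (0.3, 0.5, 0.7):
    u, v = precision_recall(scores, labels, th)
    print(f"threshold={th}: precision={u:.2f}, recall={v:.2f}")
```

Sweeping the threshold and recording (U, V) pairs traces exactly the Precision-Recall curve the text describes: raising the threshold here trades recall for precision.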
The invention thus discloses a recursive recurrent neural network detection method based on the abstract syntax tree. First, the script code is converted into an abstract syntax tree using a lexical analyzer and a parser. Second, a compression algorithm for the abstract syntax tree is proposed. Finally, targeting the structural characteristics of the abstract syntax tree, the invention provides a recursive recurrent neural network model as the detection module.
In another aspect, the present invention further provides a WebShell detection system using a recursive recurrent neural network based on the abstract syntax tree, the system comprising:
1. the preprocessing module, which takes script source code as input and, through syntax analysis by a parser, outputs an abstract syntax tree;
2. the sample generation module of the AST_RRNN method, responsible for converting the abstract syntax tree into a vector representation that facilitates training and prediction by the detection module. The vectorized representation of the abstract syntax tree has two parts: 1) the leaf nodes are vectorized by feature engineering, using simple matching rules and statistical calculations; 2) a sampling algorithm limits the scale of the part of the abstract syntax tree consisting of intermediate nodes; its basic idea is to replace the original abstract syntax tree with a group of smaller sampled subtrees;
3. the detection module, a deep neural network model constructed as a recursive recurrent neural network (RRNN). The custom recursive long short-term memory layer Recursive_LSTM provides a bottom-up mode of operation over the tree structure. Because the input of the RRNN is k tree structures, the bottom of the RRNN consists of k parameter-sharing Recursive_LSTM layers; their results are processed by the pooling layer, concatenated with the vector representation of the leaf nodes, and finally fed into the subsequent fully connected layer.
The invention has the beneficial effects that:
the invention provides a WebShell detection method and a WebShell detection system of a recurrent neural network based on an abstract syntax tree. The invention introduces a program language processing technology and a deep learning technology at the same time, automatically acquires lexical and grammatical information of a script by the program language processing technology aiming at mainstream script languages, including PHP, JavaScript, Perl, Python, Ruby and the like, and completes feature extraction and WebShell detection by utilizing a deep neural network. The WebShell detection by using the technical scheme provided by the invention has the following advantages:
1) the features are automatically extracted, and the dependence on feature engineering is avoided;
2) the portability is good, and the thought and the flow are suitable for any scripting language;
3) the static detection method can be deployed at a Web server side in a light weight mode, and is low in deployment and detection cost;
4) the detection accuracy is high: compared with various test modes, the WebShell detection method based on the recurrent neural network of the abstract syntax tree can effectively deal with some relatively new WebShell type files (such as 0dayWebShell), and has good searching and killing effects on some deformed, encrypted and existing WebShell files.
Drawings
Fig. 1 is a flow chart of the WebShell detection method provided by the present invention.
Fig. 2 is a block diagram of a flow of a preprocessing module in the WebShell file detection process in the embodiment of the present invention.
Fig. 3 is a block flow diagram of a sample generation module in the WebShell file detection process in the embodiment of the present invention.
Fig. 4 is a block diagram of a flow of a detection module in the WebShell file detection process in the embodiment of the present invention.
Fig. 5 is a block diagram of the system provided by the present invention.
Detailed Description
The invention will be further described by way of examples with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The invention provides a WebShell detection method and system using a recursive recurrent neural network based on the abstract syntax tree; the system comprises a preprocessing module, a sample generation module, and a detection module. Detection of a website's WebShell files is realized through the following process (PHP scripts are used as the example here; other scripting languages are handled in the same way):
A. The preprocessing module, comprising the lexical analyzer, the syntax analyzer, and the simplification step, works as follows (see Figure 2):
A1. the lexical analyzer takes a PHP file F containing program code (script source code) and produces a token stream WS after lexical analysis;
A2. the syntax analyzer PHP-Parser parses WS to construct the abstract syntax tree AST.
The parsing process typically filters out semantically irrelevant information, such as comments. Syntax analysis builds on lexical analysis, and its rules are stricter than the lexical rules. Compared with the token stream, the abstract syntax tree also reflects the code's structural information more accurately.
A3. Simplification of the abstract syntax tree. The tree produced by PHP-Parser is clear in structure but slightly redundant and needs to be simplified, as follows:
A31. delete all leaf nodes of the abstract syntax tree; at the same time, so that leaf-node features are not lost, the sample generation module vectorizes the leaf nodes with a simple feature-engineering method;
A32. among the intermediate nodes of the abstract syntax tree, retain only the declaration, expression, and scalar node types, ignoring the auxiliary types.
B. The sample generation module (detailed in Figure 3):
B1. Compression of the abstract syntax tree. Because differences in the size of the abstract syntax tree (the number of its nodes) adversely affect the training and prediction of the detection module, the tree must be compressed before vectorization. Its size is limited mainly through the concepts and methods of n-node sampling subtrees and m-ary tree transformation. The specific compression steps are:
B11. For any abstract syntax tree T = (V, E), call the n-node sampling-subtree algorithm K times. The result returned by each call is called a sampled subtree, so this step ultimately produces a set of K sampled subtrees, written

F_sample = {T_sample^(1), T_sample^(2), …, T_sample^(K)}

where the size of any T_sample^(i) ∈ F_sample does not exceed n.
B12. Within F_sample, determine a subset F_select of size k, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)
where T() is a value-evaluation function that evaluates the "value" of the sampled-subtree set F_sub to the WebShell conclusion, or equivalently the amount of information F_sub can contribute to the WebShell detection conclusion. The form and meaning of T() must be customized; here the value-evaluation function is defined as:

T(F_sub) = ω_1 · σ(F_sub) + ω_2 · δ(F_sub) + ω_3 · π(F_sub)

where ω_1, ω_2, and ω_3 are all set to 1 in this scheme, and σ(), δ(), and π() are three functions with value range [0,1] that measure the coverage, suspicion, and diversity of F_select respectively. T() is thus the linear sum of the three index values. Specifically:
Coverage: the σ() function. In this scheme, F_select is expected to contain as many of T's nodes as possible so as to retain more of T's information. Coverage is therefore defined as the ratio of the size of F_select's node set to |V|:

σ(F_select) = |∪ over {T_sample ∈ F_select} of V_sample| / |V|

where V_sample is the node set of T_sample.
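The coverage computation can be sketched directly over node-id sets; the node ids and tree size below are toy values:

```python
def coverage(f_select_node_sets, v_size):
    """sigma(F_select): size of the union of the sampled-subtree node sets over |V|."""
    covered = set().union(*f_select_node_sets)
    return len(covered) / v_size

# Tree T has 10 nodes; two sampled subtrees cover nodes {1..4} and {3..7}.
print(coverage([{1, 2, 3, 4}, {3, 4, 5, 6, 7}], 10))
```

Overlapping nodes are counted once through the union, so two subtrees that sample the same region score no better than one.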
Suspicion: the δ() function. Suppose T_sample^(1) and T_sample^(2) are two n-node sampled subtrees of the abstract syntax tree T. If T_sample^(1) corresponds exactly to the WebShell-functional part of the source code while T_sample^(2) corresponds to a harmless obfuscated code portion, then in the WebShell detection problem T_sample^(1) clearly has more "suspicion" than T_sample^(2). Similarly, a node v_i can have more suspicion than a node v_j. The suspicion of a node v is therefore defined as:

δ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v appears in all WebShell scripts in the training set, and c_v^All is the number of times v appears in all scripts in the training set. The suspicion of an n-node sampled subtree T_sample is defined as the mean suspicion of all its nodes:

δ(T_sample) = (1 / |V_sample|) Σ over {v ∈ V_sample} of δ(v)

Accordingly, the suspicion of F_select is defined as the mean suspicion of all its n-node sampled subtrees:

δ(F_select) = (1 / |F_select|) Σ over {T_sample ∈ F_select} of δ(T_sample)
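Node and subtree suspicion can be sketched with toy occurrence counts; the counts below are invented for illustration, not training-set statistics:

```python
def node_suspicion(c_webshell, c_all):
    """delta(v) = occurrences of v in WebShell scripts / occurrences in all scripts."""
    return c_webshell / c_all

def subtree_suspicion(node_counts):
    """Mean suspicion over a subtree's nodes; node_counts = [(c_ws, c_all), ...]."""
    return sum(node_suspicion(ws, a) for ws, a in node_counts) / len(node_counts)

# An eval-like node seen 40 of 50 times in WebShells; an echo-like node 5 of 100.
print(node_suspicion(40, 50))
print(subtree_suspicion([(40, 50), (5, 100)]))
```

A node type that occurs almost exclusively in WebShell training scripts approaches a suspicion of 1, pulling up the mean of any subtree that contains it.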
Diversity: the π() function. If T_sample^(1) and T_sample^(2) have almost the same node types and structure, then the set {T_sample^(1), T_sample^(2)} is unlikely to provide more useful information than {T_sample^(1)} alone. The sampled subtrees in F_select are therefore expected to be as dissimilar as possible. The diversity of F_select is defined over the pairwise distances between its sampled subtrees:

π(F_select) = (2 / (k(k−1))) Σ over {i < j} of Tree_Diversity(T_sample^(i), T_sample^(j))

where Tree_Diversity() is a tree-distance algorithm that computes the distance between two trees.
The tree-distance algorithm Tree_Diversity() is specified in the original document as a pseudo-code listing (reproduced there as figures).
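Because the Tree_Diversity listing is rendered as figures in the source, the sketch below substitutes one plausible notion of tree distance — the size of the symmetric difference of the two trees' node-type multisets — purely as an illustration, not the patent's actual algorithm:

```python
from collections import Counter

def tree_diversity(types_a, types_b):
    """Toy tree distance: symmetric difference of the node-type multisets."""
    ca, cb = Counter(types_a), Counter(types_b)
    return sum(((ca - cb) + (cb - ca)).values())

t1 = ["Stmt_Expression", "Expr_FuncCall", "Scalar_String"]
t2 = ["Stmt_Expression", "Expr_Variable", "Scalar_String", "Scalar_String"]
print(tree_diversity(t1, t2))
```

Identical trees score 0 and structurally unrelated trees score high, which is the property the diversity term needs; a production implementation would also weigh tree structure, e.g. via tree edit distance.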
B13. Apply the m-ary tree transformation algorithm to all the sampled subtrees in F_select, written

F_transfer = {m_ary(T_sample) | T_sample ∈ F_select}

The m-ary tree transformation limits the number of child nodes of any node to m and guarantees that a tree of any size n has size smaller than 2n after the transformation. F_transfer, rather than the abstract syntax tree T, is used as the input of the detection module.
The m-ary tree transformation algorithm. The idea is as follows: if the child-node set C of a node v exceeds the size limit, i.e. |C| > m, add a layer of padding nodes between v and C until the fan-out of v satisfies the limit. A padding node only reduces fan-out; it carries no syntactic or semantic information and is defined as the 0 vector in the vector representation. The specific algorithm is as follows:
(The pseudo-code appears as a figure in the original document.)
Let T_sample be an n-node sampled subtree and T_transfer the tree obtained from it by m-ary transformation; clearly no node of T_transfer has more than m children. The padding nodes introduced during the m-ary transformation are all internal nodes of T_transfer, so the number of padding nodes is necessarily smaller than |V_sample|, i.e. the size of T_transfer is smaller than 2|V_sample|.
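The padding idea described above can be sketched as follows; the (label, children) tuple encoding and the "PAD" label are illustrative choices, not the patent's representation:

```python
def m_ary_transform(node, m):
    """If a node has more than m children, insert layers of padding ("PAD")
    nodes between it and its children until the fan-out limit holds."""
    label, children = node
    children = [m_ary_transform(c, m) for c in children]
    while len(children) > m:
        # group the children under padding nodes, m per group
        children = [("PAD", children[i:i + m]) for i in range(0, len(children), m)]
    return (label, children)

# A root with 5 children, transformed into a binary (m = 2) tree.
leaf = lambda name: (name, [])
tree = ("root", [leaf(f"c{i}") for i in range(5)])
out = m_ary_transform(tree, 2)
print(len(out[1]))   # fan-out of the root after transformation
```

Each grouping pass divides the fan-out by m, so the added padding layers stay logarithmic in the original fan-out and the total padding stays below the original node count, consistent with the size bound stated above.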
At this point, the compression process for the abstract syntax tree is complete.
B2. Vectorized representation of tree nodes. The vectorization of tree nodes adopts one-hot encoding, the most intuitive vectorization method: a node v of the abstract syntax tree vectorized by one-hot encoding is written One_Hot(v) and represents the node type of v; an abstract syntax tree T vectorized with a bag-of-words model is written BoW(T) and represents the number of nodes of each type in T.
Let T_transfer = (V_transfer, E_transfer) be generated by m-ary transformation from an n-node sampled subtree of T = (V, E). For any non-padding node v in V_transfer, let T^(v) and T_transfer^(v) denote the subtrees of T and T_transfer rooted at v. In vectorization, the representation of v consists of two parts. The first part represents the type of node v using one-hot encoding:

Encode(v) = one_hot(v)

The second part represents the set of nodes of T^(v) that were not sampled into T_transfer^(v) — the nodes not "picked up" — and its calculation formula is:

pickup(v) = BoW(T^(v)) − BoW(T_transfer^(v))

For padding nodes, both parts are specified as 0 vectors.
B3. Vectorized representation of leaf nodes. Leaf nodes are all of scalar (Node_Scalar) type, including integers, floating-point numbers, strings, and so on. The method focuses only on string scalar nodes (Scalar_String), from which it extracts danger-function features and string statistical features.
A danger-function list is established for the scripting language, and each string scalar node is checked against this list to determine whether it contains a danger-function field. The danger-function features are vectorized with a bag-of-words model; the length of the feature vector equals the length of the danger-function list. The string statistical features are interpreted from a mathematical point of view: after a string has been obfuscated, encoded, encrypted or otherwise disguised, certain mathematical statistics of the string usually deviate from the probability distribution of strings in a normal script. This is also the rationale of NeoPi (an open-source tool published by Neohapsis on GitHub), a script tool written in Python that detects malicious code in text and script files using a variety of statistical methods, mainly by extracting the information entropy, longest word, index of coincidence, signatures and compression ratio of a file. The method selects four important indicators from NeoPi, namely string length, index of coincidence, information entropy and file compression ratio, and examines each string constant in the script.
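The danger-function feature can be sketched as a bag-of-words count over the list. The list below is illustrative only; the patent does not disclose its actual danger-function list:

```python
import re

# hypothetical danger-function list for PHP (illustrative, not the patent's list)
DANGER_FUNCS = ["eval", "assert", "system", "exec", "passthru", "base64_decode"]

def danger_features(string_constant):
    """Bag-of-words over the danger-function list: one occurrence count per entry,
    so the feature vector length equals the length of the list."""
    return [len(re.findall(re.escape(f), string_constant)) for f in DANGER_FUNCS]
```

A classic one-liner WebShell payload then lights up the corresponding positions of the vector.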
String length (Length of String). String constants in normal code are concise, whereas some WebShells embed code fragments into string constants, so long strings are more likely to appear in a WebShell script than in a normal one.
Index of Coincidence. The index of coincidence is one way to determine whether a file has been encrypted or encoded. It is calculated as:

IC(s) = Σ_i f_i(f_i - 1) / (N(N - 1))

where f_i is the number of occurrences of character i in the string s, and N is the length of the string. Statistically, the index of coincidence of meaningful English text is 0.0667, while that of a completely random English string is 0.0385. That is, when the index of coincidence of an English string is close to 0.0385, we tend to consider it encrypted or encoded, and further infer that the script is likely to be a WebShell.
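The index of coincidence is straightforward to compute; a minimal implementation of the formula above:

```python
from collections import Counter

def index_of_coincidence(s):
    """IC(s) = sum_i f_i * (f_i - 1) / (N * (N - 1)),
    where f_i counts character i in s and N = len(s)."""
    n = len(s)
    if n < 2:
        return 0.0  # IC is undefined for strings shorter than 2; treat as 0
    counts = Counter(s)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
```

A string of one repeated character gives IC = 1.0, while a string with no repeated characters gives IC = 0.0; encoded or encrypted strings tend toward the random end of this range.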
Information entropy (Entropy of Information). Information entropy is a basic concept in information theory and a measure of the degree of order of a system. It is calculated as:

H(s) = -Σ_i p_i · log p_i

where p_i is the proportion of character i appearing in the string s. When a string is pseudo-randomized by encryption or encoding, its information entropy rises; therefore, the larger the entropy value, the higher the possibility of a WebShell.
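The entropy formula above, computed over the character distribution of a string (log base 2 is an assumption; the patent does not specify the base):

```python
import math
from collections import Counter

def shannon_entropy(s):
    """H(s) = -sum_i p_i * log2(p_i), where p_i is the proportion of character i in s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
```

A constant string has entropy 0, a two-symbol balanced string has entropy 1 bit, and base64/encrypted blobs approach the maximum for their alphabet.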
File compression ratio, defined as the ratio of the uncompressed file size to the compressed file size. The essence of data compression is to eliminate imbalance in the distribution of characters, achieving length optimization by assigning short codes to high-frequency characters while low-frequency characters use long codes. A web page file encoded with base64, with non-ASCII characters removed, exhibits a smaller distribution imbalance and therefore a smaller compression ratio. The ratio is calculated as:

R(s) = length(s) / length(zip(s))

where zip() compresses the data and length() computes the data length.
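The ratio can be computed directly with the standard library; `zlib` stands in for the unspecified zip() of the formula:

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """R(s) = length(s) / length(zip(s)).
    Imbalanced (repetitive) data compresses well, giving a large ratio;
    encoded/encrypted data compresses poorly, giving a ratio near (or below) 1."""
    if not data:
        return 0.0
    return len(data) / len(zlib.compress(data))
```

For example, a kilobyte of a single repeated byte compresses to a few dozen bytes (ratio well above 10), whereas short or high-entropy inputs barely compress at all.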
C. Detection module (see FIG. 4). The method uses a deep neural network as the detection module; to handle the tree structure of the abstract syntax tree, it adopts a recursive recurrent neural network. The concrete steps are as follows:
C1. For the tree structure, the scheme defines a new neural network layer: the recursive long short-term memory layer (Recursive_LSTM). The basic idea of the Recursive_LSTM layer is to exploit the recursive nature of a tree: the vector representation of the tree is generated by a nonlinear operation from the vector representations of its root node and its set of subtrees.
C2. The root node is vectorized exactly as the tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into a long short-term memory (LSTM) layer. Formally, let the root node of the tree T = (V, E) be r, the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and the corresponding subtree set be F = {T^(c_1), T^(c_2), …, T^(c_|C|)}, where c_i is the root node of T^(c_i). The vectorized representation of T is computed as:

Encode(T) = act(W_root · Encode(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))

where act(·) denotes an activation function and W_root, W_pickup and W_subtree are parameters. Encode(F) is the final output obtained by sequentially inputting the vectorized representation of each m-ary tree in F into the LSTM layer:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))
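The recursive encoding of C2 can be sketched in a heavily simplified form. The dimension D, the tanh activation, and the plain recurrent cell standing in for the LSTM over subtrees are all assumptions; real weight matrices would be learned, not random:

```python
import numpy as np

D = 8  # feature dimension d (value is an assumption)
rng = np.random.default_rng(0)
# W_root, W_pickup, W_subtree as in Encode(T); W_rec stands in for the LSTM recurrence
W_root, W_pickup, W_subtree, W_rec = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

def encode_subtree_set(subtree_encodings):
    """Stand-in for Encode(F): fold the subtree vectors sequentially, LSTM-style."""
    h = np.zeros(D)
    for v in subtree_encodings:
        h = np.tanh(W_rec @ (h + v))
    return h

def encode_tree(tree):
    """Encode(T) = act(W_root·Encode(r) + W_pickup·Pickup(r) + W_subtree·Encode(F))."""
    f = encode_subtree_set([encode_tree(c) for c in tree["children"]])
    return np.tanh(W_root @ tree["encode"] + W_pickup @ tree["pickup"] + W_subtree @ f)
```

The recursion bottoms out at leaves (empty `children`), where Encode(F) is simply the zero vector, and each tree folds into a single d-dimensional vector.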
C3. A recursive recurrent neural network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN consists of two parts: 1) k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector, representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is described as follows:
C31. The bottom of the RRNN consists of k weight-sharing Recursive_LSTM layers, which process the corresponding k m-ary trees and output a k×d-dimensional feature through the operation, denoted Feature_R = [f_1, f_2, …, f_k]^T.

C32. The pooling layer applies three down-sampling functions (maximum, minimum and mean) simultaneously to Feature_R, performing the down-sampling (pooling) operation column by column. The pooling layer thus outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.

C33. The splicing layer splices Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s (& denotes concatenation).

C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision.
The decision threshold is obtained through training and is adjusted according to precision and recall; it is not a fixed value. During threshold training, let the precision be U and the recall be V: the precision U is the number of correctly extracted items divided by the number of extracted items, and the recall V is the number of correctly extracted items divided by the number of items in the sample. Both take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. Intuitively, precision measures how many of the retrieved items are correct, while recall measures how many of all correct items are retrieved. Ideally both should be as high as possible, but in some cases they are contradictory: in the extreme case where only one result is retrieved and it is correct, precision is 100% but recall is very low; conversely, if all results are returned, recall is 100% but precision is low. Therefore, depending on the situation, one must decide whether higher precision or higher recall is preferred, and adjust the decision threshold accordingly.
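The definitions of U and V above can be stated directly in code, including the two extreme cases the text describes:

```python
def precision_recall(predicted, relevant):
    """U (precision) = |predicted ∩ relevant| / |predicted|
       V (recall)    = |predicted ∩ relevant| / |relevant|"""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving a single correct item out of ten relevant ones gives (1.0, 0.1); returning everything gives recall 1.0 at the cost of precision, mirroring the trade-off the threshold must balance.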
The details of the RRNN are shown in Table 1-1. During training, the binary cross-entropy function is used as the loss function and stochastic gradient descent (SGD) as the training method, with 32 samples per batch and 1000 training iterations.
Table 1-1. Detailed parameters of the AST_RRNN detection module
The invention is further illustrated by the following examples.
Example:
This scheme adopts supervised training. The mainstream method for training deep neural networks is stochastic gradient descent (SGD) and its variants: a batch of training samples is fed into the neural network each time, and the network parameters are updated using the value of the objective function until that value converges. Concretely, every parameter in the network is moved a small step in the direction in which the objective function decreases (the direction opposite to the gradient).
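The parameter-update rule described above can be written as a one-line sketch (the learning rate and the toy objective below are illustrative assumptions):

```python
def sgd_step(params, grads, lr=0.01):
    """Move every parameter a small step against the gradient of the objective."""
    return [p - lr * g for p, g in zip(params, grads)]

# toy usage: minimize f(w) = w**2, whose gradient is 2w
w = 5.0
for _ in range(200):
    (w,) = sgd_step([w], [2 * w], lr=0.1)
```

Each step shrinks w by the factor (1 - 2·lr), so the objective converges toward its minimum at w = 0.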
The sample set selected for this example contains a large number of normal scripts and 6669 WebShell scripts. 100000 scripts were drawn from the normal set to train the tokens' word vectors. From the remaining normal scripts, 6669 were randomly extracted and, together with all the WebShell scripts, form the data set of the classification problem.
Tables 1-2. Training-set and test-set division of the example data set

                  Training set   Test set   Total
WebShell script   5187           1482       6669
Normal script     5187           1482       6669
1) First, the lexical analysis results of the 100000 PHP scripts are used as input;

2) the abstract syntax trees are generated with PHP-Parser;

3) four key parameters of the sample generation module are determined: ① n, the tree-size limit; ② m, the child-node limit; ③ K, the size of the sampling-subtree set; ④ k, the number of m-ary trees finally input. During RRNN training, for any abstract syntax tree T = (V, E), samples are constructed with n fixed at 1000, m fixed at 10, K = min(50, ⌈|V|/n⌉), and k = min(K, 10). After the RRNN model has been trained, three of the parameter values are fixed in each training run, different values are taken for the remaining parameter in turn, and the detection results are recorded.
4) During testing it was found that the detection effect of the AST_RRNN method generally improves as the values of n, m, K and k increase. Therefore, during detection, the values of these four parameters can be increased appropriately according to the size of the abstract syntax tree in order to improve detection accuracy.
The AST_RRNN method uses two classes of features: ① features extracted from leaf nodes; ② features extracted from the abstract syntax tree. Starting from the trained RRNN, its parameters were retrained and adjusted using the leaf-node features and the abstract-syntax-tree features separately:

1) accuracy 0.9886 using both leaf-node features and abstract-syntax-tree features as input;

2) accuracy 0.7649 when only leaf-node features are used;

3) accuracy 0.8659 when only abstract-syntax-tree features are used.

The detection effect with the abstract syntax tree as the feature is clearly higher than with the leaf nodes as the feature, which shows that the structural information in the abstract syntax tree is important for WebShell detection. Moreover, using either single feature alone, whether the abstract syntax tree or the leaf nodes, reduces the accuracy by at least 10%. The reason is that the leaf-node features describe the key information of the data-transfer part, while the abstract syntax tree is a precise description of the data-execution part; together they guarantee the detection result of the AST_RRNN method.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A WebShell detection method based on a deep neural network, characterized in that, for a scripting language, a recursive recurrent neural network based on an abstract syntax tree automatically acquires the lexical and syntactic information of a script, and completes feature extraction and WebShell detection by using the hierarchical structural features of the abstract syntax tree; the WebShell detection method comprises a preprocessing process, a sample generation process and a detection process, and specifically comprises the following steps:
A. the script file preprocessing process comprises the following steps:
the input is script source code; the preprocessing comprises lexical analysis, syntax analysis and simplification; the output is an abstract syntax tree T = (V, E), where V is the set of nodes in T and E is the set of edges in T;
B. and (3) a sample generation process:
the input comprises the simplified abstract syntax tree and the leaf nodes of the abstract syntax tree; the process comprises: compressing the abstract syntax tree and vectorizing it, where the vectorization of the abstract syntax tree comprises the vectorized representations of tree nodes and leaf nodes;
C. adopting a deep neural network to carry out WebShell detection: aiming at the tree structure of the abstract syntax tree, the deep neural network adopts a recursion cycle neural network; the method comprises the following steps:
C1. for the tree structure, defining a neural network layer called the recursive long short-term memory layer, which uses the recursive nature of the tree to generate the vector representation of the tree, through a nonlinear operation, from the vector representations of the root node and the subtree set of the tree;

C2. the root node of the tree structure is vectorized by the same method as the tree nodes in step B; the vector representation of the subtree set of the tree structure is generated by inputting the subtrees sequentially into the recursive long short-term memory layer;

C3. designing a recursive recurrent neural network RRNN as the detection module by using the recursive long short-term memory layer;
the inputs of the RRNN comprise: k vectorized m-ary trees, representing the intermediate nodes of the abstract syntax tree; a fixed-length vector, representing the leaf nodes of the abstract syntax tree; the operation process of the RRNN comprises the following steps:

C31. the bottom of the RRNN comprises k recursive long short-term memory layers sharing weights, which process the corresponding k m-ary trees and output a k×d-dimensional feature through the calculation, denoted Feature_R = [f_1, f_2, …, f_k]^T, where f_k is the k-th d-dimensional feature vector;

C32. the pooling layer of the RRNN applies three down-sampling functions (maximum, minimum and mean) simultaneously to Feature_R, column by column; the pooling layer outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T, where f_max is the d-dimensional vector output by the pooling layer using the maximum sampling function, f_min is the d-dimensional vector output using the minimum sampling function, and f_mean is the d-dimensional vector output using the mean sampling function;

C33. the splicing layer of the RRNN splices Feature_P and the vector f_s corresponding to the leaf features into one vector, obtaining the spliced feature vector Feature_A = f_max & f_min & f_mean & f_s, where & denotes concatenation;

C34. the fully connected layer of the RRNN uses the feature vector Feature_A to make the WebShell decision.
2. The WebShell detection method as recited in claim 1, wherein the step of preprocessing the script file specifically comprises:
A1. performing lexical analysis on the program code to generate a lexical unit (token) stream;

A2. performing syntax analysis on the lexical unit stream to construct an abstract syntax tree;

A3. filtering the syntactically analysed lexical unit stream to remove semantically irrelevant information, so as to simplify the abstract syntax tree.
3. The WebShell detection method of claim 2, wherein the step a3 of simplifying the abstract syntax tree comprises the steps of:
A31. deleting all leaf nodes of the abstract syntax tree; meanwhile, when generating samples, the leaf nodes are vectorized by a simple feature-engineering method, so the leaf-node features are not lost;
A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
4. The WebShell detection method of claim 1, wherein the step B sample generation process comprises:
B1. compression of abstract syntax trees: limiting the size of the abstract syntax tree by using an n-node sampling sub-tree and an m-ary tree transformation method; vectorization representation of the leaf nodes can be completed by utilizing a characteristic engineering method;
B2. vectorization represents tree nodes: the vectorization coding method adopts a one-hot coding method, and adopts a node v of a one-hot coding vectorization abstract syntax tree, which is marked as one _ hot (v) and represents the node type of v; adopting a bag-of-words model vectorization abstract syntax tree T, recording as BoW (T), and representing the number of each type of nodes in the T;
B3. vectorization represents a leaf node: and extracting the leaf nodes of the character string scalar type to obtain a danger function characteristic and a character string statistical characteristic.
5. The WebShell detection method of claim 4, wherein step B1 includes the following steps:
B11. for any abstract syntax tree T = (V, E), calling the n-node sampling subtree algorithm K times, generating a sampling-subtree set of size K, denoted

F_sample = {T_sample^1, T_sample^2, …, T_sample^K}

where any T_sample^i ∈ F_sample satisfies that the scale of T_sample^i does not exceed n; T_sample^K is the subtree generated by the K-th call of the n-node sampling subtree algorithm;

B12. from F_sample, obtaining a subset F_select of size k, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies:

F_select = argmax over {F_sub ⊆ F_sample, |F_sub| = k} of T(F_sub)

where the function T() is a value evaluation function used to evaluate the "value" of the sampling-subtree set F_sub to the WebShell decision conclusion, or, equivalently, the amount of information F_sub can contribute to the WebShell detection conclusion; the value evaluation function T() is defined as:

T(F_select) = ω_1·σ(F_select) + ω_2·φ(F_select) + ω_3·π(F_select)

where ω_1, ω_2, ω_3 are constants; the coverage function σ(), the suspicion function φ() and the diversity function π() all take values in the interval [0, 1] and measure, respectively, the coverage, suspicion and diversity of F_select; the value evaluation function T() is the linear sum of the three metric values.
6. The WebShell detection method of claim 5, wherein, in the definition of the value evaluation function T(), ω_1, ω_2 and ω_3 are constants; preferably, ω_1, ω_2 and ω_3 are all set to 1.
7. The WebShell detection method of claim 5, wherein the coverage function σ() is the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |V_select| / |V|

where V_select = ∪_{T_sample ∈ F_select} V_sample is the node set of F_select;

the suspicion function φ() is defined as follows: suppose T_sample^1 and T_sample^2 are two n-node sampling subtrees of the abstract syntax tree T; if T_sample^1 exactly corresponds to the WebShell functional part of the source code, while T_sample^2 corresponds to obfuscated code without malice, then for WebShell detection T_sample^1 has more "suspicion" than T_sample^2; likewise, a node v_i may have more "suspicion" than a node v_j; the suspicion of a node v is defined as:

φ(v) = c_v^WebShell / c_v^All

where c_v^WebShell denotes the number of times v appears in all WebShell scripts of the training set, and c_v^All denotes the number of times v appears in all scripts of the training set; the suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicions of all its nodes:

φ(T_sample) = (1 / |V_sample|) · Σ_{v ∈ V_sample} φ(v)

accordingly, the suspicion of F_select is defined as the average suspicion of all the n-node sampling subtrees in F_select:

φ(F_select) = (1 / |F_select|) · Σ_{T_sample ∈ F_select} φ(T_sample)

the diversity function π() is defined in terms of the pairwise distances between the trees in F_select, where the distance between two trees is calculated by Tree_Diversity().
8. The WebShell detection method of claim 1, wherein in step C2, let the root node of the tree T = (V, E) be r, the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and the corresponding subtree set be F = {T^(c_1), T^(c_2), …, T^(c_|C|)}, where c_i is the root node of T^(c_i) and c_|C| is the last child node of the root node r; the tree T is vectorized by formula 1:

Encode(T) = act(W_root · Encode(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))    (formula 1)

where act(·) denotes an activation function; W_root, W_pickup and W_subtree are parameters; Encode(F) is the final output of the recursive long short-term memory layer into which the vectorized representation of each m-ary tree in F is input sequentially, expressed as formula 2:

Encode(F) = LSTM(Encode(T^(c_1)), Encode(T^(c_2)), …, Encode(T^(c_|C|)))    (formula 2)
9. The WebShell detection method of claim 1, wherein in step C34, the fully connected layer uses the feature vector Feature_A to make the WebShell decision; specifically, a decision threshold is given, and when the decision value computed from Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
10. A WebShell detection system implemented by the WebShell detection method of any one of claims 1-9, comprising a preprocessing module, a sample generation module and a detection module;

the preprocessing module uses a syntax analyzer, takes script source code as input, and outputs an abstract syntax tree through syntax analysis;
the sample generation module is configured to translate the abstract syntax tree into vector expressions that facilitate training and prediction by the detection module, which includes: vectorizing the leaf nodes by feature engineering, using simple matching rules and statistic calculation; limiting, through a sampling algorithm, the scale of the abstract-syntax-tree part composed of intermediate nodes, and replacing the original abstract syntax tree with a set of smaller sampling subtrees;
the detection module is a deep neural network model: it constructs a recursive recurrent neural network, defines a custom recursive long short-term memory (Recursive_LSTM) layer, and provides a bottom-up operation on the tree structure; the bottom of the recursive recurrent neural network consists of k Recursive_LSTM layers with shared parameters, whose input is k tree structures; the operation result, after being processed by a pooling layer, is spliced with the vector expression of the leaf nodes, and is finally input into a fully connected layer for WebShell detection.
CN201710705914.1A 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network Active CN107516041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710705914.1A CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN107516041A CN107516041A (en) 2017-12-26
CN107516041B true CN107516041B (en) 2020-04-03

Family

ID=60723188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710705914.1A Active CN107516041B (en) 2017-08-17 2017-08-17 WebShell detection method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN107516041B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376283B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
US11132444B2 (en) * 2018-04-16 2021-09-28 International Business Machines Corporation Using gradients to detect backdoors in neural networks
CN110502897A (en) * 2018-05-16 2019-11-26 南京大学 A kind of identification of webpage malicious JavaScript code and antialiasing method based on hybrid analysis
CN109101235B (en) * 2018-06-05 2021-03-19 北京航空航天大学 Intelligent analysis method for software program
CN108898015B (en) * 2018-06-26 2021-07-27 暨南大学 Application layer dynamic intrusion detection system and detection method based on artificial intelligence
CN108985061B (en) * 2018-07-05 2021-10-01 北京大学 Webshell detection method based on model fusion
CN109120617B (en) * 2018-08-16 2020-11-17 辽宁大学 Polymorphic worm detection method based on frequency CNN
CN109240922B (en) * 2018-08-30 2021-07-09 北京大学 Method for extracting webshell software gene to carry out webshell detection based on RASP
CN109462575B (en) * 2018-09-28 2021-09-07 东巽科技(北京)有限公司 Webshell detection method and device
CN109657466A (en) * 2018-11-26 2019-04-19 杭州英视信息科技有限公司 A kind of function grade software vulnerability detection method
CN109635563A (en) * 2018-11-30 2019-04-16 北京奇虎科技有限公司 The method, apparatus of malicious application, equipment and storage medium for identification
CN109684844B (en) * 2018-12-27 2020-11-20 北京神州绿盟信息安全科技股份有限公司 Webshell detection method and device, computing equipment and computer-readable storage medium
CN109905385B (en) * 2019-02-19 2021-08-20 中国银行股份有限公司 Webshell detection method, device and system
CN111614599B (en) * 2019-02-25 2022-06-14 北京金睛云华科技有限公司 Webshell detection method and device based on artificial intelligence
CN111611150B (en) * 2019-02-25 2024-03-22 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN109933602B (en) * 2019-02-28 2021-05-04 武汉大学 Method and device for converting natural language and structured query language
CN110086788A (en) * 2019-04-17 2019-08-02 杭州安恒信息技术股份有限公司 Deep learning WebShell means of defence based on cloud WAF
CN110232280B (en) * 2019-06-20 2021-04-13 北京理工大学 Software security vulnerability detection method based on tree structure convolutional neural network
CN110362597A (en) * 2019-06-28 2019-10-22 华为技术有限公司 A kind of structured query language SQL injection detection method and device
CN110855661B (en) * 2019-11-11 2022-05-13 杭州安恒信息技术股份有限公司 WebShell detection method, device, equipment and medium
CN111198817B (en) * 2019-12-30 2021-06-04 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN113094706A (en) * 2020-01-08 2021-07-09 深信服科技股份有限公司 WebShell detection method, device, equipment and readable storage medium
CN111741002B (en) * 2020-06-23 2022-02-15 广东工业大学 Method and device for training network intrusion detection model
CN112118225B (en) * 2020-08-13 2021-09-03 紫光云(南京)数字技术有限公司 Webshell detection method and device based on RNN
CN112035099B (en) * 2020-09-01 2024-03-15 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112132262B (en) * 2020-09-08 2022-05-20 西安交通大学 Recurrent neural network backdoor attack detection method based on interpretable model
CN112487368B (en) * 2020-12-21 2023-05-05 中国人民解放军陆军炮兵防空兵学院 Function level confusion detection method based on graph convolution network
CN113190849B (en) * 2021-04-28 2023-03-03 重庆邮电大学 Webshell script detection method and device, electronic equipment and storage medium
US20220405572A1 (en) * 2021-06-17 2022-12-22 Cylance Inc. Methods for converting hierarchical data
CN113810375B (en) * 2021-08-13 2023-01-20 网宿科技股份有限公司 Webshell detection method, device and equipment and readable storage medium
CN114462033A (en) * 2021-12-21 2022-05-10 天翼云科技有限公司 Method and device for constructing script file detection model and storage medium
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895420A (en) * 2010-07-12 2010-11-24 西北工业大学 Rapid detection method for network flow anomaly
CN103971054A (en) * 2014-04-25 2014-08-06 天津大学 Detecting method of browser extension loophole based on behavior sequence
CN105069355A (en) * 2015-08-26 2015-11-18 厦门市美亚柏科信息股份有限公司 Static detection method and apparatus for webshell deformation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895420A (en) * 2010-07-12 2010-11-24 西北工业大学 Rapid detection method for network flow anomaly
CN103971054A (en) * 2014-04-25 2014-08-06 天津大学 Detecting method of browser extension loophole based on behavior sequence
CN105069355A (en) * 2015-08-26 2015-11-18 厦门市美亚柏科信息股份有限公司 Static detection method and apparatus for webshell deformation

Also Published As

Publication number Publication date
CN107516041A (en) 2017-12-26

Similar Documents

Publication Publication Date Title
CN107516041B (en) WebShell detection method and system based on deep neural network
CN111639344B (en) Vulnerability detection method and device based on neural network
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
WO2020259260A1 (en) Structured query language (sql) injection detecting method and device
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN108170736B (en) Document rapid scanning qualitative method based on cyclic attention mechanism
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
Xiaomeng et al. CPGVA: code property graph based vulnerability analysis by deep learning
CN111600919B (en) Method and device for constructing intelligent network application protection system model
CN111737289B (en) Method and device for detecting SQL injection attack
CN107229563A (en) A kind of binary program leak function correlating method across framework
CN111597803B (en) Element extraction method and device, electronic equipment and storage medium
CN107341399A (en) Assess the method and device of code file security
CN113190849A (en) Webshell script detection method and device, electronic equipment and storage medium
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN114201406B (en) Code detection method, system, equipment and storage medium based on open source component
CN111758098A (en) Named entity identification and extraction using genetic programming
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN109067708B (en) Method, device, equipment and storage medium for detecting webpage backdoor
CN117370980A (en) Malicious code detection model generation and detection method, device, equipment and medium
CN117633811A (en) Code vulnerability detection method based on multi-view feature fusion
CN113971283A (en) Malicious application program detection method and device based on features
CN116226864A (en) Network security-oriented code vulnerability detection method and system
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
Jha et al. Deepmal4j: Java malware detection employing deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant