CN107516041B - WebShell detection method and system based on deep neural network - Google Patents
- Publication number: CN107516041B
- Application number: CN201710705914.1A
- Authority
- CN
- China
- Prior art keywords
- tree
- abstract syntax
- webshell
- syntax tree
- node
- Prior art date
- Legal status: Active (the status listed is an assumption, not a legal conclusion)
Classifications
- G06F21/563 — Static detection by source code analysis
- G06F11/3688 — Test management for test execution, e.g. scheduling of test suites
- G06F21/566 — Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
- G06F8/42 — Syntactic analysis
- G06F8/425 — Lexical analysis
- G06N3/045 — Combinations of networks
Abstract
The invention discloses a WebShell detection method and system based on a deep neural network. For scripting languages, a recursive recurrent neural network based on the abstract syntax tree automatically acquires the lexical and syntactic information of a script and completes feature extraction and WebShell detection by exploiting the hierarchical structure of the abstract syntax tree. The method comprises preprocessing, sample generation and WebShell detection: the lexical and syntactic information of the script is first acquired automatically, and then feature extraction and WebShell detection are completed with a recurrent neural network built on the abstract syntax tree. The method offers low deployment cost, good portability and high detection accuracy.
Description
Technical Field
The invention relates to the technical field of information security, and in particular to a WebShell detection method and system based on a recurrent neural network over abstract syntax trees.
Background
WebShell is a command execution environment in the form of a web page, often used by intruders as a backdoor tool for operating web servers. An attacker obtains the management authority of the Web service through the WebShell, so that penetration and control on Web application are achieved.
Since a WebShell is almost indistinguishable from an ordinary Web page, it can evade detection by traditional firewalls and antivirus software. Moreover, as attackers apply various feature obfuscation and hiding techniques to WebShells to resist detection, traditional detection based on signature matching struggles to catch new variants in time.
From the attacker's perspective, a WebShell is a script Trojan backdoor written in ASP, ASPX, PHP, JSP or the like. After breaking into a website, an attacker usually uploads such script files to a Web server directory. By accessing the script file through a browser, the attacker can control the Web server: reading data from the website database, deleting files on the server, and even running system commands directly if the Web privileges are high enough.
Existing WebShell detection methods are white-box methods, i.e. they inspect the source code of the WebShell script file. They can be divided into host-based detection and network-based detection.
Host-based detection: the most common industrial practice is to use known keywords directly as features, searching for suspicious files with grep commands and then analyzing them manually, or to periodically check the MD5 values of existing files and check whether new files have appeared. Such intuitive detection is easily circumvented by attackers using obfuscation.
Network-based detection: current methods mainly configure an intrusion detection system (a WAF) at the network entrance to detect WebShells, judging whether an attacker uploads HTML or script files by checking transmitted content for special keywords (e.g., <form, <%). This approach incurs high overhead and may raise false alarms; furthermore, it only detects the act of uploading a WebShell and cannot find WebShells already present on the server.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a WebShell detection method and system based on a recurrent neural network over abstract syntax trees. The invention combines programming language processing with deep learning: lexical and syntactic information of a script is acquired automatically through programming language processing, and feature extraction and WebShell detection are completed with a deep neural network. The method targets mainstream scripting languages, including PHP, JavaScript, Perl, Python, Ruby and the like. The system comprises three modules: a preprocessing module using programming language processing, a sample generation module completing the vectorized representation, and a detection module using deep learning. The method offers low deployment cost, good portability and high detection accuracy.
The following are several typical neural network model-related term definitions:
the operational formula of a neural network layer can be defined as:

o^(i) = φ(W^(i) x^(i) + b^(i))

where o^(i) is the output vector of the i-th layer of the neural network, whose dimension equals the number of neurons (network nodes) in that layer; x^(i) is the output of the (i-1)-th layer and serves as the input of the i-th layer; W^(i) and b^(i) are the parameters of the i-th layer; φ is the activation function, which is generally non-linear. Such a layer is called a fully connected layer.
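As a concrete illustration, a minimal sketch of one fully connected layer, o = φ(Wx + b), can be written as follows (the shapes and values are invented for the example, not taken from the patent):

```python
import numpy as np

def fully_connected(x, W, b, phi=np.tanh):
    """Compute the layer output o = phi(W @ x + b)."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # maps a 3-dim input to 2 neurons
b = np.zeros(2)
x = np.array([1.0, -0.5, 0.25])
o = fully_connected(x, W, b)      # 2-dim output, each entry in (-1, 1)
```

With tanh as φ, each output component stays strictly inside (-1, 1) for finite inputs.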
A Recurrent Neural Network (RNN) is used to process sequence inputs. The recurrent neural network processes one input sequence element at a time while maintaining the historical state of all past time sequence elements with one hidden unit.
The calculation formula of the recurrent neural network layer is:

s_t = φ(U x_t + W s_{t-1}),    o_t = V s_t

where x_t is the input vector, s_t the hidden unit vector and o_t the output vector; W, U and V are parameters and φ is an activation function.
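A minimal sketch of this recurrence (parameter matrices and the two-element input sequence are illustrative assumptions):

```python
import numpy as np

def rnn_forward(xs, U, W, V, phi=np.tanh):
    """s_t = phi(U x_t + W s_{t-1}), o_t = V s_t, over a sequence xs."""
    s = np.zeros(W.shape[0])      # initial hidden state s_0 = 0
    outputs = []
    for x in xs:                  # one sequence element per time step
        s = phi(U @ x + W @ s)
        outputs.append(V @ s)
    return outputs, s

U = np.eye(2)
W = 0.5 * np.eye(2)
V = np.eye(2)
outs, s_final = rnn_forward([np.array([1.0, 0.0]), np.array([0.0, 1.0])], U, W, V)
```

The single hidden state `s` carries the history of all earlier sequence elements, as described above.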
The pooling layer first appeared in convolutional neural networks as a down-sampling window sliding over the input matrix. At each position the pooling layer down-samples the corresponding sub-region of the matrix according to a sampling function, then slides to the next position with the specified stride until the whole input matrix has been sampled; the resulting matrix is output to the next layer. The most common sampling functions are maximum, minimum and mean sampling.
The splicing layer (Concatenation Layer) merges its k input vectors into one output vector:

o = i_1 & i_2 & … & i_k

where & denotes concatenation.
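A small sketch combining the pooling and splicing layers as they are used later in the scheme (the matrix values are invented for illustration):

```python
import numpy as np

def pool_and_splice(F, extra):
    """Column-wise max/min/mean pooling over a (k, d) feature matrix,
    then splicing (concatenation) with an extra feature vector."""
    pooled = np.concatenate([F.max(axis=0), F.min(axis=0), F.mean(axis=0)])
    return np.concatenate([pooled, extra])

F = np.array([[1.0, 2.0],
              [3.0, 0.0]])
v = pool_and_splice(F, np.array([5.0]))
# v is [3, 2, 1, 0, 2, 1, 5]: column max, min and mean, then the extra vector
```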
The technical scheme provided by the invention is as follows:
A WebShell detection method based on a deep neural network: using a recursive recurrent neural network over the abstract syntax tree (AST_RRNN, Recursive Recurrent Neural Network based on the Abstract Syntax Tree) and exploiting the hierarchical structure of the abstract syntax tree, it performs WebShell detection on mainstream scripting languages. The method comprises a preprocessing process, a sample generation process and a detection process, with the following specific steps (the overall flow is shown in figure 1):
A. The script file is first preprocessed. The preprocessing module comprises a lexical analyzer, a syntax analyzer and a simplification step; its input is the script source code and its output is an abstract syntax tree (AST). The specific steps are:
A1. performing lexical analysis on the program codes to generate a lexical unit stream;
A2. the syntax analyzer parses the lexical unit stream to construct an abstract syntax tree;
A3. a simplification step, which after parsing filters out semantically irrelevant information such as comments.
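The patent applies these steps to PHP via PHP-Parser; as a language-neutral sketch, Python's own `ast` module illustrates the same pipeline (lexing and parsing a script into an AST, with comments dropped exactly as in the simplification step):

```python
import ast

source = "import os\n# a comment, discarded by the parser\nos.system('id')\n"

# A1 + A2: lexical and syntax analysis produce the abstract syntax tree.
tree = ast.parse(source)

# A3 (simplification): comments never reach the AST; here we also keep
# only the node-type names, discarding auxiliary detail.
node_types = [type(n).__name__ for n in ast.walk(tree)]
```

The resulting node-type sequence ("Module", "Import", "Call", …) is exactly the kind of lexical/syntactic information the later vectorization steps consume.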
B. Sample generation. The sample generation module of the AST_RRNN WebShell detection method takes two kinds of input: the simplified AST and the AST leaf nodes. Because differences in abstract syntax tree size (the number of nodes of the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before vectorization. The sample generation module is responsible for converting the abstract syntax tree into a vectorized representation that facilitates training and prediction by the detection module. The compression of the abstract syntax tree proceeds as follows:
B1. compression of the abstract syntax tree mainly uses the concepts and methods of the n-node sampling subtree and the m-ary tree transformation to limit the size of the abstract syntax tree; in addition, a simple feature engineering method is used to complete the vectorized representation of the leaf nodes.
B2. vectorized representation of tree nodes. One-hot encoding is adopted as the most intuitive vectorization method: the one-hot encoding of a node v of the abstract syntax tree, denoted One_Hot(v), represents the node type of v. A bag-of-words model is adopted to vectorize an abstract syntax tree T, denoted BoW(T), representing the number of nodes of each type in T.
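A minimal sketch of the two encodings (the node-type vocabulary below is a hypothetical example, not the patent's actual type list):

```python
NODE_TYPES = ["Expr_FuncCall", "Scalar_String", "Stmt_Echo"]  # hypothetical vocabulary

def one_hot(node_type):
    """One_Hot(v): a unit vector marking the type of node v."""
    vec = [0] * len(NODE_TYPES)
    vec[NODE_TYPES.index(node_type)] = 1
    return vec

def bow(tree_node_types):
    """BoW(T): counts of each node type occurring in tree T."""
    vec = [0] * len(NODE_TYPES)
    for t in tree_node_types:
        vec[NODE_TYPES.index(t)] += 1
    return vec
```

For example, `one_hot("Scalar_String")` marks the second vocabulary slot, while `bow` over a whole tree accumulates a count per slot.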
B3. vectorized representation of leaf nodes. Leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, strings and the like. The method focuses only on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
C. Detection process: a deep neural network is adopted as the detection module. A recursive recurrent neural network is used to match the tree structure of the abstract syntax tree. The steps are as follows:

C1. for the tree structure, the scheme defines a neural network layer: the recursive long short-term memory layer (Recursive_LSTM). The Recursive_LSTM layer exploits the recursive nature of trees: the vector representation of a tree is generated by a non-linear operation from the vector representations of its root node and its set of subtrees.
C2. The vectorized representation of the root node in the tree structure is the same as the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by feeding the subtrees sequentially into the recursive long short-term memory layer. Let the root node of the tree T = (V, E) be r, the set of child nodes of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and the corresponding set of subtrees be F = {T_{c_1}, …, T_{c_|C|}}, where c_i is the root node of T_{c_i}. The vectorized representation of T is computed as (formula 1):

Encode(T) = φ(W_root · Encode(r) + W_pickup · Pickup(r) + W_subtree · Encode(F))

where φ is an activation function and W_root, W_pickup and W_subtree are parameters. Encode(F) is the final output obtained by sequentially feeding the vectorized representation of each m-ary tree in F into the Recursive_LSTM layer (formula 2):

Encode(F) = Recursive_LSTM(Encode(T_{c_1}), …, Encode(T_{c_|C|}))
C3. A recursive recurrent neural network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input of the RRNN has two parts: 1) k vectorized m-ary trees representing the intermediate nodes of the abstract syntax tree; 2) a fixed-length vector representing the leaf nodes of the abstract syntax tree. The operation of the RRNN is as follows:

C31. The bottom of the RRNN consists of k weight-sharing Recursive_LSTM layers, which process the k m-ary trees and output a k × d-dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The pooling layer applies three down-sampling functions (maximum, minimum and mean) to Feature_R column by column, and thus outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The splicing layer (Concatenation Layer) splices Feature_P and the vector f_s corresponding to the leaf features into one vector: Feature_A = f_max & f_min & f_mean & f_s (& denotes splicing), where f_s represents the information entropy, longest word, index of coincidence, compression ratio and danger-function features.
C34. The subsequent fully connected layer uses Feature_A to make the WebShell decision. Specifically, given a decision threshold, when the output computed from Feature_A exceeds the threshold, the file is identified as a WebShell file.
The decision threshold is obtained through training and is tuned according to precision and recall rather than being fixed. In threshold training, let precision be U and recall be V: precision U is the number of correctly extracted items divided by the number of extracted items; recall V is the number of correctly extracted items divided by the number of items in the sample. Both take values between 0 and 1; the closer to 1, the higher the precision or recall. The decision threshold can be adjusted based on precision and recall; in the implementation of the invention, a Precision-Recall curve is drawn to assist the analysis and parameter selection.
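The threshold-tuning step can be sketched as follows (scores and labels are made-up toy data, with 1 marking a WebShell sample):

```python
def precision_recall(scores, labels, threshold):
    """Precision U and recall V of a score-threshold decision rule."""
    tp = sum(s > threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s > threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s <= threshold and y == 1 for s, y in zip(scores, labels))
    u = tp / (tp + fp) if tp + fp else 0.0   # precision
    v = tp / (tp + fn) if tp + fn else 0.0   # recall
    return u, v

# Sweep candidate thresholds to trace a Precision-Recall curve.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
curve = [precision_recall(scores, labels, t) for t in (0.05, 0.5, 0.85)]
```

Raising the threshold trades recall for precision: here the lowest threshold yields (0.5, 1.0) and the highest (1.0, 0.5), and the operating point is picked from such a sweep.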
The invention discloses a recursive recurrent neural network detection method based on the abstract syntax tree. First, the script code is converted into an abstract syntax tree using lexical and syntax analyzers. Then a compression algorithm for the abstract syntax tree is devised. Finally, matching the structural characteristics of the abstract syntax tree, the invention provides a recurrent neural network model as the detection module.
In another aspect, the present invention further provides a WebShell detection system of a recurrent neural network based on an abstract syntax tree, where the system includes:
1. the preprocessing module, which takes the script source code as input and, through syntax analysis with a syntax analyzer, outputs an abstract syntax tree;
2. the sample generation module of the AST_RRNN method, responsible for converting the abstract syntax tree into a vector representation that facilitates training and prediction by the detection module. The vectorized representation of the abstract syntax tree has two parts: 1) leaf nodes are vectorized with feature engineering, using simple matching rules and statistical calculations; 2) a sampling algorithm limits the scale of the part of the abstract syntax tree formed by intermediate nodes, the basic idea being to replace the original abstract syntax tree with a group of smaller sampling subtrees;
3. the detection module, a deep neural network model that constructs a recursive recurrent neural network (RRNN). The custom recursive long short-term memory layer Recursive_LSTM provides a bottom-up mode of operation over the tree structure. Since the input of the RRNN consists of k tree structures, the bottom of the RRNN is k parameter-sharing Recursive_LSTM layers; their results are processed by the pooling layer, spliced with the vector representation of the leaf nodes, and finally fed into the subsequent fully connected layer.
The invention has the beneficial effects that:
the invention provides a WebShell detection method and a WebShell detection system of a recurrent neural network based on an abstract syntax tree. The invention introduces a program language processing technology and a deep learning technology at the same time, automatically acquires lexical and grammatical information of a script by the program language processing technology aiming at mainstream script languages, including PHP, JavaScript, Perl, Python, Ruby and the like, and completes feature extraction and WebShell detection by utilizing a deep neural network. The WebShell detection by using the technical scheme provided by the invention has the following advantages:
1) the features are automatically extracted, and the dependence on feature engineering is avoided;
2) the portability is good, and the thought and the flow are suitable for any scripting language;
3) the static detection method can be deployed at a Web server side in a light weight mode, and is low in deployment and detection cost;
4) the detection accuracy is high: compared with various tested approaches, the WebShell detection method based on a recurrent neural network over abstract syntax trees can effectively handle relatively new WebShell files (such as 0-day WebShells), and also detects deformed, encrypted and already-present WebShell files well.
Drawings
Fig. 1 is a flow chart of the WebShell detection method provided by the present invention.
Fig. 2 is a block diagram of a flow of a preprocessing module in the WebShell file detection process in the embodiment of the present invention.
Fig. 3 is a block flow diagram of a sample generation module in the WebShell file detection process in the embodiment of the present invention.
Fig. 4 is a block diagram of a flow of a detection module in the WebShell file detection process in the embodiment of the present invention.
Fig. 5 is a block diagram of the system provided by the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a WebShell detection method and system based on a recurrent neural network over abstract syntax trees; the system comprises a preprocessing module, a sample generation module and a detection module. The detection of WebShell files on a website is realized through the following process (PHP scripts are used as the example here; other scripting languages are handled the same way):
A. the preprocessing module, this part includes lexical analyzer, syntax analyzer, simplification step, specifically as follows (as fig. 2):
A1. lexical analysis: a PHP file F containing the program code (script source code) is lexically analyzed to generate a lexical unit stream WS;
A2. and (4) utilizing a syntax analyzer PHP-parser to perform syntax analysis on the WS to construct an abstract syntax tree AST.
The parsing process typically filters out semantically irrelevant information, such as comments. Syntax analysis builds on lexical analysis, and its rules are stricter. Compared with the lexical unit stream, the abstract syntax tree also reflects the structure of the code more accurately.
A3. The abstract syntax tree is simplified. The tree produced by PHP-Parser is clearly structured but slightly redundant, so it is simplified as follows:
A31. all leaf nodes of the abstract syntax tree are deleted; so that their features are not lost, the sample generation module vectorizes the leaf nodes with a simple feature engineering method;
A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
B. Sample generation module (as detailed in fig. 3):
B1. compression of the abstract syntax tree. Since differences in abstract syntax tree size (the number of nodes of the tree) adversely affect the training and prediction of the detection module, the abstract syntax tree must be compressed before vectorization. Its size is limited mainly by the concepts and methods of the n-node sampling subtree and the m-ary tree transformation. The specific compression steps are:
B11. for any abstract syntax tree T = (V, E), the n-node sampling subtree algorithm is called K times; each result it returns is called a sampling subtree. This step therefore finally produces a set of sampling subtrees of size K, denoted F_sample = {T_1, T_2, …, T_K}, where the size of any T_i ∈ F_sample does not exceed n.
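One simple way such an n-node sampling subtree could be realized is random frontier growth from the root (a sketch under that assumption; the patent's exact sampling algorithm is not reproduced here):

```python
import random

def sample_subtree(children, root, n, rng):
    """Grow a connected subtree of at most n nodes starting at the root
    by repeatedly absorbing a random node from the frontier."""
    picked = {root}
    frontier = list(children.get(root, []))
    while frontier and len(picked) < n:
        v = rng.choice(frontier)
        frontier.remove(v)
        picked.add(v)
        frontier.extend(children.get(v, []))
    return picked

children = {0: [1, 2], 1: [3, 4], 2: []}   # toy tree, node 0 is the root
sub = sample_subtree(children, 0, 3, random.Random(42))
```

Calling this K times with different random states yields a set of connected sampling subtrees, each bounded by n nodes.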
B12. Within F_sample, a subset F_select of size k is determined, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select maximizes the value evaluation function T() over all size-k subsets F_sub ⊆ F_sample. The T() function evaluates the "value" of the sampling subtree set F_sub for reaching a WebShell conclusion, or equivalently the amount of information F_sub can contribute to the WebShell detection conclusion. The form and meaning of T() are custom-defined; the value evaluation function is defined as:

T(F_sub) = ω_1 σ(F_sub) + ω_2 ψ(F_sub) + ω_3 π(F_sub)
where ω_1, ω_2 and ω_3 are weights, all set to 1 in this scheme; σ(), ψ() and π() are three functions with range [0, 1], measuring respectively the coverage, suspicion and diversity of F_select. In this scheme T() is the linear sum of the three index values. Specifically:
the coverage σ() function: in this scheme F_select is expected to contain as many nodes of T as possible, so as to capture more information about T. Coverage is therefore defined as the ratio of the size of the node set of F_select to |V|:

σ(F_select) = |V_select| / |V|

where V_select is the union of the node sets of the sampling subtrees in F_select.
the suspicion ψ() function. Suppose T_1 and T_2 are two n-node sampling subtrees of the abstract syntax tree T. If T_1 corresponds exactly to the WebShell functional part of the source code while T_2 corresponds to a non-malicious obfuscated code portion, then in the WebShell detection problem T_1 is clearly more "suspicious" than T_2. Likewise, a node v_i can be more suspicious than a node v_j. The suspicion of a node v is therefore defined as:

ψ(v) = c_v^WebShell / c_v^All

where c_v^WebShell is the number of times v appears in the WebShell scripts of the training set and c_v^All is the number of times v appears in all scripts of the training set. The suspicion of an n-node sampling subtree T_sample is defined as the mean suspicion of all its nodes:

ψ(T_sample) = (1 / |V_sample|) Σ_{v ∈ V_sample} ψ(v)

Accordingly, the suspicion of F_select is defined as the average suspicion of all its n-node sampling subtrees.
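The node and subtree suspicion follow directly from their definitions, as this sketch shows (the node types and occurrence counts are invented toy numbers):

```python
from collections import Counter

c_webshell = Counter({"Expr_Eval": 8, "Stmt_Echo": 2})   # occurrences in WebShell scripts
c_all = Counter({"Expr_Eval": 10, "Stmt_Echo": 20})      # occurrences in all scripts

def node_suspicion(v):
    """psi(v) = c_v^WebShell / c_v^All."""
    return c_webshell[v] / c_all[v] if c_all[v] else 0.0

def subtree_suspicion(nodes):
    """Mean suspicion over the nodes of a sampling subtree."""
    return sum(node_suspicion(v) for v in nodes) / len(nodes)
```

A node type that appears mostly in WebShell scripts (here the eval-like node, 8 of 10 occurrences) scores high, while a common benign node scores low, so subtrees covering malicious code are favored by ψ().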
the diversity π() function. If T_1 and T_2 have almost the same node types and structure, then the set {T_1, T_2} is unlikely to provide more useful information than {T_1} alone. The sampling subtrees in F_select should therefore be as dissimilar as possible. The diversity of F_select is defined from the pairwise distances Tree_Diversity(T_i, T_j) between its sampling subtrees, where Tree_Diversity() is a tree distance algorithm that computes the distance between two trees.
B13. The m-ary tree transformation algorithm is applied to all sampling subtrees in F_select, producing the set F_transfer. The m-ary tree transformation limits the number of children of any node to m, and guarantees that any tree of size n has size smaller than 2n after the transformation. F_transfer then replaces the abstract syntax tree T as the input to the detection module.

The m-ary tree transformation algorithm works as follows: if the child node set C of a node v exceeds the size limit, i.e. |C| > m, a layer of padding nodes is inserted between v and C, repeatedly if necessary, until the number of children of v meets the limit. A padding node only reduces the number of children; it contains no syntactic or semantic information and is represented as a 0 vector.

Let T_sample be an n-node sampling subtree and T_transfer the tree obtained after the m-ary tree transformation; clearly no node of T_transfer has more than m children. The padding nodes introduced by the transformation are all internal nodes of T_transfer, so the number of padding nodes is necessarily less than |V_sample|, i.e. the size of T_transfer is less than 2|V_sample|.
At this point, the compression process for the abstract syntax tree is complete.
B2. Vectorized representation of tree nodes. Tree nodes are vectorized with one-hot encoding, the most intuitive vectorization method: the one-hot vectorization of a node v of the abstract syntax tree is denoted one_hot(v) and represents the node type of v. The abstract syntax tree T is vectorized with a bag-of-words model, denoted BoW(T), representing the number of nodes of each type in T.
Let T_transfer = (V_transfer, E_transfer) be generated from an n-node sampling subtree of T = (V, E) through the m-ary tree transformation. For an arbitrary non-padding node v of V_transfer, consider the subtrees of T and T_transfer rooted at v. In the vectorization process, the vectorized representation of v consists of two parts. The first part represents the type of node v, uses one-hot encoding, and is denoted:
Encode(v)=one_hot(v)
The second part represents the nodes of the subtree of T rooted at v that were not sampled, i.e. the set of nodes that have not been "picked back"; its calculation formula is:
For padding nodes, both parts are specified as 0 vectors.
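The two encodings named above, one_hot(v) and BoW(T), can be illustrated with a minimal Python sketch; the node-type list here is a hypothetical example, since the real types come from the parser:

```python
# Hypothetical node-type vocabulary; a real one is derived from the parser.
NODE_TYPES = ["Stmt_Echo", "Expr_FuncCall", "Scalar_String", "Expr_Variable"]

def one_hot(node_type):
    """one_hot(v): a vector with a single 1 at the index of v's type."""
    vec = [0] * len(NODE_TYPES)
    vec[NODE_TYPES.index(node_type)] = 1
    return vec

def bag_of_words(node_types):
    """BoW(T): per-type occurrence counts over all nodes of a tree T."""
    vec = [0] * len(NODE_TYPES)
    for t in node_types:
        vec[NODE_TYPES.index(t)] += 1
    return vec
```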
B3. Vectorized representation of leaf nodes. The leaf nodes are all scalar (Node_Scalar) types, including integers, floating-point numbers, character strings, and the like. The method focuses only on string scalar nodes (Scalar_String) and extracts danger-function features and string statistical features from them.
A danger-function list is established for the scripting language, and each string scalar node is checked against the list for danger-function fields. The danger-function features are vectorized with a bag-of-words model; the feature vector length equals the length of the danger-function list. The string statistical features are interpreted mathematically: after a string is obfuscated, encoded, or encrypted, certain of its statistics usually deviate from the distribution seen in normal scripts. This is also the rationale of NeoPi (an open-source tool published by Neohapsis on GitHub), a Python script that detects malicious code in text and script files using several statistical measures, chiefly file information entropy, longest word, index of coincidence, signature features, and compression ratio. The method selects 4 important NeoPi indices, namely string length, index of coincidence, information entropy, and file compression ratio, and examines each string constant in the script.
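A minimal sketch of the danger-function feature described above; the function list here is a hypothetical illustration of typical PHP danger functions, and the real list would be built for the target scripting language:

```python
# Hypothetical danger-function list (illustrative only).
DANGER_FUNCS = ["eval", "assert", "system", "exec", "base64_decode"]

def danger_feature(string_scalar):
    """Bag-of-words over the danger-function list: count how often each
    listed name appears inside a string scalar node. The vector length
    equals the length of the list, as the text specifies."""
    s = string_scalar.lower()
    return [s.count(f) for f in DANGER_FUNCS]
```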
String Length (Length of String). String constants in normal code are concise, while some WebShells embed code fragments into string constants, so long strings are more likely to appear in WebShell scripts than in normal scripts.
Coincidence Index (Index of Coincidence). The coincidence index is one way to determine whether a file is encrypted or encoded. The calculation formula is as follows:
IC(s) = ∑ f_i(f_i − 1) / (N(N − 1))
wherein f_i represents the number of occurrences of character i in the string s, and N is the length of the string. Statistics show that the coincidence index of meaningful English text is 0.0667, while that of a completely random English string is 0.0385. That is, when the coincidence index of an English string is close to 0.0385, we tend to consider it encrypted or encoded, and further infer that the script is likely a WebShell.
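The coincidence index can be computed directly from the formula; a small Python sketch using collections.Counter for the character frequencies:

```python
from collections import Counter

def index_of_coincidence(s):
    """IC(s) = sum(f_i * (f_i - 1)) / (N * (N - 1)),
    where f_i is the count of character i in s and N = len(s)."""
    n = len(s)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(s).values()) / (n * (n - 1))
```

A single repeated character gives IC = 1.0; a string with all-distinct characters gives IC = 0, matching the intuition that encrypted or encoded strings (near-uniform character use) drive the index down toward the random baseline.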
Entropy of Information (Entropy of Information). Information entropy is a basic concept in information theory and is a measure of the degree of system ordering. The calculation formula is as follows:
H(s) = −∑ p_i · log p_i
wherein p_i is the proportion of character i in the string s. When a string is pseudo-randomized by encryption or encoding, its information entropy increases; hence the larger the entropy value, the higher the likelihood of a WebShell.
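A corresponding Python sketch of the information entropy; log base 2 is assumed here, since the text does not fix the base:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """H(s) = -sum(p_i * log2 p_i), p_i = proportion of character i in s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
```

A constant string has entropy 0, while a string over k equally frequent characters has entropy log2(k); pseudo-random (encrypted or encoded) strings sit near the upper end.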
The file compression ratio, which is defined as the ratio of the uncompressed file size to the compressed file size. The essence of data compression is to exploit imbalance in the character distribution, achieving length optimization by assigning short codes to high-frequency characters while low-frequency characters use long codes. After base64 encoding removes non-ASCII characters, a web page file's byte distribution changes, and its compression ratio deviates from that of a normal script; the ratio is calculated as follows:
wherein zip () represents compressing data and length () represents calculating data length.
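A Python sketch of the compression ratio, using zlib in place of the unspecified zip() routine (an assumption for illustration):

```python
import zlib

def compression_ratio(data: bytes):
    """length(data) / length(zip(data)): highly repetitive (imbalanced)
    data compresses well and yields a large ratio; dense or near-random
    data barely compresses and yields a ratio near (or below) 1."""
    return len(data) / len(zlib.compress(data))
```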
C. The detection module (see fig. 4). The method adopts a deep neural network as the detection module; for the tree structure of the abstract syntax tree it adopts a recursive recurrent neural network. The concrete steps are as follows:
C1. For the tree structure, the scheme defines a new neural network layer: the recursive long short-term memory layer (Recursive_LSTM). The basic idea of the Recursive_LSTM layer is: using the recursive nature of a tree, the vector representation of the tree is generated by a nonlinear operation from the vector representations of its root node and its subtree set.
C2. The vectorized representation of the root node is identical to the vectorized representation of tree nodes in B2; the vector representation of the subtree set is computed by inputting the subtrees sequentially into the long short-term memory layer. Formally, let the root node of the tree T = (V, E) be r, let the child node set of r be C = {c_1, c_2, …, c_i, …, c_|C|}, and let F be the corresponding subtree set, wherein c_i is the root node of the i-th subtree. The calculation formula of the vectorized representation of T is:
wherein an activation function is applied and W_root, W_pickup, and W_subtree are parameters. Encode(F) is the final output obtained by sequentially inputting the vectorized representation of each m-ary tree in F into the LSTM layer, expressed as:
C3. A recursive recurrent neural network (RRNN) is designed as the detection module using the Recursive_LSTM layer. The input to the RRNN comprises two parts: 1) k vectorized m-ary trees, representing intermediate nodes of the abstract syntax tree; 2) a fixed-length vector, representing leaf nodes of the abstract syntax tree. The operation of the RRNN is described as follows:
C31. The bottom of the RRNN consists of k weight-sharing Recursive_LSTM layers that process the k m-ary trees and output, through their operation, a k × d dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T.
C32. The pooling layer simultaneously applies three down-sampling functions, maximum, minimum, and mean, to Feature_R column-wise. The pooling layer thus outputs 3 d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T.
C33. The splicing layer splices Feature_P and the leaf-feature vector f_s into one vector, Feature_A = f_max & f_min & f_mean & f_s (& denotes splicing).
C34. The subsequent fully-connected layer uses Feature_A to make the WebShell judgment.
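Steps C31–C34 above, excluding the learned Recursive_LSTM layers themselves, can be sketched in pure Python; the toy vectors stand in for the real learned features:

```python
# Sketch of pooling (C32) and splicing (C33): given k d-dimensional
# feature vectors from the Recursive_LSTM layers, pool column-wise by
# max/min/mean, then splice with the leaf-feature vector f_s.

def pool_and_splice(feature_r, f_s):
    """feature_r: k vectors of length d.
    Returns the spliced vector of length 3*d + len(f_s)."""
    cols = list(zip(*feature_r))              # d columns of k values each
    f_max = [max(c) for c in cols]
    f_min = [min(c) for c in cols]
    f_mean = [sum(c) / len(c) for c in cols]
    return f_max + f_min + f_mean + list(f_s)  # '&' splice from the text
```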
The judgment threshold is obtained through training and adjusted according to precision and recall; it is not a fixed value. During threshold training, let the precision be U and the recall be V: precision U is the number of correctly extracted items divided by the number of extracted items; recall V is the number of correctly extracted items divided by the number of items in the sample. Both take values between 0 and 1, and the closer the value is to 1, the higher the precision or recall. The two are often in tension: in the extreme case where only one result is retrieved and it is correct, precision is 100% but recall is very low; if all results are returned, recall is 100% but precision is low. In different situations one must therefore decide whether higher precision or higher recall is desired, and adjust the judgment threshold accordingly.
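The precision/recall trade-off used to tune the threshold can be illustrated as follows; the scores and labels are toy values, not outputs of the patent's network:

```python
def precision_recall(scores, labels, threshold):
    """labels: 1 = WebShell. Predict positive when score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Raising the threshold typically trades recall for precision, which is exactly why the patent treats the threshold as a trained, adjustable quantity rather than a constant.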
The details of the RRNN are shown in Table 1-1. During training, the binary cross-entropy function is used as the loss function and stochastic gradient descent (SGD) as the training method, with 32 samples per batch and 1000 training iterations.
Table 1-1. Detailed detection module parameters of the AST_RRNN method
The invention is further illustrated by the following examples.
Example (b):
The scheme adopts supervised training; the mainstream methods for training deep neural networks are stochastic gradient descent (SGD) and its variants. Each step inputs a group of training samples into the neural network and updates the network parameters using the value of the objective function, until that value converges. The specific update moves all parameters in the neural network a small step in the direction of decreasing objective (the direction opposite the gradient).
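The update rule described above can be illustrated on a toy objective f(w) = (w − 3)², whose gradient is 2(w − 3); this is a one-parameter illustration of the step rule, not the patent's training code:

```python
# Gradient descent on f(w) = (w - 3)^2: each step moves the parameter a
# small step against the gradient, converging to the minimizer w = 3.

def sgd_minimize(grad, w, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)   # step opposite the gradient direction
    return w

w_star = sgd_minimize(lambda w: 2 * (w - 3), 0.0)
```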
For this example's sample set, a large number of normal scripts and 6669 WebShell scripts were collected. 100000 scripts are drawn from the normal set to train the tokens' word vectors. From the remaining normal scripts, 6669 are randomly extracted and, together with all WebShell scripts, form the data set of the classification problem.
Table 1-2. Example: training and test set partitioning of the data set

 | Training set | Test set | Total |
---|---|---|---|
WebShell scripts | 5187 | 1482 | 6669 |
Normal scripts | 5187 | 1482 | 6669 |
1) Firstly, using the lexical analysis results of 100000 PHP scripts as input;
2) generating an abstract syntax tree by using the PHP-parser;
3) Determination of the 4 key parameters of the sample generation module: ① n, the limit on tree scale; ② m, the limit on the number of child nodes; ③ K, the size of the sampled subtree set; ④ k, the number of m-ary trees finally input. In the RRNN training process, for any abstract syntax tree T = (V, E), samples are constructed with n fixed to 1000, m fixed to 10, K = min(50, …), and k = min(K, 10). After RRNN model training finishes, 3 of the parameter values are fixed in each run, the remaining variable takes different values in turn, and the detection results are recorded.
4) Testing shows that the detection effect of the AST_RRNN method generally improves as the values of n, m, K, and k increase. Therefore, in the detection process, these 4 parameters can be increased appropriately according to the size of the abstract syntax tree to improve detection accuracy.
The AST_RRNN method utilizes two types of features: ① features extracted from leaf nodes; ② features extracted from the abstract syntax tree. Starting from the trained RRNN, the RRNN parameters are retrained and adjusted using the leaf-node features and the abstract syntax tree features separately.
1) Accuracy 0.9886 when both leaf-node features and abstract syntax tree features are used as input;
2) accuracy 0.7649 when only leaf-node features are used;
3) accuracy 0.8659 when only abstract syntax tree features are used.
The detection effect with abstract syntax tree features is clearly better than with leaf-node features alone, showing that the structural information in the abstract syntax tree is important for WebShell detection. Moreover, using either feature alone reduces accuracy by at least 10%. The reason is that the leaf-node features describe key information of the data transmission part, while the abstract syntax tree precisely describes the data execution part; together they guarantee the detection result of the AST_RRNN method.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
Claims (10)
1. A WebShell detection method based on a deep neural network, characterized in that a recursive recurrent neural network based on an abstract syntax tree automatically acquires lexical and syntactic information of a script for a scripting language, and completes feature extraction and WebShell detection by utilizing the hierarchical structural features of the abstract syntax tree, wherein the WebShell detection method comprises a preprocessing process, a sample generation process and a detection process; the method specifically comprises the following steps:
A. the script file preprocessing process comprises the following steps:
the input is script source code; the preprocessing comprises lexical analysis, syntactic analysis and simplification; the output is an abstract syntax tree T = (V, E), wherein V is the set of leaf nodes in T and E is the set of edges in T;
B. and (3) a sample generation process:
inputting leaf nodes comprising the simplified abstract syntax tree and the abstract syntax tree; the method comprises the following steps: compressing the abstract syntax tree and vectorizing the abstract syntax tree, wherein the vectorized abstract syntax tree comprises vectorized expressions of tree nodes and leaf nodes;
C. adopting a deep neural network to carry out WebShell detection: aiming at the tree structure of the abstract syntax tree, the deep neural network adopts a recursion cycle neural network; the method comprises the following steps:
C1. defining a neural network layer as a recursive long-term and short-term memory layer aiming at a tree structure, wherein the recursive long-term and short-term memory layer utilizes the recursive characteristic of the tree and is represented by vectors of a root node and a subtree set of the tree, and the vectors of the tree are represented by nonlinear operation;
C2. the vectorization representation method of the root node in the tree structure adopts the same method as the vectorization representation of the tree node in the step B; vector representation of a sub-tree set in the tree structure is generated by inputting sub-trees into a recursive long-short term memory layer in sequence;
C3. designing a recurrent neural network RRNN as a detection module by utilizing the recurrent long and short term memory layer;
the inputs to the RRNN include: k vectorized m-order multi-way trees representing intermediate nodes of the abstract syntax tree; a fixed-length vector representing a leaf node of the abstract syntax tree; the operation process of the RRNN comprises the following steps:
the bottom of the RRNN comprises k weight-sharing recursive long short-term memory layers corresponding to the k m-ary trees, which output through calculation a k × d dimensional feature, denoted Feature_R = [f_1, f_2, …, f_k]^T, wherein f_k is the k-th d-dimensional feature vector;
C32. the pooling layer of the RRNN simultaneously applies three down-sampling functions, maximum, minimum and mean, to Feature_R column-wise; the pooling layer outputs three d-dimensional vectors, denoted Feature_P = [f_max, f_min, f_mean]^T, wherein f_max is the d-dimensional vector output by the pooling layer using the maximum sampling function, f_min the d-dimensional vector output using the minimum sampling function, and f_mean the d-dimensional vector output using the mean sampling function;
C33. the splicing layer of the RRNN splices Feature_P and the vector f_s corresponding to the leaf features into one vector, obtaining the spliced feature vector Feature_A = f_max & f_min & f_mean & f_s, wherein & represents splicing;
C34. the fully-connected layer of the RRNN uses the feature vector Feature_A to make the WebShell judgment.
2. The WebShell detection method as recited in claim 1, wherein the step of preprocessing the script file specifically comprises:
A1. performing lexical analysis on the program codes to generate a lexical unit stream;
A2. performing syntactic analysis on the lexical unit stream to construct an abstract syntax tree;
A3. filtering the syntactically analyzed lexical unit stream to remove semantically irrelevant information, thereby simplifying the abstract syntax tree.
3. The WebShell detection method of claim 2, wherein the step a3 of simplifying the abstract syntax tree comprises the steps of:
A31. deleting all leaf nodes of the abstract syntax tree; meanwhile, when generating samples, the leaf nodes are vectorized by a simple feature engineering method, so the leaf-node features are not lost;
A32. intermediate nodes of the abstract syntax tree retain only the declaration, expression and scalar node types, ignoring the auxiliary types.
4. The WebShell detection method of claim 1, wherein the step B sample generation process comprises:
B1. compression of abstract syntax trees: limiting the size of the abstract syntax tree by using an n-node sampling sub-tree and an m-ary tree transformation method; vectorization representation of the leaf nodes can be completed by utilizing a characteristic engineering method;
B2. vectorization represents tree nodes: the vectorization coding method adopts a one-hot coding method, and adopts a node v of a one-hot coding vectorization abstract syntax tree, which is marked as one _ hot (v) and represents the node type of v; adopting a bag-of-words model vectorization abstract syntax tree T, recording as BoW (T), and representing the number of each type of nodes in the T;
B3. vectorization represents a leaf node: and extracting the leaf nodes of the character string scalar type to obtain a danger function characteristic and a character string statistical characteristic.
5. The WebShell detection method of claim 4, wherein step B1 includes the following steps:
B11. for any abstract syntax tree T = (V, E), repeatedly calling the n-node sampling subtree algorithm K times to generate a sampling subtree set of size K, denoted F_sample, wherein every sampled subtree in F_sample has scale not exceeding n, the K-th element being the subtree generated by the K-th call of the n-node sampling subtree algorithm;
B12. from F_sample, obtaining a subset F_select of size k, i.e. F_select ⊆ F_sample and |F_select| = k, such that F_select satisfies the following formula:
wherein the T() function is a value evaluation function that evaluates the 'value', for the WebShell judgment conclusion, of the sampling subtree set represented by the argument F_sub, or equivalently the amount of information that F_sub can contribute to the WebShell detection conclusion; the value evaluation function T() is defined as follows:
wherein ω_1, ω_2, ω_3 are constants; the coverage function σ(), the suspicion function and the diversity function π() all have the interval [0,1] as value range and measure, respectively, the coverage, suspicion and diversity of F_select; the value evaluation function T() is the linear sum of the three types of metric values.
6. The WebShell detection method of claim 5, wherein, in the definition of the value evaluation function T (),
wherein ω_1, ω_2, ω_3 are constants; preferably, ω_1, ω_2 and ω_3 are all set to 1.
7. The WebShell detection method of claim 5, wherein the coverage function σ() is the ratio of the size of the node set of F_select to |V|:
The suspicion function is defined as follows: suppose two n-node sampling subtrees of the abstract syntax tree T are given; if one corresponds exactly to the WebShell functional part of the source code while the other corresponds to a non-malicious obfuscated code part, then for WebShell detection the former has more 'suspicion' than the latter; likewise a node v_i may have more 'suspicion' than a node v_j; the suspicion of a node v is defined as:
wherein c_v^WebShell represents the number of times v appears in all WebShell scripts of the training set, and c_v^All the number of times v appears in all scripts of the training set; the suspicion of an n-node sampling subtree T_sample is defined as the mean of the suspicions of all its nodes:
accordingly, the suspicion of F_select is defined as the average suspicion of all n-node sampling subtrees in F_select:
the definition of the diversity function pi () is specifically: definition FselectThe diversity of (a) is as follows:
wherein, the distance between two trees is calculated through Tree _ Diversity ().
8. The WebShell detection method of claim 1, wherein in step C2, assuming the root node of the tree T = (V, E) is r, the child node set of r is C = {c_1, c_2, …, c_i, …, c_|C|}, and the corresponding subtree set is F, wherein c_i is the root node of the i-th subtree and c_|C| is the last child node of the root node r; the tree T is represented in vectorized form by formula 1:
9. The WebShell detection method of claim 1, wherein in step C34 the fully-connected layer uses the feature vector Feature_A to make the WebShell decision; specifically, a decision threshold is given, and when Feature_A exceeds the decision threshold, the file is identified as a WebShell file.
10. A WebShell detection system realized by the WebShell detection method of any one of claims 1-9, comprising a preprocessing module, a sample generation module and a detection module;
the preprocessing module takes script source codes as input by using a syntax analyzer and outputs an abstract syntax tree through syntax analysis;
the sample generation module is configured to translate an abstract syntax tree into vector expressions that facilitate training and prediction by a detection module, and includes: performing vectorization representation on leaf nodes by adopting feature engineering and utilizing a simple matching rule and a statistic calculation method; limiting the scale of an abstract syntax tree part consisting of intermediate nodes through a sampling algorithm, and replacing an original abstract syntax tree by using a group of sampling subtrees with smaller scale;
the detection module is a deep neural network model that constructs a recurrent neural network, defines a custom recursive long short-term memory layer, and provides bottom-up operation on the tree structure; the bottom of the recurrent neural network consists of k parameter-sharing Recursive_LSTM layers whose input is k tree structures; the operation result, after processing by a pooling layer, is spliced with the vector expression of the leaf nodes and finally input into a fully-connected layer for WebShell detection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710705914.1A CN107516041B (en) | 2017-08-17 | 2017-08-17 | WebShell detection method and system based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107516041A CN107516041A (en) | 2017-12-26 |
CN107516041B true CN107516041B (en) | 2020-04-03 |
Family
ID=60723188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710705914.1A Active CN107516041B (en) | 2017-08-17 | 2017-08-17 | WebShell detection method and system based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107516041B (en) |
Families Citing this family (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108376283B (en) * | 2018-01-08 | 2020-11-03 | 中国科学院计算技术研究所 | Pooling device and pooling method for neural network |
CN108388425B (en) * | 2018-03-20 | 2021-02-19 | 北京大学 | Method for automatically completing codes based on LSTM |
US11132444B2 (en) * | 2018-04-16 | 2021-09-28 | International Business Machines Corporation | Using gradients to detect backdoors in neural networks |
CN110502897A (en) * | 2018-05-16 | 2019-11-26 | 南京大学 | A kind of identification of webpage malicious JavaScript code and antialiasing method based on hybrid analysis |
CN109101235B (en) * | 2018-06-05 | 2021-03-19 | 北京航空航天大学 | Intelligent analysis method for software program |
CN108898015B (en) * | 2018-06-26 | 2021-07-27 | 暨南大学 | Application layer dynamic intrusion detection system and detection method based on artificial intelligence |
CN108985061B (en) * | 2018-07-05 | 2021-10-01 | 北京大学 | Webshell detection method based on model fusion |
CN109120617B (en) * | 2018-08-16 | 2020-11-17 | 辽宁大学 | Polymorphic worm detection method based on frequency CNN |
CN109240922B (en) * | 2018-08-30 | 2021-07-09 | 北京大学 | Method for extracting webshell software gene to carry out webshell detection based on RASP |
CN109462575B (en) * | 2018-09-28 | 2021-09-07 | 东巽科技(北京)有限公司 | Webshell detection method and device |
CN109657466A (en) * | 2018-11-26 | 2019-04-19 | 杭州英视信息科技有限公司 | A kind of function grade software vulnerability detection method |
CN109635563A (en) * | 2018-11-30 | 2019-04-16 | 北京奇虎科技有限公司 | The method, apparatus of malicious application, equipment and storage medium for identification |
CN109684844B (en) * | 2018-12-27 | 2020-11-20 | 北京神州绿盟信息安全科技股份有限公司 | Webshell detection method and device, computing equipment and computer-readable storage medium |
CN109905385B (en) * | 2019-02-19 | 2021-08-20 | 中国银行股份有限公司 | Webshell detection method, device and system |
CN111614599B (en) * | 2019-02-25 | 2022-06-14 | 北京金睛云华科技有限公司 | Webshell detection method and device based on artificial intelligence |
CN111611150B (en) * | 2019-02-25 | 2024-03-22 | 北京搜狗科技发展有限公司 | Test method, test device, test medium and electronic equipment |
CN109933602B (en) * | 2019-02-28 | 2021-05-04 | 武汉大学 | Method and device for converting natural language and structured query language |
CN110086788A (en) * | 2019-04-17 | 2019-08-02 | 杭州安恒信息技术股份有限公司 | Deep learning WebShell means of defence based on cloud WAF |
CN110232280B (en) * | 2019-06-20 | 2021-04-13 | 北京理工大学 | Software security vulnerability detection method based on tree structure convolutional neural network |
CN110362597A (en) * | 2019-06-28 | 2019-10-22 | 华为技术有限公司 | A kind of structured query language SQL injection detection method and device |
CN110855661B (en) * | 2019-11-11 | 2022-05-13 | 杭州安恒信息技术股份有限公司 | WebShell detection method, device, equipment and medium |
CN111198817B (en) * | 2019-12-30 | 2021-06-04 | 武汉大学 | SaaS software fault diagnosis method and device based on convolutional neural network |
CN113094706A (en) * | 2020-01-08 | 2021-07-09 | 深信服科技股份有限公司 | WebShell detection method, device, equipment and readable storage medium |
CN111741002B (en) * | 2020-06-23 | 2022-02-15 | 广东工业大学 | Method and device for training network intrusion detection model |
CN112118225B (en) * | 2020-08-13 | 2021-09-03 | 紫光云(南京)数字技术有限公司 | Webshell detection method and device based on RNN |
CN112035099B (en) * | 2020-09-01 | 2024-03-15 | 北京天融信网络安全技术有限公司 | Vectorization representation method and device for nodes in abstract syntax tree |
CN112132262B (en) * | 2020-09-08 | 2022-05-20 | 西安交通大学 | Recurrent neural network backdoor attack detection method based on interpretable model |
CN112487368B (en) * | 2020-12-21 | 2023-05-05 | 中国人民解放军陆军炮兵防空兵学院 | Function level confusion detection method based on graph convolution network |
CN113190849B (en) * | 2021-04-28 | 2023-03-03 | 重庆邮电大学 | Webshell script detection method and device, electronic equipment and storage medium |
US20220405572A1 (en) * | 2021-06-17 | 2022-12-22 | Cylance Inc. | Methods for converting hierarchical data |
CN113810375B (en) * | 2021-08-13 | 2023-01-20 | 网宿科技股份有限公司 | Webshell detection method, device and equipment and readable storage medium |
CN114462033A (en) * | 2021-12-21 | 2022-05-10 | 天翼云科技有限公司 | Method and device for constructing script file detection model and storage medium |
CN114499944B (en) * | 2021-12-22 | 2023-08-08 | 天翼云科技有限公司 | Method, device and equipment for detecting WebShell |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101895420A (en) * | 2010-07-12 | 2010-11-24 | 西北工业大学 | Rapid detection method for network flow anomaly |
CN103971054A (en) * | 2014-04-25 | 2014-08-06 | 天津大学 | Detecting method of browser extension loophole based on behavior sequence |
CN105069355A (en) * | 2015-08-26 | 2015-11-18 | 厦门市美亚柏科信息股份有限公司 | Static detection method and apparatus for webshell deformation |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101895420A (en) * | 2010-07-12 | 2010-11-24 | 西北工业大学 | Rapid detection method for network flow anomaly |
CN103971054A (en) * | 2014-04-25 | 2014-08-06 | 天津大学 | Detecting method of browser extension loophole based on behavior sequence |
CN105069355A (en) * | 2015-08-26 | 2015-11-18 | 厦门市美亚柏科信息股份有限公司 | Static detection method and apparatus for webshell deformation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107516041B (en) | WebShell detection method and system based on deep neural network | |
CN111639344B (en) | Vulnerability detection method and device based on neural network | |
CN111428044B (en) | Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes | |
WO2020259260A1 (en) | Structured query language (sql) injection detecting method and device | |
CN108446540B (en) | Program code plagiarism type detection method and system based on source code multi-label graph neural network | |
CN108170736B (en) | Document rapid scanning qualitative method based on cyclic attention mechanism | |
CN113596007B (en) | Vulnerability attack detection method and device based on deep learning | |
Xiaomeng et al. | CPGVA: code property graph based vulnerability analysis by deep learning | |
CN111600919B (en) | Method and device for constructing intelligent network application protection system model | |
CN111737289B (en) | Method and device for detecting SQL injection attack | |
CN107229563A (en) | A kind of binary program leak function correlating method across framework | |
CN111597803B (en) | Element extraction method and device, electronic equipment and storage medium | |
CN107341399A (en) | Assess the method and device of code file security | |
CN113190849A (en) | Webshell script detection method and device, electronic equipment and storage medium | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN114201406B (en) | Code detection method, system, equipment and storage medium based on open source component | |
CN111758098A (en) | Named entity identification and extraction using genetic programming | |
CN115033890A (en) | Comparison learning-based source code vulnerability detection method and system | |
CN109067708B (en) | Method, device, equipment and storage medium for detecting webpage backdoor | |
CN117370980A (en) | Malicious code detection model generation and detection method, device, equipment and medium | |
CN117633811A (en) | Code vulnerability detection method based on multi-view feature fusion | |
CN113971283A (en) | Malicious application program detection method and device based on features | |
CN116226864A (en) | Network security-oriented code vulnerability detection method and system | |
CN111562943B (en) | Code clone detection method and device based on event embedded tree and GAT network | |
Jha et al. | Deepmal4j: Java malware detection employing deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||