CN112800427B - Webshell detection method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112800427B
Authority
CN
China
Prior art keywords
sequence
feature vector
character string
token sequence
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110374845.7A
Other languages
Chinese (zh)
Other versions
CN112800427A (en
Inventor
徐国爱
徐国胜
王晨宇
王浩宇
程柏钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110374845.7A priority Critical patent/CN112800427B/en
Publication of CN112800427A publication Critical patent/CN112800427A/en
Application granted granted Critical
Publication of CN112800427B publication Critical patent/CN112800427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One or more embodiments of the present specification provide a webshell detection method, apparatus, electronic device, and storage medium. The method includes: parsing PHP source code to obtain a token sequence and a string constant sequence; symbolizing the token sequence to obtain a symbolized token sequence; vectorizing the symbolized token sequence to obtain a token sequence feature vector; vectorizing the string constant sequence to obtain a string constant sequence feature vector; and processing the token sequence feature vector and the string constant sequence feature vector with a webshell detection model to obtain a webshell detection result. The method detects obfuscated samples more effectively, reduces the false alarm rate on benign samples, and improves the overall accuracy of webshell detection.

Description

Webshell detection method and device, electronic equipment and storage medium
Technical Field
One or more embodiments of the present disclosure relate to the field of information security technologies, and in particular, to a webshell detection method, an apparatus, an electronic device, and a storage medium.
Background
Webshells written in PHP take many different forms and are difficult to detect. Existing traffic-based detection methods detect webshells with low accuracy.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method and an apparatus for detecting a webshell, an electronic device, and a storage medium, so as to solve the problem that the accuracy of detecting the webshell is not high.
In view of the above, one or more embodiments of the present specification provide a webshell detection method, including:
parsing the PHP source code to obtain a token sequence and a string constant sequence;
symbolizing the token sequence to obtain the symbolized token sequence;
vectorizing the symbolized token sequence to obtain a token sequence feature vector; vectorizing the string constant sequence to obtain a string constant sequence feature vector;
and processing the token sequence feature vector and the string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result.
In some embodiments, the parsing the PHP source code to obtain a token sequence and a string constant sequence includes:
performing lexical analysis and syntactic analysis on the PHP source code to generate an abstract syntax tree;
traversing the abstract syntax tree and extracting a control flow;
performing control flow analysis on the control flow to obtain an operation sequence;
and analyzing the operation sequence to obtain the token sequence and the character string constant sequence.
In some embodiments, the token sequence comprises:
variable names, function names, numeric constants, and string constants.
In some embodiments, the symbolizing the token sequence comprises:
mapping all the variable names into a first identifier and adding a first independent index;
mapping all the function names into a second identifier and adding a second independent index;
mapping all the numerical constants to a third identifier;
replacing all the string constants with StringLiteral.
In some embodiments, the vectorizing the token sequence after the symbolizing to obtain a token sequence feature vector includes:
representing the symbolized token sequence by character-granularity n-grams using the fastText method.
In some embodiments, the webshell detection model comprises:
a deep pyramid convolutional neural network layer, a recurrent neural network layer, and a fully connected layer;
the step of processing the token sequence feature vector and the string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result comprises the following steps:
processing the token sequence feature vector by using the deep pyramid convolutional neural network layer to obtain the processed token sequence feature vector;
processing the string constant sequence feature vector by using the recurrent neural network layer to obtain the processed string constant sequence feature vector;
splicing the processed token sequence feature vector and the processed string constant sequence feature vector to obtain a total feature vector;
and inputting the total feature vector into the fully connected layer to obtain the webshell detection result.
In some embodiments, the recurrent neural network layer consists of a recurrent neural network based on gated recurrent units (GRU) and an attention mechanism.
Based on the same inventive concept, one or more embodiments of the present specification further provide a webshell detection apparatus, including:
the analysis module is configured to analyze the PHP source code to obtain a token sequence and a character string constant sequence;
a symbolization module configured to symbolize the token sequence to obtain the token sequence after symbolization;
the vectorization module is configured to vectorize the symbolized token sequence to obtain a token sequence feature vector, and vectorize the string constant sequence to obtain a string constant sequence feature vector;
and the detection module is configured to process the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result.
Based on the same inventive concept, one or more embodiments of the present specification further provide an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable by the processor, and the processor implements the method of any one of the above embodiments when executing the computer program.
Based on the same inventive concept, one or more embodiments of the present specification also provide a non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to implement the method of any one of the above embodiments.
As can be seen from the foregoing, the webshell detection method, apparatus, electronic device, and storage medium provided in one or more embodiments of the present disclosure use a token embedding method to efficiently extract the code segments of the PHP source code that undergo string transformation operations, and accurately determine from the transformation type whether the PHP source code is a webshell. The method therefore detects obfuscated samples more effectively, reduces the false alarm rate on benign samples, and improves the overall accuracy of webshell detection.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a flow diagram of a webshell detection method in accordance with one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram of a webshell detection method in accordance with one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an electronic device in accordance with one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
At present, research on webshell detection at home and abroad focuses mainly on PHP webshells. The main approaches are detection based on static features, detection based on traffic features, detection based on behavioral features, and detection based on logs.
Detection based on traffic and behavioral features is dynamic detection. Yang J et al. propose a semantics-aware method that detects malicious web traffic by analyzing web access traffic and extracting URL resources and URL queries. Wang Yingjun et al. extract the parameter names and parameter values of HTTP requests in webshell traffic as features and detect webshells with a machine learning model. Du Haizhang et al. perform real-time dynamic detection of PHP webshells by monitoring the compilation and execution of PHP code inside a PHP extension, so that malicious PHP webshells are detected and blocked. Wrench et al. also propose a dynamic detection method that improves recall through de-obfuscation and similarity matrix analysis, but it has a higher false alarm rate. Canali D et al. build a honeypot website for real-time monitoring, analyze the behavior of malicious code, and establish a malicious behavior library for detecting webshells; this approach consumes a large amount of resources, is difficult to deploy, and can only detect webshells in an active state.
Log-based detection can be used for investigation after a system has been compromised: it detects whether a webshell exists by distinguishing the characteristics of normal web pages and webshells in log files. Shi Liu et al. analyze server log text and distinguish the logs of webshells from those of normal web page files by matching text features, statistical features, and page association features. Pan Jie clusters logs using a one-class support vector machine and a neural network algorithm and optimizes the model with a genetic algorithm, which improves its accuracy. Miao Xie et al. detect suspected webshell behavior in logs with the K-Nearest Neighbor (KNN) algorithm. Yixin Wu et al. extract features from raw sequence data in web logs, identify sessions with a statistical method based on time intervals, and find that an LSTM-based model maintains high recall and accuracy. However, log-based detection may generate a large number of false positives, and reading and writing a large volume of logs may affect server performance.
Detection based on file features is static detection: the webshell is detected through the static features of the file. Detection is fast and can take place before an attacker executes the webshell code, which makes this the most common webshell detection approach at present. Truong et al. propose an optimal-threshold-based approach to identify files containing malicious code in web applications. Statistical methods count the frequencies of malicious functions, command execution functions, and the like in webshell files, but they are easily bypassed by encrypted or obfuscated webshells. Another method constructs an abstract syntax tree, performs risk evaluation on the tree using a node scoring table, locates taint subtrees at dangerous nodes, and finally performs matching evaluation with a risk model to detect webshells.
As machine learning models have matured, machine learning has been widely applied to webshell detection to improve effectiveness and accuracy. Some researchers use the word2vec model to represent each word in an HTTP request as a vector and each web request as a fixed-size matrix, then separate malicious webshell samples from normal ones with a CNN-based model, achieving higher precision. Others build a PHP webshell detection model using fastText classifiers and the random forest algorithm. Others use PHP opcode sequences as features and detect webshells by combining TF-IDF and Word2Vec, respectively, with a Multi-Layer Perceptron (MLP) neural network. Yi Nan et al. propose a webshell detection method based on semantic analysis that extracts taint subtrees using an abstract syntax tree (AST) and a manually defined risk feature library, calculates the risk degree of a file, and makes a qualitative judgment against a manually set threshold.
Webshell attacks mainly occur in the vulnerability exploitation stage of a web attack: after confirming that the target system has a weakness, the attacker crafts an attack payload to gain the ability to deliver malicious files to the target application and execute instructions. To attack the system further, maintain persistence, or tamper with, extract, or destroy sensitive data, the attacker implants webshell malicious code into the system. File-based detection analyzes the file content written to disk when the webshell has just been implanted and judges whether it is a webshell; it therefore provides early warning. After a backdoor has been implanted, the attacker interacts with the webshell script and generates traffic; the backdoor receives instructions, executes malicious code, and produces behavior with certain characteristics. Traffic-based detection extracts and judges features from the traffic exchanged between the attacker and the webshell, and behavior-based detection analyzes the abnormal behavior of the webshell while it runs; both belong to the same category of detection. After the HTTP-based data interaction is complete, operation and access logs remain in the web system. Log-based detection analyzes the web logs globally and finds the differences between normal access logs and webshell access logs, thereby determining whether a webshell attack has occurred and locating the webshell.
When responding to a real-time intrusion, detection based on file features has a clear advantage in timeliness. Moreover, traffic-based and behavior-based detection formulate their rules from the common traffic or behavioral characteristics of webshell malicious code files, which in turn relies on analyzing those files. Likewise, the interaction between an attacker and a webshell generates a large number of access logs, and log-based detection must first analyze the differences between normal and malicious files before it can judge whether a file is malicious along dimensions such as time, access frequency, and isolated pages.
As described in the background section, PHP is a dynamically and weakly typed language whose parameter passing, type conversion, and function call mechanisms are very flexible. This flexibility is convenient for developers, but it also gives attackers many new ways to construct one-sentence webshells. The result is a wide variety of small webshells that are difficult to detect. Existing methods detect such small PHP webshells poorly and with low accuracy.
In the course of implementing the present disclosure, the applicant found that analyzing the operation sequence of a PHP script based on the token sequence of the PHP abstract syntax tree (AST), as in the webshell detection method, apparatus, electronic device, and storage medium provided in one or more embodiments of the present disclosure, makes up for the shortcomings of existing methods, helps security personnel locate webshells quickly and accurately, and further improves the accuracy of webshell detection.
In view of this, one or more embodiments of the present disclosure provide a method and an apparatus for detecting a webshell, an electronic device, and a storage medium, so as to solve the problem that the accuracy of detecting the webshell is not high.
Hereinafter, the technical solution of the present disclosure is described in further detail through specific embodiments with reference to FIG. 1, a flowchart of a webshell detection method according to one or more embodiments of the present specification; FIG. 2, a schematic diagram of a webshell detection method according to one or more embodiments of the present specification; and FIG. 3, a schematic diagram of an electronic device according to one or more embodiments of the present specification.
One or more embodiments of the present specification provide a webshell detection method, which can be implemented by any device, platform, or cluster having computing and processing capabilities.
With reference to FIG. 1, a flowchart of a webshell detection method according to one or more embodiments of the present disclosure, and FIG. 2, a schematic diagram of a webshell detection method according to one or more embodiments of the present disclosure, the method includes:
S1: parsing the PHP source code to obtain a token sequence and a string constant sequence;
In some embodiments, the parsing the PHP source code to obtain a token sequence and a string constant sequence includes:
performing lexical analysis and syntactic analysis on the PHP source code to generate an abstract syntax tree;
traversing the abstract syntax tree and extracting a control flow;
performing control flow analysis on the control flow to obtain an operation sequence;
and analyzing the operation sequence to obtain the token sequence and the character string constant sequence.
In this step, lexical analysis and syntactic analysis are first performed on the input PHP source code, which is abstracted into an intermediate representation from which the abstract syntax tree (AST) of the PHP source code is generated. The AST is then traversed, and the control flow of the PHP source code is extracted from different nodes, such as function definition nodes and function call nodes, and analyzed. Once control flow analysis is complete, the operation sequence of the program is obtained; it contains all operations executed by the PHP source code, such as assignment operations and function call operations. For a call to a user-defined function, the operation sequence contained in the function is recorded and the call is expanded into that sequence; calls to class methods are handled in the same way. The operation sequence of the PHP source code is thus extracted and then parsed to obtain the token sequence and the string constant sequence. The token sequence obtained from the abstract syntax tree of the PHP source code contains only terminal nodes, in the order given by a pre-order traversal of the tree.
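The expansion of user-defined function calls described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the operation records are hypothetical plain strings such as "assign" or "call:name", and the names `expand_calls`, `functions`, and `main_ops` are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical operation records: plain strings such as "assign" or "call:run".
Operation = str


def expand_calls(ops: List[Operation],
                 functions: Dict[str, List[Operation]],
                 _stack: Tuple[str, ...] = ()) -> List[Operation]:
    """Replace each call to a user-defined function with that function's operations."""
    expanded: List[Operation] = []
    for op in ops:
        name = op[len("call:"):] if op.startswith("call:") else None
        if name in functions and name not in _stack:
            # User-defined function: inline its recorded operation sequence.
            # _stack guards against endless expansion of recursive functions.
            expanded.extend(expand_calls(functions[name], functions, _stack + (name,)))
        else:
            # Built-in call, recursive call, or non-call operation: keep unchanged.
            expanded.append(op)
    return expanded


# Hypothetical script: main code calls run(), which calls decode() and eval().
functions = {
    "run": ["assign", "call:decode", "call:eval"],
    "decode": ["call:base64_decode"],
}
main_ops = ["assign", "call:run", "echo"]
print(expand_calls(main_ops, functions))
```

In this sketch, calls to functions that are not in `functions` (for example built-ins) stay in the sequence as single operations, which is one possible reading of the text rather than something the source states.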
Specifically, for the following PHP source code, two string constants, assert and a, exist in the source code:
<?php
($a = 'assert')&&($b = $_POST['a'])&& call_user_func_array($a, array($b));
?>
The following token sequence is obtained by parsing:
( $var1 = StringLiteral ) && ( $var2 = $_POST [ StringLiteral ] ) && call_user_func_array ( $var1 , array ( $var2 ) ) ;
and the following string constant sequence:
assert, a
Each element of the token sequence is a token, including the brackets, the && symbols, and so on. The tokens are separated by spaces here, and the comma that appears in the token sequence is itself a token. The variable names $a and $b of the original PHP source code are mapped to $var1 and $var2, and the string constants are handled likewise, each being replaced with StringLiteral.
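As a rough illustration of how the example statement splits into tokens and string constants, the following Python sketch uses a small regular-expression lexer. This is a simplification and an assumption of this example: the method described here works on an abstract syntax tree produced by a PHP parser, and the names `TOKEN_RE` and `tokenize` are hypothetical. Variable renaming is deliberately left out, so `$a` and `$b` remain unchanged.

```python
import re
from typing import List, Tuple

# Simplified lexer (assumption): it only covers the token kinds in the example.
# Characters it does not recognise are silently skipped.
TOKEN_RE = re.compile(r"""
    '(?:[^'\\]|\\.)*'        # single-quoted string constant
  | "(?:[^"\\]|\\.)*"        # double-quoted string constant
  | \$[A-Za-z_]\w*           # variable name
  | [A-Za-z_]\w*             # function name or keyword
  | \d+(?:\.\d+)?            # numeric constant
  | &&|\|\||[=()\[\],;]      # operators and punctuation used in the example
""", re.VERBOSE)


def tokenize(php: str) -> Tuple[List[str], List[str]]:
    """Return (token sequence, string constant sequence) for a PHP fragment."""
    tokens: List[str] = []
    strings: List[str] = []
    for tok in TOKEN_RE.findall(php):
        if tok[0] in ("'", '"'):
            strings.append(tok[1:-1])        # keep the constant without its quotes
            tokens.append("StringLiteral")   # placeholder in the token sequence
        else:
            tokens.append(tok)
    return tokens, strings


source = "($a = 'assert')&&($b = $_POST['a'])&& call_user_func_array($a, array($b));"
tokens, strings = tokenize(source)
print(" ".join(tokens))
print(strings)
```

Keeping the string constants in a separate list while leaving a placeholder in the token sequence mirrors the two outputs of step S1.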
In some embodiments, the token sequence comprises: variable names, function names, numeric constants, and string constants.
Specifically, the token types in the token sequence of the present application include: variable names, such as $a and $var; function names, such as call_user_func_array; numerical constants, such as 10 and 1.1; and string constants, such as assert and a.
S2: symbolizing the token sequence to obtain the symbolized token sequence;
in some embodiments, the symbolizing the token sequence comprises:
mapping all the variable names into a first identifier and adding a first independent index;
mapping all the function names into a second identifier and adding a second independent index;
mapping all the numerical constants to a third identifier;
replacing all the string constants with StringLiteral.
Specifically, symbolization normalizes certain tokens in the token sequence. For example, $a and $count are both variable names; after symbolization they become $var1 and $var2, meaning variable 1 and variable 2, which brings the vectors of such tokens closer together. All variable names are mapped to a first identifier with a first independent index appended, e.g., $a and $count are symbolized as $var1 and $var2. All function names are mapped to a second identifier with a second independent index appended. All numerical constants are mapped to a third identifier, numLiteral, e.g., the statement "$a = 10" is symbolized as $var1 = numLiteral. All string constants are replaced with StringLiteral.
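The symbolization rules can be sketched in Python as follows. Several details are assumptions of this example rather than statements of the source: the identifier `func` for function names, the decision to rename only functions listed in `user_functions`, and the decision to leave superglobals such as `$_POST` untouched (the worked example keeps `$_POST` and `call_user_func_array` unchanged, which suggests but does not confirm these choices).

```python
import re
from typing import Dict, FrozenSet, List

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# Assumption: PHP superglobals are kept as they are, as in the worked example.
SUPERGLOBALS = {"$_POST", "$_GET", "$_REQUEST", "$_COOKIE", "$_SERVER", "$_FILES"}


def symbolize(tokens: List[str],
              user_functions: FrozenSet[str] = frozenset()) -> List[str]:
    """Normalize variable names, function names, numbers and strings in a token list."""
    var_ids: Dict[str, str] = {}
    func_ids: Dict[str, str] = {}
    out: List[str] = []
    for tok in tokens:
        if tok.startswith("$") and tok not in SUPERGLOBALS:
            # First identifier plus an independent index: $var1, $var2, ...
            var_ids.setdefault(tok, f"$var{len(var_ids) + 1}")
            out.append(var_ids[tok])
        elif tok in user_functions:
            # Second identifier plus an independent index (hypothetical name "func").
            func_ids.setdefault(tok, f"func{len(func_ids) + 1}")
            out.append(func_ids[tok])
        elif NUMBER_RE.fullmatch(tok):
            out.append("numLiteral")        # third identifier
        elif tok and tok[0] in ("'", '"'):
            out.append("StringLiteral")     # quoted string constant
        else:
            out.append(tok)                 # punctuation, operators, other names
    return out


print(symbolize(["$a", "=", "10", ";", "$count", "=", "$a", ";"]))
```

Because each distinct variable receives its own index, repeated uses of the same variable map to the same symbol, so data flow between statements remains visible after normalization.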
S3: vectorizing the symbolized token sequence to obtain a token sequence feature vector; vectorizing the string constant sequence to obtain a string constant sequence feature vector;
In some embodiments, the vectorizing the symbolized token sequence to obtain a token sequence feature vector includes:
representing the symbolized token sequence by character-granularity n-grams using the fastText method.
Specifically, the input of a neural network requires a specific feature representation, namely a vector, so the data must be embedded as vectors before being fed to the network. In webshell detection the data set consists of a small amount of source code, and the scarcity of samples can reduce accuracy. Because the symbolized token sequence contains tokens with similar subwords, such as $var1 and $var2, the present application uses fastText to generate pre-trained vectors. fastText is a fast text classification algorithm with the following advantages over classification algorithms based on neural networks: fastText speeds up training and testing while maintaining high accuracy, and fastText can train word vectors by itself. In the fastText model architecture, x1, x2, …, xN represent the n-gram vectors of a text, and each feature is the average of the word vectors. The n-gram is an algorithm based on a language model. Its principle is to slide a window of size N over the text content in byte order, finally forming a sequence of byte fragments of length N. Depending on the granularity, there are character-granularity n-grams and word-granularity n-grams. Using n-grams brings several benefits: better word vectors can be generated for rare words; for out-of-vocabulary words, word vectors can still be constructed from character-level n-grams even though the words do not appear in the training corpus; and n-grams let the model learn part of the local word order, because associating adjacent words in n-gram form preserves word order information during training. To preserve the morphological characteristics inside words, note that words such as "apple" and "apples" share most of their characters, i.e. their internal forms are similar. The fastText of the present application therefore represents words by character-level n-grams to vectorize the token sequence. For "apple" with n = 3, the trigrams are: "<ap", "app", "ppl", "ple", "le>".
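The trigram list above corresponds to the word "apple" wrapped in the boundary markers "<" and ">". A minimal Python sketch reproduces it; the function name `char_ngrams` is an assumption of this example, and the fastText library itself typically combines several n-gram lengths rather than a single n.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Slide a window of n characters over the word wrapped in < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Words with similar internal form share many n-grams.
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(sorted(shared))
```

The same overlap effect applies to symbolized tokens such as $var1 and $var2, which differ only in their final characters.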
Based on the same principle, the method uses fasttext to vectorize the constant sequence of the character string.
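To make the averaging idea concrete, the sketch below builds a vector for a string constant sequence from character n-grams. It is an untrained illustration, not fastText itself: each n-gram is hashed into a bucket, and a fixed pseudo-random vector stands in for the embedding row that training would normally learn. `DIM`, `BUCKETS`, and all function names are hypothetical.

```python
import random
import zlib
from typing import List

DIM = 8          # hypothetical embedding size
BUCKETS = 1000   # hypothetical number of hash buckets


def char_ngrams(word: str, n: int = 3) -> List[str]:
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


def bucket_vector(ngram: str) -> List[float]:
    # Placeholder for a trained embedding row: one fixed pseudo-random
    # vector per hash bucket, so results are reproducible across runs.
    bucket = zlib.crc32(ngram.encode("utf-8")) % BUCKETS
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def mean(vectors: List[List[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def word_vector(word: str) -> List[float]:
    """A word vector is the average of its character n-gram vectors."""
    return mean([bucket_vector(g) for g in char_ngrams(word)])


def sequence_vector(words: List[str]) -> List[float]:
    """A sequence feature is the average of its word vectors."""
    return mean([word_vector(w) for w in words])


vec = sequence_vector(["assert", "a"])
print(len(vec))
```

Because a word vector is assembled from n-gram buckets, a string constant that never appeared during training still receives a vector, which matters for attacker-chosen strings.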
S4: processing the token sequence feature vector and the string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result.
In some embodiments, the webshell detection model comprises:
a deep pyramid convolutional neural network layer, a recurrent neural network layer, and a fully connected layer;
the step of processing the token sequence feature vector and the string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result comprises the following steps:
processing the token sequence feature vector by using the deep pyramid convolutional neural network layer to obtain the processed token sequence feature vector;
processing the string constant sequence feature vector by using the recurrent neural network layer to obtain the processed string constant sequence feature vector;
splicing the processed token sequence feature vector and the processed string constant sequence feature vector to obtain a total feature vector;
and inputting the total feature vector into the fully connected layer to obtain the webshell detection result.
In some embodiments, the recurrent neural network layer consists of a recurrent neural network based on gated recurrent units (GRU) and an attention mechanism.
Specifically, the token sequence is longer than the string constant sequence. Models such as LSTM parallelize poorly and run inefficiently on long sequences, whereas the deep pyramid convolutional neural network (DPCNN) processes long sequences well in parallel, which improves storage efficiency and shortens processing time. Therefore, the token sequence is processed by a DPCNN, and the string constant sequence is processed by a recurrent neural network with gated recurrent units (GRU) and an attention mechanism. The processed token sequence vector and the processed string constant sequence vector are spliced to obtain a total feature vector, which represents the feature vector of the source code. The total feature vector is then classified by the fully connected layer to obtain the webshell detection result.
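The data flow of the two-branch model can be sketched in plain Python. This is a heavily simplified illustration under stated assumptions, not the patented network: the convolutions of the DPCNN and the recurrence of the GRU are omitted, so only the pyramid-style downsampling (max pooling with window 3 and stride 2), the attention-weighted pooling, the splicing, and a single fully connected output with a sigmoid are shown. All names, weights, and inputs are hypothetical placeholders, and both input sequences are assumed to be non-empty.

```python
import math
from typing import List

Vector = List[float]


def pyramid_pool(seq: List[Vector]) -> Vector:
    """Roughly halve the sequence by max pooling (window 3, stride 2) until one vector is left."""
    while len(seq) > 1:
        seq = [[max(column) for column in zip(*seq[i:i + 3])]
               for i in range(0, len(seq) - 1, 2)]
    return seq[0]


def attention_pool(states: List[Vector], query: Vector) -> Vector:
    """Weight each state by softmax(state . query) and return the weighted sum."""
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * state[d] for w, state in zip(weights, states))
            for d in range(len(query))]


def classify(token_vecs: List[Vector], string_vecs: List[Vector],
             query: Vector, fc_weights: Vector, fc_bias: float) -> float:
    # Splice the two branch outputs into the total feature vector.
    total_feature = pyramid_pool(token_vecs) + attention_pool(string_vecs, query)
    # Fully connected layer with one output, squashed to a score in (0, 1).
    logit = sum(w * x for w, x in zip(fc_weights, total_feature)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder inputs and weights, chosen only to make the sketch runnable.
token_vecs = [[0.1, 0.4], [0.3, -0.2], [0.0, 0.9]]
string_vecs = [[1.0, 0.0], [0.0, 1.0]]
score = classify(token_vecs, string_vecs, query=[0.5, 0.5],
                 fc_weights=[0.2, -0.1, 0.3, 0.3], fc_bias=0.0)
print(score)
```

Repeated stride-2 pooling is what gives the pyramid its shape: each stage sees about half as many positions as the previous one, so long token sequences shrink quickly.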
After the token sequence and the string constant sequence are vectorized, the two sequences are each processed by an RNN (LSTM, GRU), the processed vectors are then aggregated, and the result is classified by the fully connected layer.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, one or more embodiments of the present specification further provide a webshell detection apparatus, including:
an analysis module configured to analyze the PHP source code to obtain a token sequence and a character string constant sequence;
a symbolization module configured to symbolize the token sequence to obtain the symbolized token sequence;
a vectorization module configured to vectorize the symbolized token sequence to obtain a token sequence feature vector, and to vectorize the character string constant sequence to obtain a character string constant sequence feature vector;
and a detection module configured to process the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result.
For convenience of description, the above apparatus is described as being divided into modules by function, and the modules are described separately. Of course, when one or more embodiments of the present description are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
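As an illustration of what the symbolization module does, the following sketch maps a token sequence to abstract symbols. It is a sketch under stated assumptions, not the implementation of the embodiment: the token kinds (`"var"`, `"func"`, `"num"`, `"str"`), the symbol names (`VAR_`, `FUNC_`, `NUM`, `STRING_LITERAL`), and the reading of "independent index" as a separate per-name counter for variables and for functions are all hypothetical choices made for the example.

```python
def symbolize(tokens):
    """Map (kind, text) tokens to abstract symbols.

    kind is one of "var", "func", "num", "str" (hypothetical labels).
    Variables and functions are numbered with independent counters, so the
    same name always maps to the same symbol within one source file.
    """
    var_index = {}    # variable name -> index
    func_index = {}   # function name -> index
    out = []
    for kind, text in tokens:
        if kind == "var":
            idx = var_index.setdefault(text, len(var_index))
            out.append(f"VAR_{idx}")
        elif kind == "func":
            idx = func_index.setdefault(text, len(func_index))
            out.append(f"FUNC_{idx}")
        elif kind == "num":
            out.append("NUM")             # all numerical constants share one symbol
        elif kind == "str":
            out.append("STRING_LITERAL")  # all string constants share one symbol
        else:
            out.append(text)              # other tokens pass through unchanged
    return out

example = [
    ("var", "$a"), ("func", "base64_decode"), ("var", "$b"),
    ("func", "eval"), ("var", "$a"), ("num", "42"), ("str", "'payload'"),
]
symbols = symbolize(example)
```

Replacing concrete names with indexed symbols keeps the structure of the code (which variable is reused where) while discarding names that an attacker can change freely.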
Based on the same inventive concept, one or more embodiments of the present specification further provide an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable by the processor, and the processor implements the method of any one of the above embodiments when executing the computer program.
Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present device and other devices. The communication module can communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, one or more embodiments of the present specification also provide a non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to implement the method of any one of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
As can be seen from the foregoing, the webshell detection method, apparatus, electronic device, and storage medium provided in one or more embodiments of the present disclosure use a token embedding method to efficiently extract the code segments in the PHP source code that have undergone string transformation operations, and accurately determine whether the PHP source code is a webshell according to the transformation type. As a result, the method detects obfuscated samples more reliably, reduces the false alarm rate on benign (white) samples, and improves the overall accuracy of webshell detection.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (6)

1. A webshell detection method, comprising:
analyzing the PHP source code to obtain a token sequence and a character string constant sequence, wherein the step of analyzing the PHP source code to obtain the token sequence and the character string constant sequence comprises the following steps:
performing lexical analysis and syntactic analysis on the PHP source code to generate an abstract syntax tree;
traversing the abstract syntax tree and extracting a control flow;
performing control flow analysis on the control flow to obtain an operation sequence;
analyzing the operation sequence to obtain the token sequence and the character string constant sequence; wherein the token sequence comprises: variable names, function names, numerical constants and string constants;
symbolizing the token sequence to obtain the symbolized token sequence, wherein the symbolizing the token sequence includes:
mapping all the variable names into a first identifier and adding a first independent index;
mapping all the function names into a second identifier and adding a second independent index;
mapping all the numerical constants to third identifications;
replacing all of the string constants with stringiLiteral;
vectorizing the symbolized token sequence to obtain a token sequence feature vector; vectorizing the character string constant sequence to obtain a character string constant sequence feature vector;
processing the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result, wherein the webshell detection model comprises:
a deep pyramid convolutional neural network layer, a recurrent neural network layer and a fully connected layer;
the step of processing the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result comprises the following steps:
processing the token sequence feature vector by using the deep pyramid convolutional neural network layer to obtain the processed token sequence feature vector;
processing the character string constant sequence feature vector by using the recurrent neural network layer to obtain the processed character string constant sequence feature vector;
splicing the processed token sequence feature vector and the processed character string constant sequence feature vector to obtain a total feature vector;
and inputting the total feature vector into the fully connected layer to obtain the webshell detection result.
2. The webshell detection method of claim 1, wherein the vectorizing the symbolized token sequence to obtain a token sequence feature vector comprises:
representing the token sequence after the symbolization by n-grams of word granularity using a fasttext method.
3. The webshell detection method of claim 2, wherein the recurrent neural network layer consists of a recurrent neural network based on gated recurrent units (GRU) and an attention mechanism.
4. A webshell detection apparatus, comprising:
the analysis module is configured to analyze the PHP source code to obtain a token sequence and a string constant sequence, where the analyzing the PHP source code to obtain the token sequence and the string constant sequence includes:
performing lexical analysis and syntactic analysis on the PHP source code to generate an abstract syntax tree;
traversing the abstract syntax tree and extracting a control flow;
performing control flow analysis on the control flow to obtain an operation sequence;
analyzing the operation sequence to obtain the token sequence and the character string constant sequence; wherein the token sequence comprises: variable names, function names, numerical constants and string constants;
a symbolization module configured to symbolize the token sequence to obtain the token sequence after symbolization, where the symbolizing the token sequence includes:
mapping all the variable names into a first identifier and adding a first independent index;
mapping all the function names into a second identifier and adding a second independent index;
mapping all the numerical constants to third identifications;
replacing all of the string constants with stringiLiteral;
the vectorization module is configured to vectorize the symbolized token sequence to obtain a token sequence feature vector, and vectorize the character string constant sequence to obtain a character string constant sequence feature vector;
the detection module is configured to process the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result, wherein the webshell detection model comprises:
a deep pyramid convolutional neural network layer, a recurrent neural network layer and a fully connected layer;
the step of processing the token sequence feature vector and the character string constant sequence feature vector by using a webshell detection model to obtain a webshell detection result comprises the following steps:
processing the token sequence feature vector by using the deep pyramid convolutional neural network layer to obtain the processed token sequence feature vector;
processing the character string constant sequence feature vector by using the recurrent neural network layer to obtain the processed character string constant sequence feature vector;
splicing the processed token sequence feature vector and the processed character string constant sequence feature vector to obtain a total feature vector;
and inputting the total feature vector into the fully connected layer to obtain the webshell detection result.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor implements the method of any of claims 1-3 when executing the computer program.
6. A non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium stores computer instructions that, when executed by a computer, cause the computer to implement the method of any of claims 1-3.
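
By way of illustration only, the word-granularity n-gram representation recited in claim 2 can be sketched as follows. The sketch only enumerates the n-grams; fastText additionally hashes such n-grams and learns an embedding for them, which is not reproduced here. The function name `word_ngrams`, the parameter `n_max`, and the symbol names in the example are hypothetical.

```python
def word_ngrams(tokens, n_max=2):
    """Return all word-level n-grams of length 1..n_max, joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical symbolized token sequence.
sequence = ["FUNC_0", "VAR_0", "STRING_LITERAL"]
features = word_ngrams(sequence, n_max=2)
```

Working at word granularity treats each symbol as one unit, so a bigram such as a function symbol followed by a variable symbol captures local token order without splitting symbols into characters.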
CN202110374845.7A 2021-04-08 2021-04-08 Webshell detection method and device, electronic equipment and storage medium Active CN112800427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374845.7A CN112800427B (en) 2021-04-08 2021-04-08 Webshell detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112800427A CN112800427A (en) 2021-05-14
CN112800427B true CN112800427B (en) 2021-09-28

Family

ID=75816512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374845.7A Active CN112800427B (en) 2021-04-08 2021-04-08 Webshell detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112800427B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591074A (en) * 2021-06-21 2021-11-02 北京邮电大学 Webshell detection method and device
CN113315789B (en) * 2021-07-29 2021-10-15 中南大学 Web attack detection method and system based on multi-level combined network
CN113761533A (en) * 2021-09-08 2021-12-07 广东电网有限责任公司江门供电局 Webshell detection method and system
CN114499944B (en) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, device and equipment for detecting WebShell
CN114430348B (en) * 2022-02-07 2023-12-05 云盾智慧安全科技有限公司 Web site search engine optimization backdoor identification method and device
CN114422148B (en) * 2022-03-25 2024-04-09 北京长亭未来科技有限公司 Framework depiction and detection method, device and equipment of Webshell
CN117171053B (en) * 2023-11-01 2024-02-20 睿思芯科(深圳)技术有限公司 Test method, system and related equipment for vectorized programming

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341399A (en) * 2016-04-29 2017-11-10 阿里巴巴集团控股有限公司 Assess the method and device of code file security
CN107612926A (en) * 2017-10-12 2018-01-19 成都知道创宇信息技术有限公司 A kind of a word WebShell hold-up interception methods based on client identification
WO2020000743A1 (en) * 2018-06-27 2020-01-02 平安科技(深圳)有限公司 Webshell detection method and related device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112491882A (en) * 2020-11-27 2021-03-12 泰康保险集团股份有限公司 Webshell detection method, webshell detection device, webshell detection medium and electronic equipment
CN112600797A (en) * 2020-11-30 2021-04-02 泰康保险集团股份有限公司 Method and device for detecting abnormal access behavior, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
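Research on a PHP Webshell detection method based on semantic analysis (基于语义分析的PHP Webshell检测方法研究); Yue Zihan et al.; Communications Technology (《通信技术》); 2020-12-31; Vol. 53 (No. 12); full text *
Research on Webshell detection technology based on semantic analysis (基于语义分析的Webshell检测技术研究); Yi Nan et al.; Journal of Information Security Research (《信息安全研究》); 2017-02-28; Vol. 3 (No. 2); full text *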

Also Published As

Publication number Publication date
CN112800427A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant