CN114722389A

CN114722389A - Webshell file detection method and device, electronic device and readable storage medium

Info

Publication number: CN114722389A
Application number: CN202210251448.5A
Authority: CN
Inventors: 邓巧华; 余燕; 樊颖爽; 代维; 马蔚彦; 林育民
Original assignee: River Security Inc
Current assignee: River Security Inc
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2022-07-08

Abstract

The invention discloses a Webshell file detection method and device, an electronic device and a readable storage medium, and relates to the technical field of computers, in particular to technical fields such as network security and deep learning. The specific implementation scheme is as follows: vectorizing a script to be detected to obtain vectorized data; performing feature extraction processing on the vectorized data to obtain script feature data; and outputting a Webshell file detection result of the script to be detected by utilizing a pre-trained first machine learning model based on the script feature data. Webshell file detection of the script to be detected can thus be realized without any manual operation; the operation is simple and not error-prone, so the reliability of Webshell file detection of the script is improved.

Description

Webshell file detection method and device, electronic device and readable storage medium

Technical Field

The present disclosure relates to the technical field of computers, in particular to technical fields such as network security and deep learning.

Background

With the rapid development of industrial internet business systems, the large number of security risks emerging from the internet poses great challenges to those systems. Current internet applications make service handling convenient, but they also make it convenient for attackers to probe for vulnerabilities and discover attack entry points. The website backdoor (Webshell) is an attack means widely used by attackers because it is stealthy, script-based, flexible, convenient and powerful. Detection of Webshells has therefore become a key point of enterprise security defense, and Webshell detection is a standard feature of host security systems. Improving Webshell detection and identification capabilities can effectively block many potential attacks and greatly improve network security.

Disclosure of Invention

The disclosure provides a Webshell file detection method and device, electronic equipment and a readable storage medium.

According to one aspect of the disclosure, a method for detecting a Webshell file is provided, which includes:

vectorizing the script to be detected to obtain vectorized data;

performing feature extraction processing on the vectorized data to obtain script feature data;

and outputting a Webshell file detection result of the script to be detected by utilizing a pre-trained first machine learning model based on the script feature data.

According to another aspect of the present disclosure, there is provided a Webshell file detection apparatus, including:

the vector unit is used for vectorizing the script to be detected to obtain vectorized data;

the feature unit is used for performing feature extraction processing on the vectorized data to obtain script feature data;

and the output unit is used for outputting the Webshell file detection result of the script to be detected by utilizing the pre-trained first machine learning model based on the script feature data.

According to still another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the aspects and any possible implementation described above.

According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the above-described aspect and any possible implementation.

According to the above technical scheme, the script to be detected is vectorized to obtain vectorized data, and feature extraction processing is then performed on the vectorized data to obtain script feature data, so that a Webshell file detection result of the script to be detected can be output by utilizing the pre-trained first machine learning model based on the script feature data, and the detection result is thereby determined. Webshell file detection of the script to be detected can thus be realized without any manual operation; the operation is simple and not error-prone, so the reliability of Webshell file detection of the script is improved.

In addition, by adopting the technical scheme provided by the disclosure, when the script to be detected is vectorized, a vocabulary sequence is obtained for a compilable script and a first static content feature vector is obtained from it, and the compiled machine execution code is obtained and a dynamic running feature vector is obtained from it. In the subsequent feature extraction processing, the first static content feature vector and the dynamic running feature vector are merged to serve as the script feature data for Webshell file detection. This detection mode effectively combines the semantic features of the script content with the running features of the script's underlying execution mechanism, so the reliability of Webshell file detection for compilable scripts can be effectively improved, and the security of the system is further improved.

In addition, by adopting the technical scheme provided by the disclosure, when the script to be detected is vectorized, a vocabulary sequence is obtained for a non-compilable script and a vectorization sequence is obtained from it; feature extraction processing is then performed on the vectorized data to obtain a second static content feature vector, which is used as the script feature data for Webshell file detection.

In addition, by adopting the technical scheme provided by the disclosure, the user experience can be effectively improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and those skilled in the art can also obtain other drawings according to the drawings without inventive labor. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a block diagram of an electronic device for implementing the Webshell file detection method according to the embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It is to be understood that the described embodiments are only a few, and not all, of the disclosed embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without inventive step, are intended to be within the scope of the present disclosure.

It should be noted that the terminal device involved in the embodiments of the present disclosure may include, but is not limited to, smart devices such as a mobile phone, a Personal Digital Assistant (PDA), a wireless handheld device, and a tablet computer; the display device may include, but is not limited to, a personal computer, a television, and other devices having a display function.

In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

With the rapid development of industrial internet business systems, the large number of security risks emerging from the internet poses great challenges to those systems. Current internet applications make service handling convenient, but they also make it convenient for attackers to probe for vulnerabilities and discover attack entry points. Webshell is an attack means widely used by attackers because it is stealthy, script-based, flexible, convenient and powerful. Detection of Webshells has therefore become a key point of enterprise security defense, and Webshell detection is a standard feature of host security systems. Improving Webshell detection and identification capabilities can effectively block many potential attacks and greatly improve network security.

Traditional Webshell prevention in security scenarios relies on manually extracting features from attack samples to form a specific rule base for matching. As a result, feature extraction is inefficient, the rule base is cumbersome to maintain, the protection against unknown samples is poor, and there are many false positives and false negatives.

Therefore, there is an urgent need for a Webshell file detection method that can effectively improve the reliability of Webshell file detection for scripts and thereby improve the security of the system.

The Webshell file detection method provided by the disclosure uses a deep-learning-based approach to learn from a large number of samples, capture implicit relationships among the data features, and build a relatively complex mathematical model for inferring on unknown data. This reduces the cost of manually maintaining a large number of rules and allows unknown samples to be identified effectively.

The technical scheme provided by the disclosure can be applied to various application scenarios in which scripts need to be checked for Webshell files. For script files of different language types, Webshell file detection is mainly performed, for compilable scripts, on scripts of language types such as the Hypertext Preprocessor (PHP) type and the Java Server Pages (JSP) type, and, for non-compilable scripts, on scripts of other language types such as the Active Server Pages (ASP) type and the ASPX type.

Based on the application scenario, the hardware operating environment corresponding to the technical solution provided by the present disclosure may be a hardware device with data processing and data storage capabilities, and is not particularly limited in the embodiment of the present disclosure.

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps.

101. Vectorize the script to be detected to obtain vectorized data.

102. Perform feature extraction processing on the vectorized data to obtain script feature data.

103. Output a Webshell file detection result of the script to be detected by utilizing a pre-trained first machine learning model based on the script feature data.

The first machine learning model may be a machine learning model corresponding to a language type used by the script to be detected, and is used for learning script feature data to determine a detection result.

In this way, a Webshell file detection result is obtained for the script to be detected, that is, it is determined whether the script to be detected is a Webshell file. Security defense measures can then be executed according to the detection result, which improves the security of the system.
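Steps 101 to 103 can be read as a three-stage pipeline. The following is a minimal sketch of that structure, not the disclosed implementation: the names `detect_webshell`, `vectorize`, `extract_features` and `model`, the score range, and the 0.5 threshold are all assumptions introduced for illustration, and the toy stand-ins at the bottom only exist so that the sketch runs.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def detect_webshell(
    script: str,
    vectorize: Callable[[str], Vector],            # step 101 (hypothetical stage)
    extract_features: Callable[[Vector], Vector],  # step 102 (hypothetical stage)
    model: Callable[[Vector], float],              # step 103: assumed to return a score in [0, 1]
    threshold: float = 0.5,                        # assumed decision threshold
) -> bool:
    """Return True if the script is judged to be a Webshell file."""
    vectorized = vectorize(script)
    features = extract_features(vectorized)
    return model(features) >= threshold


# Toy stand-ins; a real system would plug in trained models here.
def toy_vectorize(script: str) -> Vector:
    return [float(script.count("eval(")), float(len(script))]


def toy_extract(vector: Vector) -> Vector:
    return [min(vector[0], 1.0)]


def toy_model(features: Vector) -> float:
    return features[0]
```

Keeping the three stages as separate callables mirrors the vector unit, feature unit and output unit of the apparatus described earlier.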

It should be noted that part or all of the execution subjects of 101 to 103 may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a security defense processing platform on the network side, which is not particularly limited in this embodiment.

It is to be understood that the application may be a native application (native app) installed on the local terminal, or may also be a web application (web app) running in a browser on the local terminal, which is not limited in this embodiment.

Therefore, the script to be detected is vectorized to obtain vectorized data, and feature extraction processing is then performed on the vectorized data to obtain script feature data, so that the Webshell file detection result of the script to be detected can be output by utilizing the pre-trained first machine learning model based on the script feature data, and the detection result is thereby determined. Webshell file detection of the script to be detected can thus be realized without any manual operation; the operation is simple and not error-prone, so the reliability of Webshell file detection of the script is improved.

Optionally, in a possible implementation manner of this embodiment, in 101, the script to be detected may be a script file obtained in a network detection environment, or may be a script file that a user uploads and needs to perform Webshell file detection, which is not particularly limited in this embodiment.

Optionally, in a possible implementation manner of this embodiment, in 101, the language type of the language used by the script to be detected may be referred to as the language type of the script to be detected, and may specifically be Hypertext Preprocessor (PHP), Java Server Pages (JSP), Active Server Pages (ASP), ASPX, and the like.

In the embodiment of the present disclosure, in one implementation, it may mainly be determined whether the script to be detected is a compilable script, that is, whether it is a PHP script, that is, whether it can be compiled into an operation code (OPCODE). For a non-compilable script, it is further necessary to distinguish which language type the script file belongs to.

In the embodiment of the present disclosure, in another implementation, determining whether the script to be detected is a compilable script may also consist of determining whether it is a PHP script or a JSP script, that is, whether it can be compiled into an operation code (OPCODE) or a bytecode (BYTECODE). For a non-compilable script, the specific language type of the script file is further distinguished.

In general, the language type can be distinguished simply by the suffix of the file name. Common suffixes of Webshell files mainly include php, jsp, asp, aspx and the like, and each suffix corresponds directly to the language type of the script.

According to the language type of the script to be detected, its script type can be determined, namely a compilable script or a non-compilable script.

For example, the compilable script may be an executable PHP script, and when the PHP script is parsed by a compiler, the script is lexically and grammatically parsed and then compiled into an OPCODE execution code for execution.

After the script type of the script to be detected is determined, 101-103 can be further executed according to the determined script type of the script to be detected.
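The suffix-based determination of language type and script type described above can be sketched as follows. This is an illustrative sketch only: the names `LANGUAGE_BY_SUFFIX`, `COMPILABLE_LANGUAGES` and `classify_script` are hypothetical, and treating both PHP and JSP as compilable follows the second of the two implementations described above.

```python
from pathlib import Path
from typing import Tuple

# Suffix -> language type, following the suffixes listed in the text.
LANGUAGE_BY_SUFFIX = {
    ".php": "PHP",
    ".jsp": "JSP",
    ".asp": "ASP",
    ".aspx": "ASPX",
}

# Assumption: PHP and JSP are treated as compilable (OPCODE / BYTECODE);
# the first implementation in the text would list PHP only.
COMPILABLE_LANGUAGES = {"PHP", "JSP"}


def classify_script(path: str) -> Tuple[str, bool]:
    """Return (language type, is_compilable) for a script file path."""
    suffix = Path(path).suffix.lower()
    language = LANGUAGE_BY_SUFFIX.get(suffix)
    if language is None:
        raise ValueError(f"unsupported script suffix: {suffix!r}")
    return language, language in COMPILABLE_LANGUAGES
```

The boolean in the result would then select between the first vectorization processing (compilable scripts) and the second vectorization processing (non-compilable scripts).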

Optionally, in a possible implementation manner of this embodiment, in 101, if the script to be detected is a compilable script, first vectorization processing may be performed on the script to be detected to obtain a first static content feature vector and a dynamic running feature vector of the script to be detected.

In a specific implementation process, word segmentation processing may be specifically performed on the script to be detected to obtain a first vocabulary sequence of the script to be detected, and then content statistics processing may be performed on the first vocabulary sequence based on a specific vocabulary to obtain content statistics parameters of the script to be detected. And then, according to the content statistical parameters, obtaining a first static content characteristic vector of the script to be detected by using a pre-trained second machine learning model.

The second machine learning model may be a machine learning model corresponding to a language adopted by the script to be detected, and is used for learning content statistical parameters of the script to be detected and outputting a static content feature vector.

Specifically, the word segmentation processing performed on the script to be detected mainly includes a word segmentation operation on its code content.

Because some Webshell files need to evade detection by certain defense mechanisms, obfuscation or encryption techniques may be used. These make the code look cluttered, and some detection methods consequently identify such a file as normal because it cannot be matched against their feature library.

Taking a large PHP Webshell (a so-called "big horse") as an example, it needs to implement a great many functions, such as connecting to databases, directory traversal, file viewing, file modification, executing shell commands, privilege escalation, and so on. A normal PHP application, by contrast, may contain a global configuration file, and its index file may contain only some simple configuration and routing, so in many cases a PHP Webshell file may be larger than a normal PHP file.

Therefore, according to the script to be detected, statistical processing can be performed on the content of the script to be detected based on specific words such as sensitive keywords, specific operation keywords and the like, so that some content statistical parameters can be obtained.

For example, the risk level of the script to be detected can be reflected by counting whether the script to be detected contains sensitive keywords, such as eval, shell_exec, and the like.

Or, for another example, the degree of obfuscation of the script to be detected can be reflected by measuring whether obfuscation or encryption exists in the script to be detected, using parameters such as an entropy value.

Or, for another example, the degree of encryption of the script to be detected can be reflected by measuring the word lengths in the script to be detected, using parameters such as the longest string length.

After the content statistical parameters of the script to be detected are obtained, the feature vector of the script to be detected can be obtained based on the content statistical parameters, and can be used for representing semantic features of script content.

Taking into account the effects of the various obfuscation and encryption variants, the content statistical parameters may include, but are not limited to, at least one of the following parameters:

the maximum string length, the entropy value, the number of occurrences of command execution keywords (e.g. eval, cmd_exec, etc.), the number of occurrences of encoding and decoding keywords (e.g. base64_encode, etc.), the number of occurrences of file operation keywords (e.g. fopen, opendir, etc.), the number of occurrences of compression operation keywords (e.g. gzcompress, etc.), the number of occurrences of character operation keywords (e.g. chr, ord, etc.), the number of occurrences of string operation keywords (e.g. str_replace, etc.), and the number of occurrences of other operation keywords (e.g. pack, etc.).
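The content statistical parameters listed above could be computed along the following lines. This is a sketch under stated assumptions, not the disclosed implementation: the keyword groups contain only the examples given in the text, the entropy is taken to be character-level Shannon entropy, the maximum string length is taken to be the longest whitespace-delimited run, and the names `KEYWORD_GROUPS`, `shannon_entropy` and `content_statistics` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict

# Keyword groups built from the examples in the text; real lists would be longer.
KEYWORD_GROUPS = {
    "command_execution": ("eval", "cmd_exec", "shell_exec"),
    "encoding_decoding": ("base64_encode",),
    "file_operation": ("fopen", "opendir"),
    "compression": ("gzcompress",),
    "character_operation": ("chr", "ord"),
    "string_operation": ("str_replace",),
    "other_operation": ("pack",),
}


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed entropy definition)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def content_statistics(source: str) -> Dict[str, float]:
    """Compute the content statistical parameters of one script."""
    # Count identifier-like tokens so that "chr" does not match inside "chrome".
    identifiers = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
    stats = {
        # Assumption: "string" means a whitespace-delimited run of characters.
        "max_string_length": float(max((len(w) for w in source.split()), default=0)),
        "entropy": shannon_entropy(source),
    }
    for group, keywords in KEYWORD_GROUPS.items():
        stats[group] = float(sum(identifiers[k] for k in keywords))
    return stats
```

Heavily encoded payloads tend to raise both the entropy and the maximum string length, which is why those two parameters sit alongside the keyword counts.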

In this implementation process, further, after the content statistical parameter of the script to be detected is obtained, the first static content feature vector of the script to be detected may be obtained according to the content statistical parameter, specifically, by using a pretrained Deep Neural Network (DNN) model. The DNN model may be a machine learning model corresponding to a language used by the script to be detected, and is used to learn content statistical parameters of the script to be detected, so as to output a static content feature vector.

Specifically, the obtained content statistical parameters of the script to be detected may be standardized. For example, the mean of each content statistical parameter is subtracted and the result is divided by the standard deviation, so that all parameters are centered near 0 with a variance of 1, where the mean and the variance are calculated from the data set of training samples. The standardized content statistical parameters are then input into the pre-trained DNN model, which outputs a specific feature vector as the first static content feature vector of the script to be detected.
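The standardization step can be illustrated with a short sketch. It assumes the usual z-score form (subtract the training-set mean, divide by the training-set standard deviation); the function names `fit_standardizer` and `standardize` are hypothetical, and replacing a zero standard deviation with 1.0 is an added safeguard that the text does not mention.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def fit_standardizer(
    training_rows: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Compute the per-parameter mean and standard deviation on the training set."""
    columns = list(zip(*training_rows))
    means = [fmean(column) for column in columns]
    # Guard (assumption): a constant column has std 0, so fall back to 1.0.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def standardize(
    row: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> List[float]:
    """Apply the training-set statistics to one script's parameters."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]
```

The statistics are fitted once on the training samples and reused unchanged at detection time, so that a script to be detected is scaled in the same way as the data the DNN model was trained on.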

Therefore, passing the content statistical parameters of the script to be detected through the DNN model yields a static content feature vector that more effectively reflects the internal relations among those parameters.

In another specific implementation process, the script to be detected may be specifically compiled to obtain a machine execution code of the script to be detected, and then the machine execution code may be vectorized to obtain a vectorization sequence of the machine execution code. Then, according to the vectorization sequence of the machine execution code, a pre-trained third machine learning model is used to obtain the dynamic operation feature vector of the script to be detected.

The third machine learning model may be a machine learning model corresponding to a language adopted by the script to be detected, and is used for learning a vectorization sequence of the machine execution code to realize output of the dynamic operation feature vector.

Taking the PHP script as an example, the PHP script may be compiled by a compiler, and an Operation Code (OPCODE) after the compiling may be extracted. Wherein, the OPCODE is a machine execution code of the PHP script. Further, OPCODEs are arranged according to the execution order of the PHP script to obtain an OPCODE sequence, i.e., a machine execution code sequence.

In the implementation process, the machine execution code obtained by compiling effectively captures the internal running logic of the script to be detected, which reduces the influence that techniques such as obfuscation and encoding have on the basic features of the script to be detected.

In the implementation process, after the machine execution code of the script to be detected is obtained, a Word Embedding (Word Embedding) mode may be adopted to perform vectorization processing on the machine execution code so as to obtain a vectorization sequence of the machine execution code.

Specifically, a Word vector (Word2vec) model can be adopted to realize a Word Embedding (Word Embedding) mode.

The Word2vec model is one word embedding method, that is, a process of converting words into structured vectors that can be computed on. Once a Word2vec model is fully trained, its word vectors are found to be meaningful: for example, words with similar semantics lie a relatively short Euclidean distance apart, and adding or subtracting the word vectors of some words can yield the vector of another word. Therefore, vectorizing the machine execution code with the Word2vec model gives the machine execution code an abstract mathematical representation, which in turn enables more effective feature extraction from the execution code of the script to be detected.
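To make the lookup step concrete, the sketch below maps a machine execution code sequence to a vectorization sequence and compares two vectors by Euclidean distance. It does not train Word2vec: the table `TOY_EMBEDDINGS` holds hand-written three-component vectors standing in for a trained model, the opcode-like names are placeholders, and the zero vector for unseen codes is an assumption.

```python
import math
from typing import Dict, List, Sequence

# Hand-written toy vectors standing in for a trained Word2vec table (assumption).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "INIT_FCALL": [0.9, 0.1, 0.0],
    "DO_FCALL": [0.8, 0.2, 0.1],
    "ECHO": [0.0, 0.9, 0.3],
}
UNKNOWN_VECTOR = [0.0, 0.0, 0.0]  # assumed fallback for codes not in the table


def embed_sequence(codes: Sequence[str]) -> List[List[float]]:
    """Replace each machine execution code with its embedding vector."""
    return [TOY_EMBEDDINGS.get(code, UNKNOWN_VECTOR) for code in codes]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

In this toy table the two call-related codes were deliberately placed close together, which imitates the property that semantically similar words end up a short distance apart after training.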

In the implementation process, after the vectorization sequence of the machine execution code is obtained, the dynamic running feature vector of the script to be detected can be obtained by using a pre-trained Recurrent Neural Network (RNN) model according to the vectorization sequence of the machine execution code. The RNN model may specifically be a machine learning model corresponding to a language adopted by the script to be detected, and is used to learn a vectorization sequence of the machine execution code, so as to output the dynamic operation feature vector.

Specifically, the obtained vectorization sequence of the machine execution code may be normalized, for example by a padding operation or a truncation operation according to the sequence length. The normalized vectorization sequence is then input into the pre-trained RNN model, which outputs a specific feature vector as the dynamic running feature vector of the script to be detected.
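The padding or truncation operation can be sketched as below. The function name `pad_or_truncate`, padding at the end of the sequence, and the choice of padding vector are assumptions; the text only states that padding or truncation is applied according to the sequence length.

```python
from typing import List, Sequence


def pad_or_truncate(
    sequence: Sequence[Sequence[float]],
    length: int,
    pad_vector: Sequence[float],
) -> List[List[float]]:
    """Return a copy of `sequence` with exactly `length` vectors."""
    if len(sequence) >= length:
        # Truncation: keep the first `length` vectors.
        return [list(v) for v in sequence[:length]]
    # Padding: append copies of the padding vector (assumed to go at the end).
    padding = [list(pad_vector) for _ in range(length - len(sequence))]
    return [list(v) for v in sequence] + padding
```

A fixed length lets sequences from scripts of very different sizes be batched into the same model input shape.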

Therefore, passing the vectorization sequence of the machine execution code through the RNN model yields a dynamic running feature vector that more effectively reflects the overall characteristics of the machine execution code of the script to be detected.

Specifically, a Long Short-Term Memory (LSTM) model, which is a special kind of RNN model, may be used.

The LSTM model is commonly used in natural language processing and is mainly used to process time series data. In a traditional neural network model, the layers from the input layer through the hidden layer to the output layer are fully connected to one another, while the nodes within each layer are not connected. Such a network handles sequences poorly. For example, predicting the next word in a sentence typically requires the previous words, because the words in a sentence are not independent of one another; the current output of a sequence is related to the previous outputs. In a recurrent network, this is expressed as follows: the network memorizes previous information and applies it to the calculation of the current output. That is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. The LSTM algorithm additionally adds storage for a long-distance state, which ensures more effective feature acquisition from long sequence data. Therefore, the LSTM model can acquire the overall characteristics of the machine execution code sequence more effectively.
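The recurrence described above, in which the hidden layer receives both the current input and its own output from the previous time step, can be sketched with a plain recurrent cell. Note that this is a simple recurrent step, not an LSTM: an LSTM adds gates and a separate long-term cell state on top of this idea. The names `rnn_step` and `run_rnn`, the tanh activation, and the weights used in any example call are assumptions for illustration.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def rnn_step(
    x: Sequence[float],
    h_prev: Sequence[float],
    w_in: Matrix,
    w_rec: Matrix,
    bias: Sequence[float],
) -> List[float]:
    """One time step: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[j], x))
            + sum(w * hi for w, hi in zip(w_rec[j], h_prev))
            + bias[j]
        )
        for j in range(len(bias))
    ]


def run_rnn(
    sequence: Sequence[Sequence[float]],
    w_in: Matrix,
    w_rec: Matrix,
    bias: Sequence[float],
) -> List[float]:
    """Feed a whole sequence and return the final hidden state."""
    h = [0.0] * len(bias)  # initial hidden state
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
    return h
```

Because each step reuses the previous hidden state, feeding the same vectors in a different order generally produces a different final state, which is what allows the model to capture the execution order of the machine execution code.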

In another specific implementation process, after obtaining the first static content feature vector and the dynamic running feature vector, in 102, specifically, the first static content feature vector and the dynamic running feature vector may be subjected to feature merging processing to serve as the script feature data.

In the implementation process, the static content feature vector obtained through the DNN model and the dynamic running feature vector obtained through the LSTM model are combined to be used as the integral feature vector of the script to be detected. For example, the static content feature of each script to be detected is a vector with 128 components, and the dynamic running feature is a vector with 100 components, then the overall feature of the script to be detected obtained after merging is a vector with 228 components.
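The feature merging in the example above amounts to concatenating the two vectors. A minimal sketch, in which the name `merge_features` is hypothetical and plain concatenation is assumed to be the merging operation:

```python
from typing import List, Sequence


def merge_features(
    static_vector: Sequence[float], dynamic_vector: Sequence[float]
) -> List[float]:
    """Concatenate the static content and dynamic running feature vectors."""
    return list(static_vector) + list(dynamic_vector)


# Dimensions from the example in the text: 128 + 100 = 228 components.
overall = merge_features([0.0] * 128, [1.0] * 100)
```

The static components keep their positions at the front of the merged vector, so the first machine learning model always sees the two feature groups in a fixed layout.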

In this implementation manner, when the script to be detected is vectorized, a vocabulary sequence is obtained for the compilable script and a first static content feature vector is obtained from it, and the machine execution code produced by compilation is obtained and a dynamic running feature vector is obtained from it. In the subsequent feature extraction processing, the first static content feature vector and the dynamic running feature vector are merged to serve as the script feature data for Webshell file detection.

The technical scheme provided by this implementation manner has advantages such as high accuracy and being difficult to bypass.

Optionally, in a possible implementation manner of this embodiment, in 101, if the script to be detected is a non-compilable script, second vectorization processing may be performed on the script to be detected, so as to obtain a vectorization sequence of the script to be detected.

In a specific implementation process, word segmentation processing may be specifically performed on the script to be detected to obtain a second vocabulary sequence of the script to be detected, and then the second vocabulary sequence may be vectorized to obtain a vectorized sequence of the script to be detected.

Specifically, the word segmentation processing performed on the script to be detected mainly includes a word segmentation operation on its code content. Because code content differs from ordinary text, a specific regular-expression-based approach may be adopted to preprocess the code content of the script to be detected into a vocabulary sequence.
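One plausible way to preprocess code content into a vocabulary sequence is a regular-expression tokenizer. The sketch below is only an illustration: the pattern `TOKEN_PATTERN` and the function name `tokenize` are assumptions, since the text does not specify the expressions actually used.

```python
import re
from typing import List

# Assumed token classes: identifiers, integer literals, and single
# punctuation characters; whitespace is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")


def tokenize(source: str) -> List[str]:
    """Split script code content into a vocabulary sequence."""
    return TOKEN_PATTERN.findall(source)
```

Unlike splitting on whitespace, this keeps function names such as `eval` as separate words even when they are written directly against parentheses or quotes.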

Furthermore, according to a pre-established script feature vocabulary table, the feature vocabulary in the script feature vocabulary table is utilized to extract effective vocabulary in the obtained vocabulary sequence and filter ineffective vocabulary in the obtained vocabulary sequence, so that the code content of the script to be detected is preprocessed into the vocabulary sequence, namely the second vocabulary sequence of the script to be detected.

For example, a script feature vocabulary may be specifically constructed by using a Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, and part of vocabulary noise in data may be filtered, so that more effective data features may be obtained based on the script feature vocabulary.

The TF-IDF algorithm is mainly used to evaluate the importance of a word to one document in a corpus. TF (term frequency) is the number of times the word appears in the document, and IDF (inverse document frequency) reflects the importance of the word across the whole corpus. For example, if a word such as "and" appears in almost all documents, the word is not distinctive and can be filtered out of the script feature vocabulary. When word segmentation processing is performed on the script to be detected, the pre-constructed script feature vocabulary is used to retain the words that appear in it, and words that do not appear in it are replaced by a specific mark, such as "UN" or "UNKNOWN".
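Building a script feature vocabulary with TF-IDF and then marking out-of-vocabulary words can be sketched as follows. Several details are assumptions rather than part of the text: TF is normalized by document length, IDF is the unsmoothed log(N / document frequency), each word is scored by its maximum TF-IDF over all documents, and the names `build_feature_vocabulary` and `apply_vocabulary` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def build_feature_vocabulary(documents: Sequence[Sequence[str]], top_k: int) -> Set[str]:
    """Keep the top_k words by maximum TF-IDF score across the corpus."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    scores: Counter = Counter()
    for doc in documents:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])  # 0 if the word is in every document
            scores[word] = max(scores[word], tf * idf)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Words with a zero score (present in every document) are never kept.
    return {w for w in ranked[:top_k] if scores[w] > 0}


def apply_vocabulary(
    tokens: Sequence[str], vocabulary: Set[str], unknown: str = "UNKNOWN"
) -> List[str]:
    """Keep in-vocabulary words and replace the rest with the unknown mark."""
    return [t if t in vocabulary else unknown for t in tokens]
```

With this scoring, a word that occurs in every training script gets an IDF of zero and drops out of the vocabulary, matching the "and" example in the text.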

In the implementation process, after the second vocabulary sequence of the script to be detected is obtained, a word embedding mode may be adopted to perform vectorization processing on the second vocabulary sequence, so as to obtain the vectorized sequence of the script to be detected.

Thus, each word in the vocabulary sequence is converted into a corresponding vector through a pre-trained Word2vec model, and the vocabulary sequence is converted into vectorized data. Vectorizing the vocabulary sequence through the Word2vec model gives it an abstract mathematical representation, which in turn enables more effective extraction of the semantic features of the script to be detected.
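The lookup step of this vectorization can be sketched as follows. This is only a stand-in: the table here is filled with seeded random placeholder vectors, whereas the disclosure uses vectors from a pre-trained Word2vec model. The function names, the dimension `dim`, and the zero vector for the `UNKNOWN` mark are assumptions.

```python
import random


def make_embedding_table(vocabulary, dim=8, seed=0):
    """Build a word -> vector table.

    Placeholder only: a real system would load vectors from a pre-trained
    Word2vec model instead of drawing random numbers.
    """
    rng = random.Random(seed)
    table = {
        word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in sorted(vocabulary)
    }
    table["UNKNOWN"] = [0.0] * dim  # assumed vector for the unknown mark
    return table


def vectorize(tokens, table):
    """Convert a vocabulary sequence into a sequence of vectors."""
    unknown = table["UNKNOWN"]
    return [table.get(token, unknown) for token in tokens]
```

Each occurrence of the same word maps to the same vector, so the output is a matrix with one row per token and `dim` columns.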

In another specific implementation process, after the vectorization sequence of the script to be detected is obtained, in 102, feature extraction processing may be specifically performed on the vectorization sequence of the script to be detected, so as to obtain a second static content feature vector of the script to be detected, which is used as the script feature data.

Specifically, according to the vectorization sequence of the script to be detected, semantic abstraction processing may be performed by using another pre-trained machine learning model or a feature module in the first machine learning model, so as to obtain the second static content feature vector of the script to be detected.

In this implementation, for a non-compilable script, vectorization processing yields a vocabulary sequence and then a vectorized sequence of the script to be detected. Feature extraction processing is then performed on the vectorized data to obtain the second static content feature vector of the script to be detected, which serves as the script feature data for Webshell file detection.

The technical scheme provided by the implementation mode has the advantages of flexible characteristics, low maintenance cost, low false alarm rate and the like.

When the technical schemes provided by the two implementations are used in combination, Webshell file detection can be effectively completed for scripts to be detected in various language types.

Optionally, in a possible implementation manner of this embodiment, in 103, a Webshell file detection result of the script to be detected may be output by using a pre-trained Deep Neural Network (DNN) model specifically based on the script feature data. The DNN model may be a machine learning model corresponding to a language used by the script to be detected, and is used to learn script feature data of the script to be detected, so as to determine a detection result.

In a specific implementation process, if the script to be detected is a compilable script, the merged feature vector is input into a fully connected DNN model, which determines the final prediction result and thereby completes the Webshell file detection of the compilable script. This fusion detection mode integrates the advantages of the different features and improves the detection precision for Webshell files.
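A minimal sketch of this fusion step is given below, assuming that merging means concatenation and that the fully connected DNN has one hidden layer with ReLU and a sigmoid output. The layer sizes, the parameter dictionary `params`, and the function names are hypothetical.

```python
import math


def merge_features(static_vec, dynamic_vec):
    """Merge the two feature vectors (assumed here to be concatenation)."""
    return list(static_vec) + list(dynamic_vec)


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def predict(static_vec, dynamic_vec, params):
    """Return an estimated probability that the script is a Webshell."""
    x = merge_features(static_vec, dynamic_vec)
    hidden = relu(dense(x, params["w1"], params["b1"]))
    (logit,) = dense(hidden, params["w2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))
```

In practice the weights in `params` would come from training; the sketch only shows how the two feature vectors enter a single classifier.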

Optionally, in a possible implementation manner of this embodiment, in 103, a Webshell file detection result of the script to be detected may be output by using a pre-trained Convolutional Neural Network (CNN) model based on the script feature data. The CNN model may be a machine learning model corresponding to a language used by the script to be detected, and is used to learn script feature data of the script to be detected, so as to determine a detection result.

In a specific implementation process, if the script to be detected is a non-compilable script, the obtained static content feature vector is input into a fully connected network in the CNN model, which determines the final prediction result and thereby completes the Webshell file detection of the non-compilable script. The internal semantic features of the script to be detected are automatically extracted through a plurality of configured convolution kernels, and the final determination result is output as the Webshell file detection result of the script to be detected.

Specifically, standardization processing may be performed on the obtained static content feature vector, for example, a padding operation or a truncation operation according to the length of the vocabulary sequence. The standardized static content feature vector is then input into the pre-trained CNN model, which outputs the final determination result as the detection result of the script to be detected.
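The padding or truncation operation can be sketched as below. The function name, the target length `max_len`, and padding at the tail with a caller-supplied `pad_vector` are assumptions; the disclosure does not state where padding is applied.

```python
def pad_or_truncate(sequence, max_len, pad_vector):
    """Bring a sequence of vectors to exactly max_len rows.

    Longer sequences are cut at the tail; shorter ones are padded at the
    tail with copies of pad_vector.
    """
    fixed = [list(vec) for vec in sequence[:max_len]]
    fixed.extend(list(pad_vector) for _ in range(max_len - len(fixed)))
    return fixed
```

A fixed length is needed because the fully connected part of a CNN expects inputs of a constant shape.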

In the embodiment, the script to be detected is vectorized to obtain vectorized data, and then the vectorized data is subjected to feature extraction processing to obtain script feature data, so that the Webshell file detection result of the script to be detected can be output by using the pre-trained first machine learning model based on the script feature data, the detection result is judged, the Webshell file detection of the script to be detected can be realized without any manual operation, the operation is simple, errors are not easy to occur, and the reliability of the Webshell file detection of the script is improved.

The machine learning models employed in the embodiments of the present disclosure, i.e., the first machine learning model, the second machine learning model, and the third machine learning model, may be trained by using a script training data set.

Specifically, a data set of training samples may be acquired first. The data set of the training sample can include normal scripts and Webshell scripts (i.e., script data containing Webshell files).

For scripts of different language types, the data set of the training sample is used as a training data set of different machine learning models, for example, a normal compiled PHP script, a Webshell compiled PHP script, a normal compiled JSP script, a Webshell JSP script and the like.

Using the vectorization processing methods described in the foregoing embodiments for the different types of scripts to be detected, the corresponding vectorization processing and feature extraction processing are performed on each training sample in the data set of training samples. The processed data is then used as the input of each machine learning model, and the machine learning models are jointly trained according to the label of each training sample.

For the training process, the hyper-parameters of the deep learning network can be set, and the hyper-parameters of each machine learning model are automatically adjusted by an automatic parameter tuning algorithm.

The adjustment of the hyper-parameters has an important influence on the model training effect. Hyper-parameters are mainly set manually, and the quality of the setting is mostly determined by experience. For operators who do not know the meaning of the model parameters or have little experience, automatically obtaining the optimal parameters of the model through the automatic parameter tuning algorithm allows the model training process to be completed more efficiently and the optimal model to be obtained.

The hyper-parameters include, for example, the batch size, the learning rate, and the regularization coefficient. The batch size affects the computational efficiency of the model, and the range of suitable batch sizes is mainly related to the convergence rate and the stochastic gradient noise. The learning rate is an important hyper-parameter in supervised learning and deep learning, and determines whether and when the objective function converges to a local minimum. The purpose of regularization is to make the weights decay toward smaller values, which to some extent mitigates overfitting, that is, the situation in which the model performs well on the training set but poorly on the test set. Therefore, for a given data set of training samples, the model performs well only with a suitable combination of hyper-parameters.
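How the learning rate and the regularization coefficient act on the weights can be illustrated with a single gradient step. The sketch assumes plain stochastic gradient descent with an L2 penalty; the disclosure does not name the optimizer or the form of regularization, and the function name is hypothetical.

```python
def sgd_step(weights, grads, learning_rate, l2_coefficient):
    """One gradient step with L2 regularization.

    The penalty (l2_coefficient / 2) * w**2 adds l2_coefficient * w to each
    gradient, which pulls every weight toward zero on each update.
    """
    return [
        w - learning_rate * (g + l2_coefficient * w)
        for w, g in zip(weights, grads)
    ]
```

With a zero data gradient, each weight shrinks by the factor `1 - learning_rate * l2_coefficient` per step, which is the weight decay mentioned above.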

Because the number of adjustable hyper-parameters in the model is large, the adjustment range of each hyper-parameter is large, and manual parameter adjustment is time-consuming, the parameters need to be automatically adjusted through an automatic parameter adjustment algorithm.
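One simple form such an automatic tuning algorithm could take is random search, sketched below. Random search is an assumption chosen for brevity, since the disclosure does not name the algorithm; the search space values and the `evaluate` callback (expected to train a model and return a validation score) are hypothetical.

```python
import random

# Hypothetical search space; the disclosure gives no concrete ranges.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "l2_coefficient": [0.0, 1e-5, 1e-4],
}


def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample hyper-parameter combinations and keep the best-scoring one.

    evaluate(params) should train a model with params and return a score
    where higher is better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Other strategies, such as grid search or Bayesian optimization, fit the same interface by changing only how `params` is proposed.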

Through the automatic adjustment processing, the machine learning models that perform best on the test set can be output. Performance on the test set is judged by the accuracy and the recall rate of detection on the test set, which reduces overfitting of the model.
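The two evaluation metrics can be computed as in the following sketch, assuming binary labels where 1 denotes a Webshell script and 0 a normal script. The function name and the label encoding are assumptions.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels, with 1 = Webshell."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = correct / len(y_true)
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall
```

Recall matters in this setting because a missed Webshell (a false negative) leaves a backdoor undetected.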

The technical scheme provided by the present disclosure can be applied to a Webshell file detection scenario for a script to be detected, as shown in FIG. 2.

201. Acquire the script to be detected.

202. Determine, according to the language type of the script to be detected, whether the script to be detected is a compilable script. If the script to be detected is a compilable script, execute 203 and 206 respectively; if the script to be detected is a non-compilable script, execute 211.

203. Perform word segmentation on the script to be detected to obtain a first vocabulary sequence of the script to be detected.

204. Perform content statistical processing on the first vocabulary sequence based on the specific vocabulary to obtain content statistical parameters of the script to be detected.

205. Obtain a first static content feature vector of the script to be detected by using the pre-trained DNN model according to the content statistical parameters of the script to be detected.

206. Compile the script to be detected to obtain the machine execution code of the script to be detected.

207. Vectorize the machine execution code by using the word vector model to obtain a vectorized sequence of the machine execution code.

208. Obtain the dynamic running feature vector of the script to be detected by using the pre-trained LSTM model according to the vectorized sequence of the machine execution code.

209. Perform feature merging processing on the first static content feature vector and the dynamic running feature vector to obtain the script feature data of the script to be detected.

210. Output a Webshell file detection result of the script to be detected by using the pre-trained DNN model based on the script feature data of the script to be detected.

211. Perform word segmentation on the script to be detected to obtain a second vocabulary sequence of the script to be detected.

212. Vectorize the second vocabulary sequence by using a word vector model to obtain a vectorized sequence of the script to be detected.

213. Input the vectorized sequence of the script to be detected into a pre-trained CNN model, perform feature extraction processing and result determination processing, and output a Webshell file detection result of the script to be detected.

Specifically, feature extraction processing may be performed on the vectorized sequence by a feature module (such as a CNN module) in the CNN model to obtain a second static content feature vector of the script to be detected, which is used as the script feature data of the script to be detected. Then, based on the script feature data, a classification module (such as a fully connected network) in the CNN model performs result determination processing and outputs the Webshell file detection result of the script to be detected.
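The split between the feature module and the classification module can be sketched in a few lines. The sketch assumes one-dimensional convolution over the token axis with ReLU and max-over-time pooling, followed by a single fully connected output unit; the kernel shapes, the pooling choice, and the function names are assumptions, since the disclosure does not detail the CNN architecture.

```python
import math


def conv1d_max_pool(sequence, kernel, bias=0.0):
    """Slide one kernel over the token axis, apply ReLU, then take the maximum.

    sequence: list of token vectors; kernel: list of rows, one per covered token.
    """
    width = len(kernel)
    best = 0.0  # starting at 0.0 makes the result equal to max(ReLU(activation))
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = bias + sum(
            k * v
            for kernel_row, vec in zip(kernel, window)
            for k, v in zip(kernel_row, vec)
        )
        best = max(best, activation)
    return best


def cnn_forward(sequence, kernels, fc_weights, fc_bias):
    """Feature module (one pooled value per kernel) plus classification module."""
    features = [conv1d_max_pool(sequence, kernel) for kernel in kernels]
    logit = fc_bias + sum(w * f for w, f in zip(fc_weights, features))
    probability = 1.0 / (1.0 + math.exp(-logit))
    return features, probability
```

Here `features` plays the role of the second static content feature vector, and `probability` is the value from which the detection result would be derived.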

Therefore, detection results of the Webshell files of the scripts to be detected in various script types are obtained, and then security defense measures can be executed according to the detection results, so that the security of the system is improved.
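The branching in steps 201 to 213 can be summarized by the following sketch. Every stage is passed in as a callable so that the routing logic is visible without committing to any model. The class name, the field names, the decision threshold, and the contents of `COMPILABLE_LANGUAGES` are assumptions and not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]

# Assumed for illustration only; the disclosure decides this from the language type.
COMPILABLE_LANGUAGES = {"php", "jsp"}


@dataclass
class DetectionPipeline:
    is_compilable: Callable[[str], bool]        # step 202
    static_features: Callable[[str], Vector]    # steps 203-205
    dynamic_features: Callable[[str], Vector]   # steps 206-208
    dnn_classify: Callable[[Vector], float]     # step 210
    cnn_classify: Callable[[str], float]        # steps 211-213

    def detect(self, script: str, language: str, threshold: float = 0.5) -> bool:
        """Return True if the script is judged to be a Webshell file."""
        if self.is_compilable(language):
            # Step 209: merge static and dynamic features (concatenation assumed).
            merged = self.static_features(script) + self.dynamic_features(script)
            score = self.dnn_classify(merged)
        else:
            score = self.cnn_classify(script)
        return score >= threshold
```

Because steps 203 to 205 and 206 to 208 do not depend on each other, the two feature callables could also be run in parallel before the merge.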

According to the technical scheme provided by the embodiments of the present disclosure, for compilable scripts, the dynamic features of the compiled machine execution code are extracted and supplemented with the static features of the script content, and the two kinds of features are fused in a deep learning network model, which improves the Webshell file detection effect for the script to be detected. Meanwhile, for non-compilable scripts, the semantic features of the script content can be automatically extracted through a convolutional neural network, which reduces manual feature processing. The semantic features are automatically acquired from the script to be detected in an end-to-end manner and input into the detection model, which improves the Webshell file detection effect for scripts to be detected in different language types.

It should be noted that for simplicity of description, the above-mentioned method embodiments are described as a series of acts, but those skilled in the art should understand that the present disclosure is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present disclosure. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, the Webshell file detection apparatus 300 of this embodiment may include a vector unit 301, a feature unit 302, and an output unit 303. The vector unit 301 is configured to perform vectorization processing on the script to be detected to obtain vectorized data; the feature unit 302 is configured to perform feature extraction processing on the vectorized data to obtain script feature data; and the output unit 303 is configured to output a Webshell file detection result of the script to be detected by using the pre-trained first machine learning model based on the script feature data.

It should be noted that, part or all of the detection apparatus of the Webshell file in this embodiment may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a security defense processing platform on the network side, which is not particularly limited in this embodiment.

It is to be understood that the application may be a native application (native app) installed on the local terminal, or may also be a web page program (webApp) of a browser on the local terminal, which is not limited in this embodiment.

Optionally, in a possible implementation manner of this embodiment, the vector unit 301 may be specifically configured to, if the script to be detected is a compilable script, perform first vectorization processing on the script to be detected to obtain a first static content feature vector and a dynamic running feature vector of the script to be detected.

In a specific implementation process, the vector unit 301 may be specifically configured to perform word segmentation on the script to be detected to obtain a first vocabulary sequence of the script to be detected; based on a specific vocabulary, performing content statistical processing on the first vocabulary sequence to obtain content statistical parameters of the script to be detected; and obtaining a first static content characteristic vector of the script to be detected by utilizing a pre-trained second machine learning model according to the content statistical parameters.

In another specific implementation process, the vector unit 301 may be specifically configured to compile the script to be detected, so as to obtain a machine execution code of the script to be detected; vectorizing the machine execution code to obtain a vectorized sequence of the machine execution code; and obtaining the dynamic running characteristic vector of the script to be detected by utilizing a pre-trained third machine learning model according to the vectorization sequence of the machine execution code.
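The compile-then-vectorize idea can be illustrated by analogy with Python's own bytecode. This is a stand-in only: Python's standard `dis` module is used because it is readily available, whereas the scripts targeted by the disclosure would require their own compiler front end, which is not shown. The function names and the integer-ID encoding are assumptions; in the disclosure a word vector model would produce the vectorized sequence.

```python
import dis


def opcode_sequence(source):
    """Compile source without executing it and list its opcode names."""
    code = compile(source, "<script>", "exec")
    return [instruction.opname for instruction in dis.get_instructions(code)]


def to_ids(opcodes, table=None):
    """Map each opcode name to a stable integer ID.

    The table grows as new opcode names are seen; reuse it across scripts
    so that the same opcode always receives the same ID.
    """
    table = {} if table is None else table
    ids = [table.setdefault(op, len(table)) for op in opcodes]
    return ids, table
```

The resulting ID sequence preserves instruction order, which is what a sequence model such as an LSTM would consume. Only top-level instructions are listed here; code inside nested functions is not traversed.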

Optionally, in a possible implementation manner of this embodiment, the vector unit 301 may be specifically configured to, if the script to be detected is a non-compilable script, perform second vectorization processing on the script to be detected to obtain a vectorized sequence of the script to be detected.

In a specific implementation process, the vector unit 301 may be specifically configured to perform word segmentation on the script to be detected to obtain a second vocabulary sequence of the script to be detected; and vectorizing the second vocabulary sequence to obtain a vectorized sequence of the script to be detected.

Optionally, in a possible implementation manner of this embodiment, the feature unit 302 may be specifically configured to perform feature merging processing on the first static content feature vector and the dynamic running feature vector to serve as the script feature data.

Optionally, in a possible implementation manner of this embodiment, the feature unit 302 may be specifically configured to perform feature extraction processing on the vectorization sequence of the script to be detected, so as to obtain a second static content feature vector of the script to be detected, which is used as the script feature data.

It should be noted that the method in the embodiment corresponding to fig. 1 and fig. 2 may be implemented by the Webshell file detection device provided in this embodiment. For a detailed description, reference may be made to relevant contents in the embodiments corresponding to fig. 1 and fig. 2, and details are not described here.

FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the electronic device 400 can also be stored. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

A number of components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

Computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 executes the respective methods and processes described above, such as the detection method of the Webshell file. For example, in some embodiments, the detection method of a Webshell file may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the Webshell file detection method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the detection method of the Webshell file in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for detecting a website backdoor (Webshell) file, characterized by comprising the following steps:

vectorizing the script to be detected to obtain vectorized data;

2. The method according to claim 1, wherein the vectorizing the script to be detected to obtain vectorized data comprises:

if the script to be detected is a compilable script, performing first vectorization processing on the script to be detected to obtain a first static content feature vector and a dynamic running feature vector of the script to be detected; or

if the script to be detected is a non-compilable script, performing second vectorization processing on the script to be detected to obtain a vectorized sequence of the script to be detected.

3. The method according to claim 2, wherein the performing first vectorization processing on the script to be detected to obtain a first static content feature vector of the script to be detected comprises:

performing word segmentation processing on the script to be detected to obtain a first vocabulary sequence of the script to be detected;

based on a specific vocabulary, performing content statistical processing on the first vocabulary sequence to obtain content statistical parameters of the script to be detected;

and obtaining a first static content characteristic vector of the script to be detected by utilizing a pre-trained second machine learning model according to the content statistical parameters.

4. The method according to claim 2, wherein the performing first vectorization processing on the script to be detected to obtain a dynamic running feature vector of the script to be detected comprises:

compiling the script to be detected to obtain a machine execution code of the script to be detected;

vectorizing the machine execution code to obtain a vectorized sequence of the machine execution code;

and obtaining the dynamic running characteristic vector of the script to be detected by utilizing a pre-trained third machine learning model according to the vectorization sequence of the machine execution code.

5. The method according to claim 2, wherein the performing second vectorization processing on the script to be detected to obtain a vectorized sequence of the script to be detected comprises:

performing word segmentation processing on the script to be detected to obtain a second vocabulary sequence of the script to be detected;

and vectorizing the second vocabulary sequence to obtain a vectorized sequence of the script to be detected.

6. The method according to any one of claims 2-5, wherein the performing a feature extraction process on the vectorized data to obtain script feature data comprises:

performing feature merging processing on the first static content feature vector and the dynamic operation feature vector to serve as the script feature data; or

performing feature extraction processing on the vectorized sequence of the script to be detected to obtain a second static content feature vector of the script to be detected, wherein the second static content feature vector is used as the script feature data.

7. A detection device for Webshell files at backdoor of a website is characterized by comprising the following components:

8. Device according to claim 7, characterized in that the vector unit, in particular for

9. Device according to claim 8, characterized in that the vector unit, in particular for

Performing word segmentation on the script to be detected to obtain a first vocabulary sequence of the script to be detected;

based on a specific vocabulary, performing content statistical processing on the first vocabulary sequence to obtain content statistical parameters of the script to be detected; and

and obtaining a first static content feature vector of the script to be detected by utilizing a pre-trained second machine learning model according to the content statistical parameters.

10. Device according to claim 8, characterized in that the vector unit, in particular for

vectorizing the machine execution code to obtain a vectorized sequence of the machine execution code; and

11. Device according to claim 8, characterized in that the vector unit, in particular for

Performing word segmentation processing on the script to be detected to obtain a second vocabulary sequence of the script to be detected; and

12. Device according to any of claims 8-11, characterized in that the characteristic unit, in particular for

Performing feature combination processing on the first static content feature vector and the dynamic operation feature vector to serve as the script feature data; or

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.