CN114117430A

CN114117430A - WebShell detection method, electronic device and computer-readable storage medium

Info

Publication number: CN114117430A
Application number: CN202111478134.0A
Authority: CN
Inventors: 邱林枫; 杨钦; 许渊聪; 余浩翔
Original assignee: Shanghai Anshi Information Technology Co ltd
Current assignee: Shanghai Anshi Information Technology Co ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-03-01

Abstract

The application relates to a WebShell detection method, an electronic device and a computer-readable storage medium. The method comprises the following steps: acquiring traffic data interacting with a file to be detected, wherein the traffic data comprises a data packet sequence; extracting the payload of each data packet in the data packet sequence and converting the payload into tensor data to obtain a tensor data sequence; extracting the spatial features of each tensor data in the tensor data sequence by using a trained convolutional neural network to obtain a first output feature map sequence representing the spatial features of each data packet; extracting the temporal features of the first output feature map sequence by using the feature extraction core of a trained long short-term memory network to obtain a second output feature map representing the spatio-temporal features of the traffic data; and determining a first detection result of the traffic data according to the second output feature map, wherein the first detection result indicates whether the file to be detected is a WebShell. The method and the device improve the detection rate of WebShells.

Description

WebShell detection method, electronic device and computer-readable storage medium

Technical Field

The present application relates to the field of information security, and in particular, to a WebShell detection method, an electronic device, and a computer-readable storage medium.

Background

A WebShell is usually implanted in advance into a website server by exploiting website vulnerabilities, after which a remote host accesses the implanted WebShell code over the web. WebShells are mainly used for website management and for hacker intrusion.

The main programming languages used for WebShell source files are PHP, ASP, JSP and the like, and a WebShell exchanges data with the remote host through ports such as 80 or 8080. The malicious HTTP data of this interaction is mixed in with normal HTTP data and is not easy to distinguish. Therefore, once an intruder has successfully implanted a WebShell source file into the server, the intruder can access the WebShell code through a browser to acquire the relevant privileges, and thereby control the server and steal important data. Put simply, being able to execute code is enough to obtain the relevant privileges. An attacker who has obtained these privileges can often forcibly delete data on the server, steal user data, siphon off victims' electronic currency, and so on, which is extremely harmful.

In the traditional WebShell detection method, the following problems mainly exist:

1. The detection rate is low: traditional methods do not detect WebShell files well, particularly encrypted and obfuscated WebShell files.

2. The detection approach is one-dimensional: traditional methods usually examine only files or only traffic, and therefore often cannot provide all-round protection.

3. Only known WebShell files can be identified: traditional rule matching usually requires manual maintenance and can identify only the WebShell file features already present in the rule base, so WebShell files with unknown features are usually missed.

Disclosure of Invention

The application provides a WebShell detection method, an electronic device and a computer-readable storage medium, which aim to solve the problem that WebShell detection methods in the related art are prone to missed detections.

In a first aspect, this embodiment provides a WebShell detection method, including:

acquiring traffic data interacting with a file to be detected, wherein the traffic data comprises a data packet sequence;

extracting the payload of each data packet in the data packet sequence and converting the payload into tensor data to obtain a tensor data sequence;

extracting the spatial features of each tensor data in the tensor data sequence by using a trained convolutional neural network to obtain a first output feature map sequence for representing the spatial features of each data packet;

extracting the temporal features of the first output feature map sequence by using the feature extraction core of a trained long short-term memory network to obtain a second output feature map for representing the spatio-temporal features of the traffic data;

and determining a first detection result of the traffic data according to the second output feature map, wherein the first detection result indicates whether the file to be detected is a WebShell.

In some embodiments, extracting spatial features of each tensor data in the sequence of tensor data using a trained convolutional neural network to obtain a first sequence of output feature maps characterizing the spatial features of each data packet comprises:

extracting spatial features of tensor data in the tensor data sequence by using a trained convolutional neural network to obtain a third output feature map sequence;

and rearranging the feature dimensions of each third output feature map in the third output feature map sequence to obtain the first output feature map sequence, wherein the feature dimensions of each first output feature map in the first output feature map sequence are the same as the feature dimensions of the input feature map of the trained long-short term memory network.

In some embodiments, extracting the payload of each packet in the sequence of packets and converting the payload into tensor data, and obtaining the sequence of tensor data includes:

for each data packet in the data packet sequence, mapping the payload of the data packet into an eight-bit gray scale image with a preset length and width, and converting the eight-bit gray scale image into two-dimensional tensor data, wherein each byte in the payload of the data packet is mapped into one pixel in the eight-bit gray scale image, and the gray scale value of the pixel is the value represented by the byte.

In some embodiments, the file to be detected is a PHP script; before acquiring the traffic data interacting with the file to be detected, the method further comprises:

extracting an Opcode code of the PHP script by adopting a PHP extension plug-in VLD;

performing semantic detection on the Opcode code by using a first semantic detection model based on deep learning to obtain a second detection result of the PHP script, wherein the second detection result represents a first confidence coefficient of the PHP script belonging to WebShell;

and acquiring the first detection result of the PHP script based on the traffic data under the condition that the first confidence coefficient is smaller than a first preset value.

In some of these embodiments, the method further comprises:

and determining that the PHP script is WebShell under the condition that the first confidence is not less than the first preset value.

In some of these embodiments, the method further comprises:

extracting static characteristics of the source code of the file to be detected, and carrying out normalization processing on the static characteristics;

performing source code semantic detection on the normalized static features by using a second semantic detection model based on deep learning to obtain a third detection result of the file to be detected, wherein the third detection result represents a second confidence coefficient that the file to be detected belongs to WebShell;

and when the second confidence coefficient is smaller than a second preset value, acquiring the first detection result of the file to be detected based on the traffic data, and/or acquiring the second detection result of the file to be detected based on the Opcode code.

In some of these embodiments, the second semantic detection model comprises a TextCNN-based source code detection model and a binary classification network-based feature detection model; performing source code semantic detection on the normalized static features by using the second semantic detection model based on deep learning to obtain the third detection result of the file to be detected comprises the following steps:

performing semantic recognition on part or all of the source codes of the file to be detected by using the source code detection model to obtain a fourth detection result;

performing feature classification on the normalized static features by using the feature detection model to obtain a fifth detection result;

and combining the fourth detection result and the fifth detection result to obtain the third detection result of the file to be detected.

In some embodiments, before extracting the static features of the source code of the file to be detected, the method further includes:

and carrying out code cleaning on the source code of the file to be detected so as to remove invalid characters.

In a second aspect, the present embodiment further provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the steps of the method according to the first aspect.

In a third aspect, the present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.

In summary, the WebShell detection method, the electronic device and the computer-readable storage medium provided by the embodiments of the application solve the problem that WebShell detection methods in the related art are prone to missed detections, and improve the detection rate of WebShells.

Drawings

Fig. 1 is a flowchart of the WebShell detection method provided in this embodiment.

Fig. 2 is a schematic structural diagram of the WebShell detection model provided in this embodiment.

Fig. 3 is a flowchart of the multi-angle WebShell detection method provided by this embodiment.

Fig. 4 is a flowchart of the WebShell detection method based on the Opcode code provided in this embodiment.

Fig. 5 is a flowchart of the WebShell detection method based on the static features of the source code provided in this embodiment.

Detailed Description

For a clearer understanding of the objects, aspects and advantages of the present application, reference is made to the following description and accompanying drawings. However, it will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In some instances, well known methods, procedures, systems, components, and/or circuits have been described at a higher level without undue detail in order to avoid obscuring aspects of the application with unnecessary detail. It will be apparent to those of ordinary skill in the art that various changes can be made to the embodiments disclosed herein, and that the general principles defined herein may be applied to other embodiments and applications without departing from the principles and scope of the present application. Thus, the present application is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the scope of the present application as claimed.

Unless defined otherwise, technical or scientific terms used herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application, the terms "a," "an," "the," and the like do not denote a limitation of quantity and may refer to the singular or the plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as referred to in this application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or modules, but may include other steps or modules (elements) not listed or inherent to such process, method, system, article, or apparatus.

Reference to "a plurality" in this application means two or more. In general, the character "/" indicates a relationship in which the objects associated before and after are an "or". The terms "first," "second," "third," and the like in this application are used for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order.

The terms "system," "engine," "unit," "module," and/or "block" referred to herein are a way of distinguishing, by level, different components, elements, parts, assemblies, or functions. These terms may be replaced with other expressions capable of achieving the same purpose. In general, reference herein to a "module," "unit," or "block" refers to a collection of logic or software instructions embodied in hardware or firmware. The "modules," "units," or "blocks" described herein may be implemented as software and/or hardware, and in the case of implementation as software, they may be stored in any type of non-volatile computer-readable storage medium or storage device.

In some embodiments, software modules/units/blocks may be compiled and linked into an executable program. It will be appreciated that software modules may be invokable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on a computing device may be provided on a computer-readable storage medium, such as a compact disc, digital video disc, flash drive, magnetic disk, or any other tangible medium, or provided as a digital download (which may be initially stored in a compressed or installable format that requires installation, decompression, or decryption prior to execution). Such software code may be stored partially or wholly on a storage device of the executing computing device and applied in the operation of the computing device. The software instructions may be embedded in firmware, such as an EPROM. It will also be appreciated that hardware modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or may be included in programmable units, such as programmable gate arrays or processors. The modules/units/blocks or computing device functions described herein may be implemented as software modules/units/blocks, and may also be represented in hardware or firmware. Generally, the modules/units/blocks described herein are logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks, regardless of how they are physically organized or stored. The description may apply to the system, the engine, or a portion thereof.

It will be understood that when an element, engine, module or block is referred to as being "on," "connected to" or "coupled to" another element, engine, module or block, it can be directly on, connected or coupled to or in communication with the other element, engine, module or block, or intervening elements, engines, modules or blocks may be present, unless the context clearly dictates otherwise. In this application, the term "and/or" may include any one or more of the associated listed items or combinations thereof.

This embodiment provides a WebShell detection method. Fig. 1 is a flowchart of the WebShell detection method provided in this embodiment. As shown in Fig. 1, the flowchart includes the following steps:

Step S101, acquiring traffic data interacting with a file to be detected, wherein the traffic data comprises a data packet sequence.

Step S102, extracting the payload of each data packet in the data packet sequence and converting the payload into tensor data to obtain a tensor data sequence.

Step S103, extracting the spatial features of each tensor data in the tensor data sequence by using the trained convolutional neural network to obtain a first output feature map sequence for representing the spatial features of each data packet.

Step S104, extracting the temporal features of the first output feature map sequence by using the feature extraction core of the trained long short-term memory network to obtain a second output feature map for representing the spatio-temporal features of the traffic data.

Step S105, determining a first detection result of the traffic data according to the second output feature map, wherein the first detection result indicates whether the file to be detected is a WebShell.

The traffic data is composed of data packets in time order, and the content of the packets and their transmission order together constitute the feature information of the traffic data. In this embodiment, features related to the content of the packets are referred to as spatial features, and features related to the transmission order of the packets are referred to as temporal features; together they constitute the spatio-temporal features of the traffic data. This embodiment classifies the file to be detected that generates the traffic data based on the spatio-temporal features of the traffic data, to determine the probability (or confidence) that the file to be detected is a WebShell.

In this embodiment, the traffic data interacting with the file to be detected may include data packets sent by the file to be detected, or data packets received by (that is, sent to) the file to be detected. After a WebShell is implanted into a web server, an intruder generally needs to send multiple instructions to the WebShell to complete an intrusion, such as checking whether the WebShell is alive, collecting information about the intruded web server through the WebShell, and performing specific intrusion operations. Therefore, it is feasible to detect a WebShell based on only the sent packets or only the received packets of the file to be detected. In a preferred embodiment, however, WebShell detection is performed on both the sent packets and the received packets of the file to be detected, so as to make full use of the temporal features between sent and received packets and improve the detection rate of WebShells.

The interception and capture of the traffic data may be performed by any means known in the related art, and is not limited herein. For example, in a web server system using a reverse proxy technology, traffic data in each web server may be intercepted and captured on the reverse proxy server, and traffic data interacting with a certain file to be detected may be screened out based on relevant tag information in a data packet of the traffic data.

To obtain the spatial features of the traffic data, this embodiment uses a convolutional neural network; specifically, it extracts the spatial features of each data packet in the time-ordered data packet sequence carried by the traffic data. Data packets are generally stored as PCAP files in the PCAP packet storage format. Each PCAP file includes a file header followed by a plurality of data packets, and each data packet includes a packet header and a payload, and may further include a check part. The information carried in the file header and the packet headers, such as timestamps, MAC addresses and IP addresses, is suitable for parsing the PCAP file or the data packets. From this information, the time order of the data packets and the object with which each data packet interacts can be obtained; that is, whether a given data packet belongs to the file to be detected can be determined from the file header and the packet headers of the PCAP file.
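As an illustration of recovering the time order of packets from a PCAP file, the following is a minimal sketch rather than the method of this embodiment. It assumes the classic PCAP layout (a 24-byte file header followed by records that each begin with a 16-byte record header holding seconds, microseconds, captured length and original length) with microsecond timestamps. The function name `read_pcap_records` is hypothetical.

```python
import struct


def read_pcap_records(data: bytes):
    """Return (timestamp_seconds, frame_bytes) tuples sorted by timestamp.

    Assumes the classic PCAP format with microsecond timestamps.
    """
    if len(data) < 24:
        raise ValueError("truncated PCAP file header")
    # The magic number tells us the byte order used by the writer.
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution PCAP file")

    record_header = struct.Struct(endian + "IIII")
    records = []
    offset = 24  # skip the 24-byte file header
    while offset + record_header.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = record_header.unpack_from(data, offset)
        offset += record_header.size
        frame = data[offset:offset + incl_len]
        if len(frame) < incl_len:
            break  # truncated final record: stop reading
        offset += incl_len
        records.append((ts_sec + ts_usec / 1_000_000, frame))

    # Order packets by capture time so later stages see a time sequence.
    records.sort(key=lambda record: record[0])
    return records
```

The timestamp is the only header field this sketch keeps, since it determines the packet order that the temporal feature extraction relies on.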

However, apart from the timestamp information, the information in the file header and the packet headers is generally irrelevant to the behavior of the file to be detected. Therefore, once the data packet sequence in the traffic data interacting with the file to be detected has been ordered according to the timestamp information, the header information that is irrelevant to the behavior of the file to be detected is noise for the subsequent WebShell detection based on spatio-temporal features, and can be removed by data cleaning. For example, the file header and packet headers are removed entirely, or the data link layer MAC addresses, the network layer IP addresses and the like in the headers are removed, so that only the payload portion is retained.
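The header removal described above can be sketched as follows. This is an illustrative example only: it assumes each captured frame is Ethernet II carrying IPv4 and TCP (the usual case for HTTP traffic), and the function name `tcp_payload` is hypothetical.

```python
import struct

ETHERNET_HEADER_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IP_PROTOCOL_TCP = 6


def tcp_payload(frame: bytes) -> bytes:
    """Strip Ethernet, IPv4 and TCP headers; return b"" for other frames."""
    if len(frame) < ETHERNET_HEADER_LEN + 20:
        return b""
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != ETHERTYPE_IPV4:
        return b""

    ip_start = ETHERNET_HEADER_LEN
    version_ihl = frame[ip_start]
    ip_header_len = (version_ihl & 0x0F) * 4  # IHL is in 32-bit words
    if version_ihl >> 4 != 4 or ip_header_len < 20:
        return b""
    total_len = struct.unpack_from("!H", frame, ip_start + 2)[0]
    if frame[ip_start + 9] != IP_PROTOCOL_TCP:
        return b""

    tcp_start = ip_start + ip_header_len
    if len(frame) < tcp_start + 20:
        return b""
    tcp_header_len = (frame[tcp_start + 12] >> 4) * 4  # data offset in words
    if tcp_header_len < 20:
        return b""

    # Use the IP total length to drop any link-layer padding after the payload.
    payload_start = tcp_start + tcp_header_len
    payload_end = min(ip_start + total_len, len(frame))
    return frame[payload_start:payload_end]
```

Only the bytes returned by this function would be passed on to the image conversion step; MAC and IP addresses never reach the model.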

Data packets are represented and stored in binary form, so they can be conveniently converted into tensor data that a convolutional neural network can process. However, using the binary data packet directly as tensor data has two drawbacks: on the one hand, the data is only one-dimensional and large in volume; on the other hand, behavioral features are sparsely distributed in the binary data, so a convolutional neural network trained directly on the binary data converges slowly. Therefore, in the embodiment of the application, the payload of the data packet is converted into an eight-bit grayscale image, and two-dimensional tensor data is then obtained from the eight-bit grayscale image, which raises the dimensionality of the payload data and visualizes its behavioral features.

The payload of a packet may be represented in hexadecimal form. For example, hexadecimal FF represents decimal 255, and eight binary bits can represent exactly the decimal values 0 to 255. Each pair of hexadecimal digits (i.e., one byte) in the hexadecimal representation of the payload can therefore represent one gray value (i.e., pixel value) of an eight-bit grayscale image. Thus, each byte in the payload of a packet can be mapped to a pixel in an eight-bit grayscale image whose gray value is the value that the byte represents.

In the traffic data, the payload length may differ from packet to packet. In this embodiment, the payload may be processed in at least two ways to generate an eight-bit grayscale image with a preset length and width. One way is to truncate or pad all payloads to the same length and then convert them into eight-bit grayscale images of the same size; the padding may uniformly use 0, 255, or any other single value. The other way is to split each overlong payload into two parts and generate, in order, two eight-bit grayscale images of the same size; the same padding scheme is applied to payloads that are too short and to any too-short part obtained after splitting.
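The byte-to-pixel mapping and the two length-handling options can be sketched as below. This is a minimal illustration, not the implementation of this embodiment: the function name `payload_to_images` and its parameters are hypothetical, images are plain nested lists rather than framework tensors, and the splitting mode is generalized to as many chunks as the payload needs rather than exactly two.

```python
def payload_to_images(payload: bytes, side: int = 32,
                      pad_value: int = 0, split: bool = False):
    """Map payload bytes to side x side 8-bit grayscale images.

    Each byte becomes one pixel whose gray value is the byte value.
    split=False: truncate or pad the payload into a single image.
    split=True:  cut an overlong payload into consecutive chunks and
                 pad the last (short) chunk.
    """
    if not 0 <= pad_value <= 255:
        raise ValueError("pad_value must fit in eight bits")
    size = side * side

    if split:
        chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
        chunks = chunks or [b""]  # an empty payload still yields one image
    else:
        chunks = [payload[:size]]

    images = []
    for chunk in chunks:
        pixels = list(chunk) + [pad_value] * (size - len(chunk))
        # Row-major layout: the first `side` bytes form the first row.
        images.append([pixels[r * side:(r + 1) * side] for r in range(side)])
    return images


# A 3-byte payload padded with zeros into a 2 x 2 image.
print(payload_to_images(bytes([0xFF, 0x10, 0x00]), side=2))
# prints [[[255, 16], [0, 0]]]
```

Row-major filling is an assumption of this sketch; the text fixes only the one-byte-per-pixel mapping, not the order in which pixels are laid out.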

The eight-bit grayscale image is preferably a square image, i.e., an eight-bit grayscale image with the same number of pixels in the length and width directions, for example 32 × 32 pixels or 64 × 64 pixels. The specific size of the eight-bit grayscale image can be determined from statistics of the payload lengths of the data packets, so as to retain as much of the original payload data as possible in each image and to minimize the proportion of padding data, thereby improving the efficiency and detection rate of training or recognition by the convolutional neural network.

Convolutional neural networks have advantages in processing image data. In step S103 of this embodiment, the trained convolutional neural network is used to extract spatial features of each tensor data in the tensor data sequence, so as to obtain a first output feature map sequence for characterizing the spatial features of each data packet, where each tensor data corresponds to one first output feature map.

Since a long short-term memory network (LSTM) is used to extract the temporal features of the traffic data in step S104, and the feature dimension (length) of the input feature map of the LSTM is determined by the number of neurons in each hidden layer of the LSTM, the length of the input feature map of the LSTM should be 160 when, for example, each hidden layer has 160 neurons. When the feature dimension of the output feature map of the spatial features of each tensor data extracted by the convolutional neural network (i.e., the third output feature map) differs from the feature dimension of the input feature map of the LSTM, the feature dimension of the third output feature map may be rearranged, so that the feature dimension of the first output feature map obtained by rearrangement is the same as the feature dimension of the input feature map of the LSTM.

For example, when the third output feature map is 1 × 1600, the third output feature map may be rearranged into an input feature map of 10 × 160 when temporal features are extracted using an LSTM with a neuron number of 160 for each hidden layer.
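The rearrangement in this example can be sketched in plain Python as follows. The function name `reshape_feature_map` is hypothetical, and row-major ordering is an assumption of this sketch, since the text does not state how elements are ordered during the rearrangement.

```python
def reshape_feature_map(flat, steps: int, width: int):
    """Rearrange a flat feature vector into `steps` rows of `width` values."""
    if len(flat) != steps * width:
        raise ValueError("feature map size does not match steps * width")
    # Row-major: consecutive values stay together within one LSTM input row.
    return [flat[i * width:(i + 1) * width] for i in range(steps)]


third_output = list(range(1600))             # stands in for a 1 x 1600 map
first_output = reshape_feature_map(third_output, steps=10, width=160)
print(len(first_output), len(first_output[0]))  # prints 10 160
```

The check on the total size reflects that a rearrangement only changes the shape: 10 × 160 and 1 × 1600 both hold 1600 values.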

Fig. 2 is a schematic structural diagram of the WebShell detection model provided in this embodiment. As shown in Fig. 2, the WebShell detection model mainly includes two parts: a convolutional neural network (CNN) for extracting spatial features and a long short-term memory network (LSTM) for extracting temporal features. The input feature map of the CNN is a single-channel grayscale image of size 40 × 40; after two rounds of convolution and pooling followed by a fully connected layer, a 1 × 1600 output feature map is obtained. The 1 × 1600 output feature map is rearranged (Reshape) and used as the input feature map of the LSTM, which has two hidden layers with 160 neurons each, and finally yields a second output feature map representing the spatio-temporal features of the traffic data. The second output feature map is processed by a classifier to obtain the first detection result of the traffic data. The first detection result includes the confidence that the file to be detected is a WebShell; when the confidence is higher than a set value, the file to be detected can be considered a WebShell.

The WebShell detection model is jointly trained by supervised learning. In the training samples, positive samples are obtained by building an HTTP website, using WebShell tools such as Behinder ("Ice Scorpion") to perform WebShell behaviors such as uploading, using and deleting so as to generate traffic interaction, and capturing the traffic with a capture tool such as Wireshark. Negative samples are obtained by capturing normal Web traffic.

By the above method, the spatio-temporal features related to the behavior of the file to be detected are obtained from the traffic data, and a detection result indicating whether the file to be detected is a WebShell is derived from them. Compared with traditional detection methods based on static features or other spatial features, the detection rate of WebShells is improved, and the failure of spatial feature-based detection methods caused by code obfuscation is fundamentally avoided.

The WebShell detection method based on traffic data can produce a detection result only after a certain amount of traffic data has been collected. When a possible WebShell needs to be detected more quickly and losses avoided as far as possible, a detection mode based on spatial features may first be used to detect the WebShells it is able to identify, and the WebShell detection method based on traffic data is then used for WebShells that the spatial feature-based detection mode cannot identify.

These spatial feature-based detection modes include, but are not limited to, at least one of: a WebShell detection method based on the Opcode code, and a WebShell detection method based on static features of the source code. Fig. 3 is a flowchart of the multi-angle WebShell detection method provided in this embodiment. As shown in Fig. 3, the flowchart includes the following steps:

Step S301, start.

Step S302, performing WebShell detection on the file to be detected based on the static features of the source code.

Step S303, if the detection result shows that the confidence that the file to be detected is a WebShell is greater than the set value, executing step S309; otherwise, executing step S304.

Step S304, performing WebShell detection on the file to be detected based on the Opcode code.

Step S305, if the detection result shows that the confidence that the file to be detected is a WebShell is greater than the set value, executing step S309; otherwise, executing step S306.

Step S306, executing the WebShell detection method based on traffic data shown in steps S101 to S105.

Step S307, if the detection result shows that the confidence that the file to be detected is a WebShell is smaller than the set value, executing step S308; otherwise, executing step S309.

Step S308, determining that the file to be detected is not a WebShell, and ending the detection.

Step S309, determining that the file to be detected is a WebShell, and ending the detection.
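The control flow of steps S301 to S309 can be sketched as a cascade of detectors with per-stage set values. This is an illustrative sketch only: the function name `cascade_detect`, the stage names, and the stand-in detectors and thresholds in the usage example are all hypothetical. Each detector is passed as a zero-argument callable so that a later, more expensive stage (such as traffic capture) runs only if the earlier stages did not flag the file.

```python
from typing import Callable, Optional, Sequence, Tuple

# (stage name, callable returning a confidence in [0, 1], set value)
Stage = Tuple[str, Callable[[], float], float]


def cascade_detect(stages: Sequence[Stage]) -> Tuple[bool, Optional[str]]:
    """Run stages in order; return (is_webshell, name of the flagging stage)."""
    for index, (name, detector, set_value) in enumerate(stages):
        confidence = detector()
        is_last = index == len(stages) - 1
        # Earlier stages flag only when confidence is greater than the set
        # value (S303, S305); the final stage clears the file only when
        # confidence is smaller than the set value (S307).
        flagged = confidence >= set_value if is_last else confidence > set_value
        if flagged:
            return True, name
    return False, None


# Usage with stand-in detectors and illustrative set values.
result = cascade_detect([
    ("static", lambda: 0.30, 0.70),
    ("opcode", lambda: 0.90, 0.75),
    ("traffic", lambda: 0.99, 0.50),
])
print(result)  # prints (True, 'opcode')
```

Because the stages are an ordered sequence, swapping the static-feature and Opcode stages only requires reordering the list.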

It should be noted that, although Fig. 3 shows the WebShell detection based on the static features of the source code being performed first and the WebShell detection based on the Opcode code being performed afterwards, the present application does not limit the order of the two detections. For example, the WebShell detection based on the Opcode code may be performed first, followed by the WebShell detection based on the static features of the source code. Alternatively, the two spatial feature-based WebShell detection methods may be performed simultaneously, and whether to further perform the WebShell detection based on traffic data is then determined from their detection results.

In addition, the confidence set values of the detection methods may be the same, or may be set separately according to the characteristics of the different detection models. Each set value may also be larger or smaller than the value used when the corresponding WebShell detection model is deployed alone, so as to improve the stability of the overall scheme or to optimize the false alarm rate.

For example, the stability of the WebShell detection method based on the static features of the source code is relatively poor, and when a single model is adopted, the set value of the confidence coefficient is usually small, for example, set to 0.6, in order to reduce the false negative rate of the WebShell. When the set value of the confidence coefficient is 0.6, although the undetected rate of the WebShell is reduced, the situation that the normal file is mistakenly reported as the WebShell occurs. When multi-angle WebShell detection is adopted, the set value of confidence level in the WebShell detection method based on the static characteristics of the source codes can be set to be a value larger than 0.6, and therefore the occurrence of false alarm is reduced.

As another example, the WebShell detection method based on the Opcode code is relatively stable. When it is used as a single model, its confidence threshold is usually set high, for example to 0.8, to avoid misreporting normal files as WebShells. With a threshold of 0.8, such false alarms are avoided, but the miss rate is not negligible. When multi-angle WebShell detection is adopted, the WebShell detection based on the flow data is performed after the Opcode-based detection, so the confidence threshold of the Opcode-based method can be set to a value smaller than 0.8.
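The threshold adjustment described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function name cascade_detect, the stand-in scoring functions, the stage order, and the cascade thresholds 0.70, 0.75 and 0.50 are all assumptions, chosen only to show a source-code threshold above 0.6 and an Opcode threshold below 0.8.

```python
from typing import Callable, List, Tuple

# A stage is (name, scoring function, threshold). Each scoring function
# returns a confidence in [0, 1] that the file is a WebShell.
Stage = Tuple[str, Callable[[str], float], float]


def cascade_detect(path: str, stages: List[Stage]) -> Tuple[bool, str]:
    """Run the stages in order and stop at the first stage whose
    confidence reaches its threshold.

    Returns (is_webshell, name_of_deciding_stage).
    """
    for name, score, threshold in stages:
        if score(path) >= threshold:
            return True, name
    return False, "none"


# Hypothetical stand-in scorers; real ones would wrap the trained models.
def static_score(path: str) -> float:
    return 0.65


def opcode_score(path: str) -> float:
    return 0.78


def traffic_score(path: str) -> float:
    return 0.10


stages: List[Stage] = [
    ("static", static_score, 0.70),    # assumed: raised above the single-model 0.6
    ("opcode", opcode_score, 0.75),    # assumed: lowered below the single-model 0.8
    ("traffic", traffic_score, 0.50),  # assumed value, not given in the text
]

print(cascade_detect("sample.php", stages))
```

With these stand-in scores, the static stage no longer fires at 0.65 because its threshold was raised, while the Opcode stage fires at 0.78 because its threshold was lowered.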

Fig. 4 is a flowchart of the WebShell detection method based on the Opcode code provided in this embodiment, where the file to be detected is a PHP script. As shown in fig. 4, the process includes the following steps:

Step S401, extracting the Opcode code of the PHP script by using the PHP extension plug-in VLD.

Step S402, performing semantic detection on the Opcode code by using a first semantic detection model based on deep learning to obtain a second detection result of the PHP script, wherein the second detection result represents a first confidence that the PHP script is a WebShell.

Step S403, when the first confidence is smaller than the first preset value, obtaining a first detection result of the PHP script based on the WebShell detection method shown in steps S101 to S105.

In some embodiments, when the first confidence is not less than the first preset value, the PHP script is determined to be WebShell. By the method, quick identification of part of WebShell is realized, and the risk of damage to the web server is reduced.

The first semantic detection model based on deep learning described above may be a TextCNN-based detection model. Training the TextCNN model includes data preparation, VLD extension plug-in installation, Opcode code extraction, model construction, and model optimization, which are described separately below.

Step 1: data preparation.

In the training samples of this embodiment, the positive samples come from WebShell collection projects on GitHub and well-known open source CMS systems, covering 10 large, well-known open source WebShell projects such as PHP-WebShell; 6328 PHP WebShell files were collected in total. The negative samples come from 6 open source projects such as PHP-begin; 7512 normal PHP files were collected in total. The positive and negative samples are well balanced, which avoids the problem of the model drifting toward the majority class during training.

Step 2: VLD extension plug-in installation.

First, the PHP language environment is downloaded and installed; the VLD extension plug-in is then downloaded and installed into the local environment. A program then executes the following system command:

php -dvld.active=1 -dvld.execute=0 filename.php

The Opcode data of each PHP file is then extracted from the command output by using a regular expression.
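As a hedged illustration of this extraction step, the Python sketch below runs the VLD command and pulls the Opcode names out of its tabular dump. The function names run_vld and extract_opcodes, the regular expression, and the hand-written sample text (which only imitates the layout of a VLD dump) are assumptions; the embodiment does not disclose the regular expression it actually uses.

```python
import re
import subprocess
from typing import List

# Assumed pattern: an Opcode name is the first run of three or more
# capital letters/underscores in a table row (this skips the one-letter
# "E" entry marker that can precede the Opcode column).
OPCODE_RE = re.compile(r"\b[A-Z][A-Z_]{2,}\b")


def run_vld(php_file: str) -> str:
    """Dump the Opcodes of php_file without executing it.

    stderr is merged into stdout so the dump is captured whichever
    stream the extension writes to.
    """
    proc = subprocess.run(
        ["php", "-dvld.active=1", "-dvld.execute=0", php_file],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return proc.stdout


def extract_opcodes(vld_output: str) -> List[str]:
    """Return the Opcode names found in the table rows of a VLD dump."""
    opcodes: List[str] = []
    in_table = False
    for line in vld_output.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            in_table = True   # the dashed rule opens a table body
            continue
        if not in_table:
            continue
        if not stripped:
            in_table = False  # a blank line closes the table body
            continue
        match = OPCODE_RE.search(line)
        if match:
            opcodes.append(match.group(0))
    return opcodes


# Hand-written sample imitating the layout of a VLD dump (illustrative only).
SAMPLE_VLD_OUTPUT = """\
filename:       /tmp/sample.php
number of ops:  6
line     #* E I O op                 fetch      ext  return  operands
----------------------------------------------------------------------
   2     0  E >   INIT_FCALL                                 'system'
         1        FETCH_R            global          $0      '_GET'
         2        FETCH_DIM_R                        $1      $0, 'cmd'
         3        SEND_VAR                                   $1
         4        DO_ICALL
   3     5      > RETURN                                     1

branch: #  0; line:     2-    3; sop:     0; eop:     5
"""

print(extract_opcodes(SAMPLE_VLD_OUTPUT))
```

Parsing row by row, rather than scanning the whole dump at once, keeps header text and quoted operands from being mistaken for Opcodes.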

Step 3: TextCNN model construction.

Keras is used as the deep learning framework and TextCNN as the model network. The word-window (convolution kernel) sizes are 3, 4, and 5; word vectors are produced by a word embedding layer with dimension 30; and the binary cross-entropy loss (binary_crossentropy in Keras) is selected as the loss function. The input is the first 1000 Opcodes of the Opcode sequence, and the output is the probability that the file is a WebShell.
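Before an Opcode sequence can be fed to an embedding layer, it has to be turned into a fixed-length list of integer indices. The Python sketch below shows one way to do that for the 1000-Opcode input length mentioned above. The helper names build_vocab and encode, and the use of index 0 for padding and index 1 for unseen Opcodes, are assumptions; the embodiment does not state how short sequences or unknown Opcodes are handled.

```python
from typing import Dict, Iterable, List

MAX_LEN = 1000  # the embodiment feeds the first 1000 Opcodes to the model
PAD_ID = 0      # assumed padding index for sequences shorter than MAX_LEN
UNK_ID = 1      # assumed index for Opcodes not seen when building the vocabulary


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct Opcode an integer index, starting at 2."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for op in seq:
            if op not in vocab:
                vocab[op] = len(vocab) + 2  # 0 and 1 are reserved
    return vocab


def encode(seq: List[str], vocab: Dict[str, int],
           max_len: int = MAX_LEN) -> List[int]:
    """Truncate to the first max_len Opcodes, map to indices, right-pad."""
    ids = [vocab.get(op, UNK_ID) for op in seq[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab([["ECHO", "RETURN"]])
print(encode(["ECHO", "EXIT", "RETURN"], vocab, max_len=5))
```

Each encoded row has the same length, so a batch of rows forms a rectangular integer matrix that an embedding layer of dimension 30 can consume.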

Step 4: model optimization.

The constructed network model is trained in a supervised manner, using the deep learning framework Keras, to obtain the first semantic detection model.

Fig. 5 is a flowchart of the WebShell detection method based on the static feature of the source code provided in this embodiment, and as shown in fig. 5, the flowchart includes the following steps:

Step S501, extracting static features of the source code of the file to be detected, and normalizing the static features.

Step S502, performing source code semantic detection on the normalized static features by using a second semantic detection model based on deep learning to obtain a third detection result of the file to be detected, wherein the third detection result represents a second confidence that the file to be detected is a WebShell.

Step S503, when the second confidence is smaller than the second preset value, obtaining the first detection result of the PHP script based on the WebShell detection method shown in steps S101 to S105, and/or obtaining the first detection result or the second detection result of the PHP script based on the WebShell detection method shown in steps S401 to S403.

In some embodiments, when the second confidence is not less than the second preset value, it is determined that the file to be detected is WebShell. By the method, quick identification of part of WebShell is realized, and the risk of damage to the web server is reduced.

In some of these embodiments, the second semantic detection model includes a TextCNN-based source code detection model and a feature detection model based on a binary classification network. Step S502 includes the following steps:

Step S502-1, performing semantic recognition on part or all of the source code of the file to be detected by using the source code detection model to obtain a fourth detection result.

Step S502-2, performing feature classification on the normalized static features by using the feature detection model to obtain a fifth detection result.

Step S502-3, combining the fourth detection result and the fifth detection result to obtain the third detection result of the file to be detected.

Because the second semantic detection model combines the TextCNN-based source code detection model with the feature detection model based on a binary classification network, its classification features are derived both from the semantic features of the source code and from the normalized static features of the source code, which improves the discrimination capability of the second semantic detection model.

In step S502-3 above, the fourth detection result and the fifth detection result may be normalized classification feature maps. Merging classification feature maps is also called feature fusion, and the fusion may be performed by concatenation (concat) or by element-wise addition (add); this embodiment does not limit the fusion manner.
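The difference between the two fusion manners can be shown with a small Python sketch on plain lists. The function names fuse_concat and fuse_add and the example vectors are hypothetical; in a real model the same operations would be applied to the feature tensors of the two sub-models.

```python
from typing import List


def fuse_concat(a: List[float], b: List[float]) -> List[float]:
    """Concat fusion: join the two feature vectors end to end.
    The result has len(a) + len(b) entries."""
    return list(a) + list(b)


def fuse_add(a: List[float], b: List[float]) -> List[float]:
    """Add fusion: sum the two feature vectors element by element.
    Both inputs must have the same length, which is also the output length."""
    if len(a) != len(b):
        raise ValueError("add fusion needs feature vectors of equal length")
    return [x + y for x, y in zip(a, b)]


fourth = [0.25, 0.75]  # hypothetical output of the source code detection model
fifth = [0.5, 0.5]     # hypothetical output of the feature detection model

print(fuse_concat(fourth, fifth))
print(fuse_add(fourth, fifth))
```

Concatenation keeps the two sources separate and widens the feature vector, whereas addition keeps the width unchanged but requires both sub-models to emit vectors of the same size.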

In some embodiments, if the training samples used to train the second semantic detection model were subjected to code cleaning, the source code of the file to be detected may likewise be cleaned before step S501 to remove invalid characters. Code cleaning reduces noise interference in the static features and thereby improves the discrimination capability of the second semantic detection model.

Training the second semantic detection model based on deep learning includes data preparation, data cleaning and data feature extraction, model construction, and model optimization, which are described separately below.

Step 1: data preparation.

In the training samples of this embodiment, the positive samples come from WebShell collection projects on GitHub and well-known open source CMS systems, covering 10 large, well-known open source WebShell projects such as PHP-WebShell; 6328 PHP WebShell files were collected in total. The negative samples come from 6 open source projects such as PHP-begin; 7512 normal PHP files were collected in total. The positive and negative samples are well balanced, which avoids the problem of the model drifting toward the majority class during training.

Step 2: data cleaning and data feature extraction.

Invalid characters, such as comments, are removed from the data set files through data cleaning. The file length and the file entropy (a measure of the complexity of the file) are extracted as features and normalized to the interval [-1, 1]. The data set files are then labeled and randomly shuffled.
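The cleaning and feature extraction step can be sketched in Python as follows. Several points are assumptions rather than details from the embodiment: the comment-stripping regular expression is deliberately naive (it would also strip comment markers that appear inside string literals), the file entropy is taken to be character-level Shannon entropy, and the mapping to [-1, 1] is taken to be min-max scaling over the data set. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

# Naive PHP comment pattern: /* ... */, // ..., and # ... (assumption).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)


def strip_comments(source: str) -> str:
    """Remove block and line comments from PHP source text."""
    return COMMENT_RE.sub("", source)


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scale_to_unit_interval(values: List[float]) -> List[float]:
    """Min-max scale a feature column to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map to the midpoint
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


files = [
    "<?php /* greeting */ echo 'hi'; // done\n",
    "<?php eval(base64_decode($_POST['x']));\n",
]
cleaned = [strip_comments(f) for f in files]
lengths = scale_to_unit_interval([float(len(c)) for c in cleaned])
entropies = scale_to_unit_interval([shannon_entropy(c) for c in cleaned])
print(list(zip(lengths, entropies)))
```

Obfuscated or encoded payloads tend to use a wider and more uniform mix of characters than hand-written code, which is why entropy is a plausible complexity feature alongside file length.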

Step 3: model construction.

TextCNN is used as the main network, and a binary classification network is built on file features that are strongly correlated with WebShells, such as the file length and the file entropy; the two are combined to form a preliminary file detection model. The input of the TextCNN model is the first 1337 source code words after data cleaning, and its output is the probability that the file is a WebShell. The input of the binary classification network is the normalized file length and file entropy features, and its output is likewise the probability that the file is a WebShell. The outputs of the two models are fused (combined) to produce the final WebShell recognition probability.

Step 4: model optimization.

The constructed network model is trained in a supervised manner, using the deep learning framework Keras, to obtain the second semantic detection model.

The first semantic detection model, the second semantic detection model, and the WebShell detection model based on the flow data are each constructed and trained with the network structures and network parameters described above, and together form a combined model. The combined model was tested on a test set of 3286 samples drawn from the collected samples, and its WebShell detection rate was verified to be higher than 99.75%. The detection rate of the combined model does not decrease significantly for WebShell files that use obfuscation or encryption, and the combined model also achieves a high detection rate for previously unknown WebShell characteristic behaviors.

In at least one of the WebShell detection methods provided by this embodiment, training data sets are collected from different aspects through multi-angle feature analysis of WebShell files; different network detection models are then constructed based on the characteristics of each data set; finally, the detection models are combined and their strategies superimposed, so that WebShells are detected and removed from multiple angles. This addresses the low detection rate, limited application scenarios, and high manual rule maintenance cost of traditional methods, and, by making full use of the differences between the feature data sets of files, enables the models to learn deeper semantic feature knowledge, thereby improving the detection effect.

The present embodiment also provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to execute any WebShell detection method provided by the present embodiment.

The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any of the WebShell detection methods provided by the present embodiment.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A WebShell detection method is characterized by comprising the following steps:

2. The method of claim 1, wherein extracting spatial features of each tensor data in the sequence of tensor data using a trained convolutional neural network to obtain a first sequence of output signatures characterizing the spatial features of each packet comprises:

3. The method of claim 1, wherein extracting the payload of each packet in the sequence of packets and converting the payload into tensor data, and wherein obtaining the sequence of tensor data comprises:

4. The method according to claim 1, wherein the file to be detected is a PHP script; before obtaining traffic data, the traffic data comprising a sequence of data packets, the method further comprises:

5. The method of claim 4, further comprising:

6. The method according to any one of claims 1 to 5, further comprising:

7. The method of claim 6, wherein the second semantic detection model comprises a Text CNN-based source code detection model and a binary network-based feature detection model; performing source code semantic detection on the normalized static features by using a second semantic detection model based on deep learning, and obtaining a third detection result of the file to be detected comprises the following steps:

8. The method according to claim 6, wherein before extracting the static features of the source code of the file to be detected, the method further comprises:

9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the steps of the method according to any of claims 1 to 8.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.