CN110719313A

CN110719313A - Webshell detection method based on log session

Info

Publication number: CN110719313A
Application number: CN201910278375.7A
Authority: CN
Inventors: 黄诚; 吴怡欣; 孙宇强
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2020-01-21

Abstract

Attackers upload Webshells to Web servers for malicious purposes such as stealing data, launching DDoS attacks, and modifying files, and a successful attack can cause significant loss to the victim. As encryption and obfuscation techniques mature, the most common detection methods, which rely on taint analysis and feature matching, may no longer be effective. The present invention proposes a simpler and more efficient framework that detects Webshell traffic from accurately identified sessions derived from Web logs, rather than from source code, POST content, or network traffic. Features are extracted from the raw sequence data in the Web logs, and a statistical method based on time intervals is used to identify sessions accurately. The framework classifies the sessions with a long short-term memory (LSTM) neural network. Once Webshell communication is detected, the abnormal sessions can be identified and the Webshell files can be located by a statistical method.

Description

Webshell detection method based on log session

Technical Field

The invention provides a Webshell detection method based on log sessions, which is used to detect Webshell access records that may exist in logs. The detection method is divided into two modules. The first module extracts key fields from log entries, encodes some of the key fields, calculates the access time intervals between entries in each session, computes statistics on these intervals to subdivide the sessions further, and constructs feature vectors. The second module uses an LSTM neural network to judge whether Webshell communication exists in a session, locates the specific Webshell file by a statistical method, and lists the likelihood that each file is a Webshell. The method detects and locates Webshells using only the Web server logs, which reduces the required input data and improves detection accuracy.

Background

A Webshell is a script program that runs on a Web server through the associated interpreter (such as NodeJS or PHP) and provides a command execution environment to the party controlling it; it is also called a website backdoor. Through the Webshell, the controller can often perform sensitive operations such as adding, deleting and modifying files, changing permissions, executing code, and operating on databases. Because the Webshell performs these functions through the Web service, a firewall normally cannot intercept it and no record is left in the system log; only data submission records are left in the Web access log of the website. In most cases the Web log does not record the submitted parameters or data values, so the specific communication content is difficult to identify.

Many security enthusiasts publish scanning tools that detect whether a Webshell exists on a server; a common example is D-Shield. These tools use code auditing to judge whether a file may be a Webshell, but they are ineffective against code execution caused by logic flaws in otherwise normal files. One example is the Typecho front-end arbitrary code execution vulnerability (SSV-96691), in which a normal installation script is used as a Webshell and is not discovered by the D-Shield code audit.

With the development of machine learning technology, machine learning has achieved great success in language processing, image recognition, and similar fields. The essence of Webshell recognition is finding abnormal samples among normal log access records, so machine learning can be applied to the complete log to find Webshell files that may be hidden in it.

Based on the above, a Webshell detection method based on log session sequences is proposed to reflect the security of a website.

Disclosure of Invention

In order to detect Webshells more simply and efficiently, the invention provides a detection method based on log sessions. Sessions are extracted from the log of a Web server and accurately identified by a statistical method based on time intervals. Feature vectors are then constructed from the sessions and classified by a trained machine learning model. Because the method works on log sessions rather than on the Webshell file itself, it avoids interference from the complex encryption and obfuscation techniques used in Webshells, greatly reduces the workload, and maintains high accuracy and recall. The method mainly comprises a data processing module and a machine learning module.

A data processing module: this module performs feature extraction from log entries, accurate identification of sessions, and construction of feature vectors. It extracts key fields from the log entries, encodes some of the fields, calculates the access time intervals between entries in each session, computes statistics on these intervals to subdivide the sessions further, and constructs the feature vectors.
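The field extraction step can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name a log format, so the sketch assumes the Apache/Nginx "combined" layout, and the names `LOG_PATTERN` and `parse_entry` as well as the sample log line are hypothetical.

```python
import re
from datetime import datetime

# Assumption: Apache/Nginx "combined" log format. The patent does not
# specify the format; adjust the pattern for other layouts.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of the key fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Subdivide the request field ("POST /a.php HTTP/1.1") into method and path.
    parts = entry.pop("request").split()
    entry["method"] = parts[0] if parts else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    entry["status"] = int(entry["status"])
    # A "-" in the bytes column means no body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Made-up sample line for illustration only.
sample = ('203.0.113.7 - - [09/Apr/2019:10:15:32 +0800] '
          '"POST /upload/shell.php HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
entry = parse_entry(sample)
```

Returning `None` for lines that do not match lets a caller skip malformed entries instead of aborting the whole log.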

A machine learning module: this module builds a model with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data, takes the feature vectors constructed by the previous module as input, and judges whether Webshell communication exists in the user's accesses. Once Webshell communication is detected, the anomalous sessions can be pinpointed and the Webshell file can be located by a statistical method.

Drawings

FIG. 1 is a schematic diagram of the framework of the present invention.

FIG. 2 is a schematic diagram of feature extraction in a data processing module according to the present invention.

FIG. 3 is a schematic diagram of session identification in a data processing module of the present invention.

FIG. 4 is a schematic diagram of the machine learning module of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings (FIGS. 1 to 4) and the detailed description. The invention is a Webshell detection method based on log sessions, consisting of a data processing module and a machine learning module.

A data processing module: the module is divided into two parts, feature extraction and session identification. Feature extraction first extracts the host, request, time, referrer, status, bytes and user-agent fields from the collected log entries. The request field is subdivided into a method field and a path field, which represent the method and the path of the request respectively. The bytes and status fields are used directly in the feature vector. The method field is encoded as 2 for POST, 1 for GET, and 0 for any other method. The referrer field is encoded as 0 when it is empty, 1 for a link from the same site, 3 for a link from an external site, and 2 for a web crawler. The path field is encoded by the degree of correlation between the access paths of the entries in a session. The first entry of each session is coded as -1, because no access precedes it and no relative relationship can be determined; an access to the same file is coded as 0; an access to a different file is coded by the distance between the files, calculated from the number of directory switches. Finally, 1 is added to all values.

The session identification part first performs coarse identification by host IP and user-agent. The time differences between entries in each session are then calculated and their statistics computed, and the 70% quantile is taken as the threshold for deciding whether to subdivide a session further (experiments showed the 70% quantile to be the most suitable threshold). That is, in each coarsely identified session, the time interval between every two consecutive entries is compared with the threshold; if the interval is greater than the threshold, the session is split at that point into two smaller sessions, and so on until the sessions are accurately identified.
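The encodings and the interval-based subdivision described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The text does not say how a crawler is recognized, how directory switches are counted, or whether the 70% quantile is taken per session or over all sessions. The sketch therefore takes a hypothetical `is_crawler` flag, counts directory switches as steps up to the common ancestor plus steps down (plus one, so that a different file in the same directory differs from the same file), ignores query strings when comparing paths, and pools the intervals of all coarse sessions. All function names and the `ts` key (timestamp in seconds) are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def encode_method(method):
    # 2 for POST, 1 for GET, 0 for everything else.
    return {"POST": 2, "GET": 1}.get(method.upper(), 0)


def encode_referrer(referrer, site_host, is_crawler=False):
    # 0 = empty, 1 = same site, 2 = web crawler, 3 = external site.
    # Assumption: crawler detection happens elsewhere and takes precedence.
    if is_crawler:
        return 2
    if referrer in ("", "-"):
        return 0
    return 1 if urlsplit(referrer).hostname == site_host else 3


def _split(path):
    # Drop any query string, then separate directories from the file name.
    parts = path.split("?", 1)[0].strip("/").split("/")
    return parts[:-1], parts[-1]


def path_distance(prev_path, path):
    # Assumed metric: 0 for the same file, otherwise 1 + directory switches.
    prev_dirs, prev_file = _split(prev_path)
    dirs, name = _split(path)
    if prev_dirs == dirs and prev_file == name:
        return 0
    common = 0
    for a, b in zip(prev_dirs, dirs):
        if a != b:
            break
        common += 1
    return (len(prev_dirs) - common) + (len(dirs) - common) + 1


def encode_paths(paths):
    # First entry is -1, later entries are distances; then add 1 to all.
    if not paths:
        return []
    codes = [-1] + [path_distance(a, b) for a, b in zip(paths, paths[1:])]
    return [c + 1 for c in codes]


def quantile(values, q):
    # Linear-interpolation quantile (the interpolation rule is an assumption).
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def group_coarse(entries):
    # Coarse sessions: same host IP and same user-agent, ordered by time.
    groups = defaultdict(list)
    for e in entries:
        groups[(e["host"], e["user_agent"])].append(e)
    return [sorted(g, key=lambda e: e["ts"]) for g in groups.values()]


def split_sessions(coarse_sessions, q=0.7):
    # Pool all inter-entry gaps, then cut wherever a gap exceeds the quantile.
    gaps = [b["ts"] - a["ts"]
            for s in coarse_sessions for a, b in zip(s, s[1:])]
    if not gaps:
        return [list(s) for s in coarse_sessions if s]
    threshold = quantile(gaps, q)
    fine = []
    for s in coarse_sessions:
        if not s:
            continue
        current = [s[0]]
        for prev, entry in zip(s, s[1:]):
            if entry["ts"] - prev["ts"] > threshold:
                fine.append(current)
                current = []
            current.append(entry)
        fine.append(current)
    return fine
```

Under these assumptions, `encode_paths(["/index.php", "/index.php", "/admin/login.php"])` yields `[0, 1, 3]`. Path encoding is applied after `split_sessions`, since the first-entry code depends on where each final session begins.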

A machine learning module: a model is built with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data. The preprocessed feature vectors are input to the model, which judges whether Webshell communication exists in the user's accesses. Once Webshell communication is detected, the anomalous sessions can be pinpointed and the Webshell file can be located by a statistical method.
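The text does not specify which statistical method is used to locate the Webshell file. As one possible illustration only, the sketch below ranks each requested path by the fraction of the sessions containing it that the classifier flagged as anomalous. The scoring rule, the function name `rank_suspect_files`, and the input layout (one list of paths per session, with a parallel list of classifier verdicts) are all assumptions.

```python
from collections import Counter


def rank_suspect_files(sessions, flags):
    """Rank paths by how often they occur in flagged sessions.

    sessions: list of sessions, each a list of request paths.
    flags:    parallel list of booleans from the session classifier.
    Returns (path, score) pairs, most suspicious first. The score is the
    share of sessions containing the path that were flagged (an assumed
    statistic, not one given in the source text).
    """
    flagged = Counter()
    total = Counter()
    for paths, is_anomalous in zip(sessions, flags):
        for path in set(paths):  # count each path once per session
            total[path] += 1
            if is_anomalous:
                flagged[path] += 1
    scores = {p: flagged[p] / total[p] for p in flagged}
    # Sort by score, then by number of flagged sessions, then by name.
    return sorted(scores.items(),
                  key=lambda kv: (-kv[1], -flagged[kv[0]], kv[0]))


# Made-up example: two of four sessions were flagged by the classifier.
sessions = [
    ["/index.php", "/a.php"],
    ["/index.php", "/up/shell.php"],
    ["/index.php", "/up/shell.php"],
    ["/index.php"],
]
flags = [False, True, True, False]
ranked = rank_suspect_files(sessions, flags)
```

A path that appears only in flagged sessions scores 1.0, while a page that every visitor requests is pulled down by its normal sessions, so commonly visited files rank below files seen almost exclusively in anomalous traffic.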

Claims

1. A Webshell detection method based on log sessions, characterized in that the method comprises a data processing module and a machine learning module.

2. The Webshell detection method based on the log session as recited in claim 1, wherein the data processing comprises the following specific steps:

A. extracting a log generated when the Web server operates;

B. extracting the host IP, request, time, referrer, status, bytes and user-agent fields from each log entry by regular expression matching;

C. subdividing a request field into a method and a path which respectively represent a request method and a request path;

D. encoding the method field: 2 when the method is POST, 1 when the method is GET, and 0 otherwise;

E. encoding the referrer field: a null value is marked as 0, a link from the same site is marked as 1, a link from an external site is marked as 3, and a web crawler is marked as 2;

F. encoding the path field by the degree of correlation between the access paths of the entries in the session: the first entry of each session is coded as -1, because no access precedes it and no relative relationship can be determined; an access to the same file is coded as 0; an access to a different file is coded by the distance between the files, calculated from the number of directory switches; finally, 1 is added to all values;

G. performing coarse session identification by host IP and user-agent;

H. calculating the time differences between the entries in each session and computing their statistics, and taking the 70% quantile as the threshold for deciding whether to subdivide the session further (experiments showed the 70% quantile to be the most suitable threshold);

I. in each coarsely identified session, comparing the time interval between every two consecutive entries with the threshold, and if the interval is greater than the threshold, splitting the session into two smaller sessions, and so on until the sessions are accurately identified.

3. The Webshell detection method based on the log session as claimed in claim 1, wherein: the specific steps of machine learning include:

A. constructing a model with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data, inputting the preprocessed feature vectors, and judging whether Webshell communication exists in the user's accesses;

B. once Webshell communication is detected, pinpointing the anomalous sessions and locating the Webshell file by a statistical method.