CN110719313A - Webshell detection method based on log session - Google Patents


Info

Publication number
CN110719313A
Authority
CN
China
Prior art keywords
webshell
session
log
detection method
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910278375.7A
Other languages
Chinese (zh)
Inventor
黄诚
吴怡欣
孙宇强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910278375.7A
Publication of CN110719313A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

An attacker uploads a Webshell to a Web server for malicious purposes such as stealing data, launching DDoS attacks, and modifying files. Once these goals are achieved, the victim can suffer significant losses. As encryption and obfuscation techniques develop, the most common detection methods, taint analysis and feature matching, may no longer be effective. The present invention proposes a simpler, more efficient framework that detects Webshell traffic using accurately identified sessions derived from Web logs, rather than source code files, POST content, or network traffic. Features are extracted from the raw sequence data in the Web logs, and a statistical method based on time intervals is proposed to identify sessions accurately. The framework is combined with a long short-term memory (LSTM) neural network. Once Webshell communication is detected, the abnormal sessions can be identified and the Webshell files can be located accurately with a statistical method.

Description

Webshell detection method based on log session
Technical Field
The invention provides a Webshell detection method based on log sessions, which is used to detect Webshell access records that may exist in logs. The detection method is divided into two modules. The first module extracts key fields from the log entries, encodes some of those fields, calculates the access time intervals between entries in each session, uses statistics on those intervals to subdivide the sessions further, and constructs feature vectors. The second module uses an LSTM neural network to judge whether Webshell communication exists in a session, locates the specific Webshell file with a statistical method, and lists the likelihood that each file is a Webshell file. The method detects and locates a Webshell using only the Web server logs, which reduces the input data required and improves detection accuracy.
Background
A Webshell is a type of script program that runs on a Web server with an associated interpreter (such as NodeJS or PHP) and provides a command execution environment for its controller; it can also be called a website backdoor. Through a Webshell, the controller can often carry out sensitive operations such as adding, deleting and modifying files, changing permissions, executing code, and operating on databases. Because the Webshell performs these functions through the Web service, a firewall normally cannot intercept it, no record is left in the system log, and only a data submission record is left in the website's Web access log. In most cases the Web log does not record the specific parameters or data values submitted, so the specific communication content is difficult to identify.
Many security enthusiasts publish scanning tools that detect whether a Webshell exists on a server; a common example is D-Shield. These tools judge whether a file may be a Webshell through code auditing, but they are ineffective against code execution caused by logic bugs in otherwise normal files. One example is the Typecho front-end arbitrary code execution vulnerability (SSV-96691), in which a normal installation program is used as a Webshell and cannot be discovered by D-Shield's code auditing.
Machine learning has made great progress in fields such as language processing and image recognition. In essence, Webshell recognition means finding abnormal samples among normal log access records, so machine learning can be applied to the whole log to find Webshell files whose traces may be hidden in it.
Based on this idea, a Webshell detection method based on log session sequences is proposed to reflect the security of a website.
Disclosure of Invention
To detect Webshells more simply and efficiently, the invention provides a detection method based on log sessions. Sessions are extracted from the log of a Web server and are accurately identified by a statistical method based on time intervals. Feature vectors are then constructed from the sessions and classified by a trained machine learning model. Because the method works on log sessions rather than on the Webshell file itself, it avoids interference from the complicated encryption and obfuscation techniques used in Webshells, greatly reduces the workload, and maintains high precision and recall. The method mainly comprises a data processing module and a machine learning module.
A data processing module: the module mainly performs feature extraction from log entries, accurate identification of sessions, and construction of feature vectors. It extracts key fields from the log entries, encodes some of the fields, calculates the access time intervals between entries in each session, uses statistics on those intervals to subdivide the sessions further, and constructs the feature vectors.
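The field extraction and the method and referrer encodings can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the common "combined" access-log layout, which the text does not name, and the names `LOG_PATTERN`, `CRAWLER_HINTS`, `parse_entry`, `encode_method` and `encode_referrer` are hypothetical. The text also does not say how a web crawler is recognised or which rule wins when several apply, so the user-agent keyword check and the order of the tests are assumptions.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

# Assumption: "combined" access-log layout; the patent does not name a log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Assumption: crawlers are recognised by keywords in the user-agent string.
CRAWLER_HINTS = ("bot", "spider", "crawler")


def parse_entry(line):
    """Extract the key fields from one log line; return None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Subdivide the request field into method and path (protocol version is dropped).
    parts = entry.pop("request").split()
    entry["method"] = parts[0] if parts else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


def encode_method(method):
    """POST -> 2, GET -> 1, any other method -> 0."""
    return {"POST": 2, "GET": 1}.get(method.upper(), 0)


def encode_referrer(referrer, user_agent, site_host):
    """Empty -> 0, same site -> 1, web crawler -> 2, external site -> 3."""
    if referrer in ("", "-"):
        return 0
    # Assumption: the crawler check takes precedence over the site comparison.
    if any(hint in user_agent.lower() for hint in CRAWLER_HINTS):
        return 2
    return 1 if urlparse(referrer).netloc == site_host else 3
```

For a line whose request is `POST /upload/shell.php HTTP/1.1`, `parse_entry` yields the method `POST` and the path `/upload/shell.php`, and `encode_method` maps that method to 2.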
A machine learning module: the module builds a model with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data, takes the feature vectors constructed by the previous module as input, and judges whether Webshell communication exists in a user's accesses. Once Webshell communication is detected, the anomalous sessions can be pinpointed and the Webshell file can be located accurately with a statistical method.
Drawings
FIG. 1 is a schematic diagram of the framework of the present invention.
FIG. 2 is a schematic diagram of feature extraction in a data processing module according to the present invention.
FIG. 3 is a schematic diagram of session identification in a data processing module of the present invention.
FIG. 4 is a schematic diagram of the machine learning module of the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings and the detailed description. The invention relates to a Webshell detection method based on log sessions, which consists of a data processing module and a machine learning module.
A data processing module: the module is divided into two parts, feature extraction and session identification. Feature extraction first extracts the host, request, time, referrer, status, bytes and user-agent fields from the collected log entries. The request field is subdivided into a method field and a path field, which represent the request method and the request path respectively. The bytes and status fields are used directly to construct the feature vector. The method field is encoded as 2 for POST, 1 for GET, and 0 for any other method. The referrer field is encoded as 0 when its value is null, 1 for a link from the site itself, 3 for a link from an external site, and 2 for a web crawler. The path field is encoded by the relevance of the access paths of the entries in a session. The first entry of each session is coded as -1, because no access precedes it and no relative relationship can be determined. An access to the same file as the preceding entry is coded as 0; an access to a different file is coded by the distance between the two files, calculated from the number of directory switches. Finally, 1 is added to all values.
The session identification part first performs coarse identification by host IP and user-agent. The time differences between consecutive entries in each session are then calculated and collected, and the value at the 70% quantile is taken as the threshold for deciding whether to subdivide a session further (experiments showed the 70% quantile to be the most suitable threshold). That is, in each coarsely identified session, the time interval between every two adjacent entries is compared with the threshold; if the interval is greater than the threshold, the session is split at that point into two smaller sessions, and so on until the sessions are accurately identified.
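The path encoding and the quantile-based session subdivision described above might look like the sketch below. Several details are assumptions, because the text leaves them open: two different files in the same directory are given distance 1, the gaps from all coarse sessions are pooled into one threshold, and the 70% quantile is taken with the nearest-rank rule. All function names are hypothetical, and each entry is assumed to be a dict with `host`, `user_agent` and `time` keys, where `time` is a number of seconds or a `datetime`.

```python
import math
from collections import defaultdict


def path_distance(prev_path, path):
    """0 for the same file, otherwise 1 plus the number of directory switches."""
    a = prev_path.split("?")[0].strip("/").split("/")
    b = path.split("?")[0].strip("/").split("/")
    if a == b:
        return 0
    dirs_a, dirs_b = a[:-1], b[:-1]
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    # Steps up out of the old directory plus steps down into the new one.
    # Assumption: the extra 1 separates "different file, same directory" from "same file".
    return 1 + (len(dirs_a) - common) + (len(dirs_b) - common)


def encode_paths(paths):
    """First entry is -1, later entries use path_distance; then add 1 to all values."""
    if not paths:
        return []
    codes = [-1] + [path_distance(p, q) for p, q in zip(paths, paths[1:])]
    return [c + 1 for c in codes]


def coarse_sessions(entries):
    """Group entries by (host, user_agent) and sort each group by time."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e["host"], e["user_agent"])].append(e)
    return [sorted(g, key=lambda e: e["time"]) for g in groups.values()]


def quantile_threshold(sessions, q=0.7):
    """Nearest-rank q-quantile of the gaps between consecutive entries (pooled)."""
    gaps = sorted(b["time"] - a["time"] for s in sessions for a, b in zip(s, s[1:]))
    if not gaps:
        return None
    return gaps[max(0, math.ceil(q * len(gaps)) - 1)]


def split_session(session, threshold):
    """Start a new sub-session whenever the gap to the previous entry exceeds the threshold."""
    if not session:
        return []
    parts = [[session[0]]]
    for prev, cur in zip(session, session[1:]):
        if cur["time"] - prev["time"] > threshold:
            parts.append([cur])
        else:
            parts[-1].append(cur)
    return parts
```

With entry times of 0, 1, 2, 100, 101 and 500 seconds from one client, the gaps are 1, 1, 98, 1 and 399, the nearest-rank 70% quantile is 98, and only the 399-second gap exceeds it, so the coarse session is split into two.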
A machine learning module: a model is built with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data. The preprocessed feature vectors are fed into the model, which judges whether Webshell communication exists in a user's accesses. Once Webshell communication is detected, the anomalous sessions can be pinpointed and the Webshell file can be located accurately with a statistical method.
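The text does not specify which statistical method locates the Webshell file once sessions have been flagged. As one hypothetical possibility only, the sketch below scores each requested path by the share of the sessions containing it that the classifier flagged, and ranks paths by that score. The function name `rank_suspect_files`, the scoring rule, and the input shape (each session as a list of paths, with a parallel list of boolean flags from the model) are all assumptions made for illustration.

```python
from collections import Counter


def rank_suspect_files(sessions, flags):
    """Rank paths by the fraction of their sessions that were flagged as anomalous.

    Assumption: this ratio is only an illustrative statistic, not the
    method used in the patent.
    """
    seen = Counter()     # number of sessions in which each path appears
    flagged = Counter()  # number of flagged sessions in which each path appears
    for session, is_flagged in zip(sessions, flags):
        for path in set(session):
            seen[path] += 1
            if is_flagged:
                flagged[path] += 1
    scores = {path: flagged[path] / seen[path] for path in seen}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A path that appears only in flagged sessions receives a score of 1.0 and is listed first, while a path shared by normal and flagged sessions, such as a site's home page, receives a lower score.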

Claims (3)

1. A Webshell detection method based on log sessions, characterized in that: the method comprises a data processing module and a machine learning module.
2. The Webshell detection method based on log sessions as recited in claim 1, wherein the data processing comprises the following specific steps:
A. extracting the log generated while the Web server operates;
B. extracting the host IP, request, time, referrer, status, bytes and user-agent fields from each log entry by regular expression matching;
C. subdividing the request field into a method and a path, which represent the request method and the request path respectively;
D. encoding the method field: 2 when it is POST, 1 when it is GET, and 0 otherwise;
E. encoding the referrer field: 0 when the value is null, 1 for a link from the site itself, 3 for a link from an external site, and 2 for a web crawler;
F. encoding the path field by the relevance of the access paths of the entries in a session: the first entry of each session is coded as -1, because no access precedes it and no relative relationship can be determined; an access to the same file is coded as 0; an access to a different file is coded by the distance between the files, calculated from the number of directory switches; finally, 1 is added to all values;
G. performing coarse session identification by host IP and user-agent;
H. calculating and collecting the time differences between the entries in each session, and taking the value at the 70% quantile as the threshold for deciding whether to subdivide the session further (experiments showed the 70% quantile to be the most suitable threshold);
I. in each coarsely identified session, comparing the time interval between every two adjacent entries with the threshold and, if it is greater than the threshold, splitting the session into two smaller sessions, and so on until the sessions are accurately identified.
3. The Webshell detection method based on log sessions as claimed in claim 1, wherein the machine learning comprises the following specific steps:
A. building a model with a long short-term memory (LSTM) neural network, which is commonly used for processing sequence data, inputting the preprocessed feature vectors, and judging whether Webshell communication exists in a user's accesses;
B. once Webshell communication is detected, pinpointing the anomalous sessions and accurately locating the Webshell file with a statistical method.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910278375.7A CN110719313A (en) 2019-04-09 2019-04-09 Webshell detection method based on log session

Publications (1)

Publication Number Publication Date
CN110719313A (en) 2020-01-21

Family

ID=69208769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910278375.7A Pending CN110719313A (en) 2019-04-09 2019-04-09 Webshell detection method based on log session

Country Status (1)

Country Link
CN (1) CN110719313A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112738109A (en) * 2020-12-30 2021-04-30 杭州迪普科技股份有限公司 Web attack detection method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101348590B1 (en) * 2012-12-12 2014-01-09 (주)론스텍 Method for breaking webshell using network in-line filtering
CN107026821A (en) * 2016-02-01 2017-08-08 阿里巴巴集团控股有限公司 The processing method and processing device of message
CN108156131A (en) * 2017-10-27 2018-06-12 上海观安信息技术股份有限公司 Webshell detection methods, electronic equipment and computer storage media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
彭薇: "网站Web日志数据预处理模型的建立", 《企业科技与发展》 *
石刘洋等: "基于Web日志的Webshell检测方法研究", 《信息安全研究》 *

Similar Documents

Publication Publication Date Title
CN112738039B (en) Malicious encrypted flow detection method, system and equipment based on flow behavior
CN109714322B (en) Method and system for detecting network abnormal flow
CN106961419B (en) WebShell detection method, device and system
CN107888554B (en) Method and device for detecting server attack
CN107888571B (en) Multi-dimensional webshell intrusion detection method and system based on HTTP log
CN101686239B (en) Trojan discovery system
CN110611640A (en) DNS protocol hidden channel detection method based on random forest
CN103748853A (en) Method and system for classifying a protocol message in a data communication network
CN110392013A (en) A kind of Malware recognition methods, system and electronic equipment based on net flow assorted
CN112953971A (en) Network security traffic intrusion detection method and system
CN113704328B (en) User behavior big data mining method and system based on artificial intelligence
CN113079150B (en) Intrusion detection method for power terminal equipment
CN114785563A (en) Encrypted malicious flow detection method for soft voting strategy
CN112464295A (en) Communication maintenance safety device based on electric power edge gateway equipment
CN111464510A (en) Network real-time intrusion detection method based on rapid gradient lifting tree model
CN113132329A (en) WEBSHELL detection method, device, equipment and storage medium
CN107888576B (en) Anti-collision library safety risk control method using big data and equipment fingerprints
CN110719313A (en) Webshell detection method based on log session
CN115051874B (en) Multi-feature CS malicious encrypted traffic detection method and system
CN116418587A (en) Data cross-domain switching behavior audit trail method and data cross-domain switching system
CN111371727A (en) Detection method for NTP protocol covert communication
CN112597498A (en) Webshell detection method, system and device and readable storage medium
CN117240598B (en) Attack detection method, attack detection device, terminal equipment and storage medium
CN115174270B (en) Behavior abnormity detection method, device, equipment and medium
CN112118089B (en) Webshell monitoring method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200121