WO2022032917A1 - RNN-based Webshell detection method and device - Google Patents

RNN-based Webshell detection method and device

Info

Publication number
WO2022032917A1
Authority
WO
WIPO (PCT)
Prior art keywords
rnn
detection method
source file
webshell detection
gru model
Prior art date
Application number
PCT/CN2020/130234
Other languages
English (en)
French (fr)
Inventor
张秀华
Original Assignee
紫光云(南京)数字技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 紫光云(南京)数字技术有限公司
Publication of WO2022032917A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Definitions

  • The present invention relates to the field of Internet technologies and, in particular, to an RNN-based Webshell detection method and device.
  • A WebShell is a command execution environment that exists in the form of web page files such as asp, php, jsp, or cgi; it can also be called a web page backdoor. After intruding into a website, an attacker usually mixes these asp or php backdoor files with the normal web page files in the web directory of the website server, and then accesses the backdoors with a browser to obtain a command execution environment and thereby control the website server. This is the webshell file upload attack.
  • Webshells can be divided into two categories: small webshells (小马, literally "pony") and large webshells (大马, literally "big horse").
  • A small webshell has little source code, usually from a few lines to a few dozen lines, and its functions are mainly file uploading, executing command-line programs, and so on.
  • A large webshell ranges in file size from a few KB to several hundred KB, or even more than 1 MB.
  • Its functions are complex, including executing command-line programs, uploading files, privilege escalation, port scanning, database operations, etc.
  • In addition, a large webshell needs the cooperation of other source files, operating in coordination, to complete its functions and achieve the attack.
  • The common measures currently used to guard against webshell file upload attacks are: 1) setting the upload directory to be non-executable; 2) determining the file type and applying access control in combination with a whitelist; 3) rewriting the file name and file type with random numbers to increase the cost of attack; 4) setting a separate domain name for the file server.
  • The second measure, determining the file type, generally relies on the MIME type, suffix checks, magic-number matching on the message type, and similar means. Hackers can easily bypass it by modifying the suffix, appending a Trojan to a legitimate file, and other methods.
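To make the weakness of file-type checks concrete, the following is a minimal sketch of a suffix-plus-magic-number check and a file that passes it while still carrying script text. The function name `passes_naive_upload_check`, the GIF example, and the placeholder payload are illustrative assumptions; the patent does not describe any particular check.

```python
# Hypothetical naive upload filter: allowed suffix plus a matching magic number.
GIF_MAGIC = (b"GIF87a", b"GIF89a")


def passes_naive_upload_check(filename: str, data: bytes) -> bool:
    """Return True when the file name ends in .gif and the bytes start with a GIF signature."""
    has_allowed_suffix = filename.lower().endswith(".gif")
    has_gif_magic = data[:6] in GIF_MAGIC
    return has_allowed_suffix and has_gif_magic


# A file that begins with a valid GIF signature but has script text appended.
# The PHP fragment is an inert placeholder, not working code.
disguised = b"GIF89a" + b"\n<?php /* placeholder for appended script code */ ?>"

print(passes_naive_upload_check("avatar.gif", disguised))  # True: the check is satisfied
print(b"<?php" in disguised)                               # True: script text is still present
```

Because the check only inspects the name and the first few bytes, anything after the signature goes unexamined, which is the gap a content-based detector is meant to close.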
  • In view of the problems in the related art, the present invention proposes an RNN-based Webshell detection method and device, suitable for server-side cyberspace security protection in all kinds of distributed computing used by computer practitioners.
  • It is an efficient method for detecting Webshells, intended to overcome the above technical problems in the related art.
  • According to one aspect of the present invention, an RNN-based Webshell detection method is provided, comprising the following steps:
  • S1: preprocess the source file by a preset method to obtain keywords; S2: construct a gated recurrent unit (GRU) model by a preset rule and train it; S3: discriminate the source file with the GRU model.
  • Further, preprocessing the source file by the preset method in S1 to obtain keywords includes the following steps: S11: segment the source file using preset delimiters to obtain a segmentation result; S12: extract keywords from the segmentation result using the term frequency-inverse document frequency (TF-IDF) algorithm.
  • Further, the preset delimiters in S11 comprise non-letter characters and non-digit characters, and the string lengths of the non-letter characters and the non-digit characters are both between 3 and 15.
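  • Further, the calculation formula of the TF-IDF algorithm in S12 is shown in Figure PCTCN2020130234-appb-000001,
  • where i denotes the word, j the document, tf_{i,j} the frequency of word i in document j, df_i the number of documents containing word i, and N the total number of documents.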
  • Further, constructing and training the GRU model by the preset rule in S2 includes the following steps: S21: obtain the calculation equations of the reset gate and the update gate of the GRU model; S22: obtain the output calculation equations of the GRU model from the reset gate and update gate equations.
  • The reset gate equation in S21 is shown in Figure PCTCN2020130234-appb-000002 and the update gate equation in Figure PCTCN2020130234-appb-000003,
  • where σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
  • The output equations of the GRU model in S22 are shown in Figures PCTCN2020130234-appb-000004 and PCTCN2020130234-appb-000005,
  • where the symbol ⊙ denotes element-wise multiplication and the remaining symbols are as defined above.
  • Further, discriminating the source file with the GRU model in S3 includes the following steps:
  • S31: input the keywords into the trained GRU model; S32: the GRU model determines from the keywords whether the source file is a webshell, that is, a command execution environment.
  • According to another aspect of the present invention, an electronic device is provided. The electronic device includes a memory and a processor; the memory stores an RNN-based Webshell detection program that can run on the processor, and the program is executed by the processor to implement the steps of the above RNN-based Webshell detection method.
  • The beneficial effects are as follows: the present invention approximates each sample by its keyword set. By extracting keywords and representing the sample with the corresponding keyword set, useless noise in the sample is effectively excluded.
  • Compared with traditional, commonly used machine learning algorithms, the present invention can extract deep-level features, which not only improves detection accuracy but also reduces the false positive rate and the false negative rate, so that webshells are detected more effectively.
  • FIG. 1 is a flowchart of an RNN-based Webshell detection method according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the information flow of the GRU model in an RNN-based Webshell detection method according to an embodiment of the present invention.
  • According to embodiments of the present invention, an RNN-based Webshell detection method and apparatus are provided.
  • An RNN (Recurrent Neural Network) is a deep learning method with a cyclic structure that is widely used in speech recognition, machine translation, text processing, and other fields.
  • There are many choices for the recurrent structure in an RNN; the present invention selects the GRU (Gated Recurrent Unit).
  • GRU is one of the many variants of LSTM (Long Short-Term Memory) and has been widely used in many fields.
  • As shown in FIGS. 1-2, according to one embodiment of the present invention, an RNN-based Webshell detection method is provided, including the following steps:
  • S1: preprocess the source file by a preset method to obtain keywords. The preprocessing is intended to improve efficiency and recognition accuracy, and includes word segmentation and keyword extraction. S1 begins with S11: segment the source file using preset delimiters to obtain a segmentation result.
  • Word segmentation splits the text data into words that carry some practical meaning. PHP (Hypertext Preprocessor) source code contains both English and Chinese, along with a large number of operators and punctuation marks. Chinese characters generally appear in comments or variable values, while English characters form the main body of the program code, so non-letter and non-digit characters are used as delimiters. Statistics on the relevant code show that strings are mainly concentrated in short strings of length less than 15, and that strings of length less than 4 generally have no practical meaning. Therefore, only strings with a length between 3 and 15 are kept in the segmentation data set.
  • Specifically, the preset delimiters in S11 comprise non-letter characters and non-digit characters, and the string lengths of the non-letter characters and the non-digit characters are both between 3 and 15.
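The segmentation step S11 can be sketched as a regular-expression split followed by a length filter. This is only an illustration: the function name `segment`, the restriction to ASCII letters and digits, and the inclusive bounds 3 and 15 are assumptions, since the text gives the length range without stating whether the endpoints are included.

```python
import re
from typing import List

# Any run of characters that is neither an ASCII letter nor a digit acts as a delimiter
# (assumption: "letters" means ASCII letters, so Chinese comment text is split away).
DELIMITERS = re.compile(r"[^A-Za-z0-9]+")


def segment(source: str, min_len: int = 3, max_len: int = 15) -> List[str]:
    """Split source text on delimiters and keep tokens whose length is within the bounds."""
    tokens = DELIMITERS.split(source)
    return [t for t in tokens if min_len <= len(t) <= max_len]


sample = "<?php $cmd = $_GET['c']; system($cmd); // 执行命令 ?>"
print(segment(sample))  # ['php', 'cmd', 'GET', 'system', 'cmd']
```

Note that the underscore counts as a delimiter here, so `$_GET` yields `GET`, and the one-character token `c` is dropped by the length filter.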
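  • S12: extract keywords from the segmentation result using the TF-IDF (term frequency-inverse document frequency) algorithm. For reasons of efficiency, and because not every word helps identification, the segmentation result must be filtered.
  • Webshell source files mainly call system functions, including file operations and the execution of command-line programs. In normal source files, names generally convey meaning, whereas this characteristic is not obvious in webshell files. Therefore, the TF-IDF algorithm can be used effectively for keyword extraction.
  • Specifically, the calculation formula of the TF-IDF algorithm in S12 is shown in Figure PCTCN2020130234-appb-000006,
  • where i denotes the word, j the document, tf_{i,j} the frequency of word i in document j, df_i the number of documents containing word i, and N the total number of documents.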
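The formula itself appears only as an image in the source, so the sketch below assumes the common form w_{i,j} = tf_{i,j} · log(N / df_i), which uses exactly the symbols defined above. The function name `tfidf_keywords`, the length-normalised term frequency, and the `top_k` cut-off are illustrative assumptions, not details taken from the patent.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], top_k: int = 3) -> List[List[str]]:
    """Rank the words of each tokenised document by tf * log(N / df) and keep the top_k."""
    n_docs = len(docs)
    # df[w]: number of documents that contain word w at least once
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # tf is the within-document relative frequency (an assumed normalisation)
        scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        # Sort by descending score; break ties alphabetically for determinism
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_k])
    return result


docs = [
    ["php", "system", "cmd", "cmd"],
    ["php", "echo", "title"],
    ["php", "echo", "footer"],
]
print(tfidf_keywords(docs, top_k=2))
# [['cmd', 'system'], ['title', 'echo'], ['footer', 'echo']]
```

A word such as `php` that occurs in every document gets a score of zero and drops out, while words concentrated in one file rise to the top.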
  • In addition, taken as a whole, every part of a source file serves the program in realizing its functions, and the parts complement one another.
  • The program code realizes the function, while comments supplement the program code; they depend on the specific program and exist to improve readability and record important information.
  • Comments differ from one source file to another. Therefore, in S1, all information in the source file, including comments, is retained before the source file is segmented.
  • In this embodiment, the GRU used is a simplification of the complex structure of the vanilla LSTM, that is, the long short-term memory artificial neural network.
  • The original LSTM has three gates (input gate, output gate, and forget gate), whereas the GRU has only two: the update gate and the reset gate.
  • In addition, the LSTM cyclically updates a cell state, while the GRU removes the cell and relies more directly on addition and multiplication operations on the GRU output h, as detailed in the following steps:
  • S2: construct the GRU model by a preset rule and train it. S2 includes the following steps:
  • S21: obtain the calculation equations of the reset gate and the update gate of the GRU model. Specifically, the reset gate equation is shown in Figure PCTCN2020130234-appb-000007 and the update gate equation in Figure PCTCN2020130234-appb-000008,
  • where σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
  • S22: obtain the output calculation equations of the GRU model from the reset gate and update gate equations. Specifically, the output equations are shown in Figures PCTCN2020130234-appb-000009 and PCTCN2020130234-appb-000010,
  • where the symbol ⊙ denotes element-wise multiplication and the remaining symbols are as defined above.
  • In this embodiment, the information flow in the GRU structure is given by equations ①-④, and the structure is shown in FIG. 2. The reset gate and the update gate both depend on the output at the previous moment and the input at the current moment, and the two are combined additively.
  • In output equations ③ and ④, the two gates use their range to limit how much of the information from the previous moment, that is, the previous output, passes through. Because of the gates, part of the information flows into the output at the current moment, and the remaining information is discarded by the current output.
  • In addition, equation ④ is a weighted average of the intermediate state and the output at the previous moment, leaning toward one of the two depending on the value of the gate.
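The patent's equations ①-④ are available here only as image references, so the following is a sketch of one widely used GRU formulation rather than a transcription of them. The parameter names (`W_r`, `U_r`, and so on), the omission of bias terms, and the choice that the update gate z weights the candidate state are all assumptions; some formulations swap the roles of z and 1 - z.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def hadamard(a, b):
    """Element-wise product, written as ⊙ in the text."""
    return [p * q for p, q in zip(a, b)]


def gru_step(x, h_prev, p):
    """One GRU time step; p maps names to input (W_*) and recurrent (U_*) weight matrices."""
    r = sigmoid(add(matvec(p["W_r"], x), matvec(p["U_r"], h_prev)))  # reset gate
    z = sigmoid(add(matvec(p["W_z"], x), matvec(p["U_z"], h_prev)))  # update gate
    # Intermediate (candidate) state: the reset gate scales the previous output
    pre = add(matvec(p["W_h"], x), matvec(p["U_h"], hadamard(r, h_prev)))
    h_cand = [math.tanh(a) for a in pre]
    # Weighted average of the previous output and the candidate state
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new output is half of the previous output.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W_h", "U_h")}
print(gru_step([1.0, -1.0], [0.2, -0.4], params))  # [0.1, -0.2]
```

Both gates take the sum of an input term and a recurrent term, matching the additive relationship described above, and the last line of `gru_step` is the weighted average that the gate value tilts toward one side.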
  • S3: discriminate the source file with the GRU model. S3 includes the following steps:
  • S31: input the keywords into the trained GRU model;
  • S32: the GRU model determines from the keywords whether the source file is a webshell, that is, a command execution environment.
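The text does not say how keywords are turned into numeric model input in S31. One common approach, shown here purely as an assumption, is to map each keyword to an integer ID and pad or truncate to a fixed length; the names `build_vocab` and `encode` and the reserved IDs are hypothetical.

```python
# Reserved IDs (assumed convention): 0 pads short sequences, 1 marks unseen keywords.
PAD_ID, UNK_ID = 0, 1


def build_vocab(keyword_lists):
    """Assign a stable integer ID to every keyword seen in the training files."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for keywords in keyword_lists:
        for kw in keywords:
            vocab.setdefault(kw, len(vocab))
    return vocab


def encode(keywords, vocab, max_len):
    """Convert one file's keywords to a fixed-length ID sequence."""
    ids = [vocab.get(kw, UNK_ID) for kw in keywords[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


train_keywords = [["cmd", "system"], ["title", "echo"]]
vocab = build_vocab(train_keywords)
print(encode(["system", "eval", "cmd"], vocab, max_len=5))  # [3, 1, 2, 0, 0]
```

Each ID would then typically index an embedding vector that serves as the input x at one time step of the GRU.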
  • According to another embodiment of the present invention, an electronic device is also provided.
  • In this embodiment, the electronic device may be a computer or a server.
  • The electronic device includes at least a memory, a processor, a communication bus, and a network interface.
  • The memory includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, or an optical disk.
  • In some embodiments, the memory may be an internal storage unit of the electronic device, such as its hard disk.
  • In other embodiments, the memory may be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card fitted to the electronic device.
  • Further, the memory may include both an internal storage unit of the electronic device and an external storage device.
  • The memory can be used not only to store application software installed in the electronic device and various types of data, such as the code of the RNN-based Webshell detection program, but also to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code or process data stored in the memory.
  • The communication bus provides the connection and communication between these components.
  • The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device and other electronic devices.
  • Optionally, the electronic device may further include a user interface, which may include a display and an input unit such as a keyboard; the user interface may optionally also include a standard wired interface and a wireless interface.
  • Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • The display may also be called a display screen or display unit, and is used to display information processed in the electronic device and to display a visual user interface.
  • The electronic device includes a memory and a processor; the memory stores an RNN-based Webshell detection program that can run on the processor, and when the processor executes the program stored in the memory, the following steps are implemented:
  • The source file is preprocessed by a preset method to obtain keywords. This includes: first, segmenting the source file using preset delimiters to obtain a segmentation result; then, extracting keywords from the segmentation result using the TF-IDF algorithm.
  • The GRU model is constructed by a preset rule and trained. This includes: first, obtaining the calculation equations of the reset gate and the update gate of the GRU model; then, obtaining the output calculation equations of the GRU model from those equations.
  • The source file is discriminated with the GRU model. This includes: first, inputting the keywords into the trained GRU model; then, the GRU model determining from the keywords whether the source file is a webshell, that is, a command execution environment.
  • In summary, with the above technical solution, the present invention approximates each sample by its keyword set: by extracting keywords and representing the sample with the corresponding keyword set, useless noise in the sample is effectively excluded.
  • Compared with traditional, commonly used machine learning algorithms, the present invention can extract deep-level features, which not only improves detection accuracy but also reduces the false positive rate and the false negative rate, so that webshells are detected more effectively.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

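An RNN-based Webshell detection method and device. The method comprises the following steps: S1, preprocessing a source file by a preset method to obtain keywords; S2, constructing a gated recurrent unit (GRU) model by a preset rule and training it; S3, discriminating the source file with the GRU model. By approximating each sample with its keyword set, that is, extracting keywords and representing the sample with the corresponding keyword set, the method effectively excludes useless noise in the sample. Compared with traditional, commonly used machine learning algorithms, it can extract deep-level features, which not only improves detection accuracy but also reduces the false positive rate and the false negative rate, so that Webshells are detected more effectively.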

Description

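RNN-based Webshell detection method and device
Technical Field
The present invention relates to the field of Internet technologies and, in particular, to an RNN-based Webshell detection method and device.
Background
A WebShell is a command execution environment that exists in the form of web page files such as asp, php, jsp, or cgi; it can also be called a web page backdoor. After intruding into a website, an attacker usually mixes these asp or php backdoor files with the normal web page files in the web directory of the website server, and then accesses the backdoors with a browser to obtain a command execution environment and thereby control the website server. This is the webshell file upload attack.
Webshells can be divided into two categories: small webshells (小马) and large webshells (大马). A small webshell has little source code, usually from a few lines to a few dozen lines, and its functions are mainly file uploading, executing command-line programs, and so on. A large webshell ranges in file size from a few KB to several hundred KB, or even more than 1 MB, and its functions are complex, including executing command-line programs, uploading files, privilege escalation, port scanning, and database operations. In addition, a large webshell needs the cooperation of other source files, operating in coordination, to complete its functions and achieve the attack.
The common measures currently used to guard against webshell file upload attacks are: 1) setting the upload directory to be non-executable; 2) determining the file type and applying access control in combination with a whitelist; 3) rewriting the file name and file type with random numbers to increase the cost of attack; 4) setting a separate domain name for the file server. The second measure, determining the file type, generally relies on the MIME type, suffix checks, magic-number matching on the message type, and similar means. Hackers can easily bypass it by modifying the suffix, appending a Trojan to a legitimate file, and other methods.
No effective solution has yet been proposed for the problems in the related art.
Summary of the Invention
In view of the problems in the related art, the present invention proposes an RNN-based Webshell detection method and device, suitable for server-side cyberspace security protection in all kinds of distributed computing used by computer practitioners. It is an efficient method for detecting Webshells, intended to overcome the above technical problems in the related art.
To this end, the specific technical solution adopted by the present invention is as follows: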
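According to one aspect of the present invention, an RNN-based Webshell detection method is provided, comprising the following steps:
S1: preprocess the source file by a preset method to obtain keywords;
S2: construct a gated recurrent unit (GRU) model by a preset rule and train it;
S3: discriminate the source file with the GRU model.
Further, preprocessing the source file by the preset method in S1 to obtain keywords includes the following steps:
S11: segment the source file using preset delimiters to obtain a segmentation result;
S12: extract keywords from the segmentation result using the term frequency-inverse document frequency (TF-IDF) algorithm.
Further, in S1, all information in the source file is retained before the source file is segmented.
Further, the preset delimiters in S11 comprise non-letter characters and non-digit characters, and the string lengths of the non-letter characters and the non-digit characters are both between 3 and 15.
Further, the calculation formula of the TF-IDF algorithm in S12 is:
Figure PCTCN2020130234-appb-000001
where i denotes the word, j the document, tf_{i,j} the frequency of word i in document j, df_i the number of documents containing word i, and N the total number of documents.
Further, constructing and training the GRU model by the preset rule in S2 includes the following steps:
S21: obtain the calculation equations of the reset gate and the update gate of the GRU model;
S22: obtain the output calculation equations of the GRU model from the reset gate and update gate equations.
Further, the reset gate equation in S21 is:
Figure PCTCN2020130234-appb-000002
The update gate equation is:
Figure PCTCN2020130234-appb-000003
where σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
Further, the output equations of the GRU model in S22 are:
Figure PCTCN2020130234-appb-000004
Figure PCTCN2020130234-appb-000005
where the symbol ⊙ denotes element-wise multiplication; σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
Further, discriminating the source file with the GRU model in S3 includes the following steps:
S31: input the keywords into the trained GRU model;
S32: the GRU model determines from the keywords whether the source file is a webshell, that is, a command execution environment.
According to another aspect of the present invention, an electronic device is also provided. The electronic device includes a memory and a processor; the memory stores an RNN-based Webshell detection program that can run on the processor, and the program is executed by the processor to implement the steps of the above RNN-based Webshell detection method.
The beneficial effects are as follows: the present invention approximates each sample by its keyword set. By extracting keywords and representing the sample with the corresponding keyword set, useless noise in the sample is effectively excluded. Compared with traditional, commonly used machine learning algorithms, the present invention can extract deep-level features, which not only improves detection accuracy but also reduces the false positive rate and the false negative rate, so that webshells are detected more effectively.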
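Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an RNN-based Webshell detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the information flow of the GRU model in an RNN-based Webshell detection method according to an embodiment of the present invention.
Detailed Description
To further illustrate the embodiments, the present invention provides drawings. These drawings are part of the disclosure of the present invention; they mainly illustrate the embodiments and, together with the related description, explain how the embodiments operate. With reference to them, a person of ordinary skill in the art should understand other possible implementations and the advantages of the present invention. Components in the drawings are not drawn to scale, and similar reference symbols generally denote similar components.
According to embodiments of the present invention, an RNN-based Webshell detection method and device are provided. An RNN (Recurrent Neural Network) is a deep learning method with a cyclic structure that is widely used in speech recognition, machine translation, text processing, and other fields. There are many choices for the recurrent structure in an RNN; the present invention selects the GRU (Gated Recurrent Unit). GRU is one of the many variants of LSTM (Long Short-Term Memory) and has been widely used in many fields.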
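The present invention is further described with reference to the drawings and specific embodiments. As shown in FIGS. 1-2, according to one embodiment of the present invention, an RNN-based Webshell detection method is provided, including the following steps:
S1: preprocess the source file by a preset method to obtain keywords. Specifically, the preprocessing processes the source file to improve efficiency and recognition accuracy, and includes word segmentation, keyword extraction, and so on.
S1 includes the following steps:
S11: segment the source file using preset delimiters to obtain a segmentation result;
In this embodiment, word segmentation splits the text data into words that carry some practical meaning. PHP (Hypertext Preprocessor) source code contains both English and Chinese, along with a large number of operators and punctuation marks. Chinese characters generally appear in comments or variable values, while English characters form the main body of the program code, so non-letter and non-digit characters are used as delimiters. Statistics on the relevant code show that strings are mainly concentrated in short strings of length less than 15, and that strings of length less than 4 generally have no practical meaning. Therefore, only strings with a length between 3 and 15 are kept in the segmentation data set.
Specifically, the preset delimiters in S11 comprise non-letter characters and non-digit characters, and the string lengths of the non-letter characters and the non-digit characters are both between 3 and 15.
S12: extract keywords from the segmentation result using the term frequency-inverse document frequency (TF-IDF) algorithm.
In this embodiment, for reasons of efficiency, and because not every word helps identification, the segmentation result must be filtered during keyword extraction. This document uses the TF-IDF (term frequency-inverse document frequency) algorithm to extract keywords; the algorithm is based on term frequency and inverse document frequency. Webshell source files mainly call system functions, including file operations and the execution of command-line programs. In normal source files, names generally convey meaning, whereas this characteristic is not obvious in webshell files. Therefore, the TF-IDF algorithm can be used effectively for keyword extraction.
Specifically, the calculation formula of the TF-IDF algorithm in S12 is:
Figure PCTCN2020130234-appb-000006
where i denotes the word, j the document, tf_{i,j} the frequency of word i in document j, df_i the number of documents containing word i, and N the total number of documents.
In addition, taken as a whole, every part of a source file serves the program in realizing its functions, and the parts complement one another. The program code realizes the function, while comments supplement the program code; they depend on the specific program and exist to improve readability and record important information. Comments differ from one source file to another. Therefore, in S1, all information in the source file, including comments, is retained before the source file is segmented.
In this embodiment, the GRU used is a simplification of the complex structure of the vanilla LSTM, that is, the long short-term memory artificial neural network. The original LSTM has three gates (input gate, output gate, and forget gate), whereas the GRU has only two: the update gate and the reset gate. In addition, the LSTM cyclically updates a cell state, while the GRU removes the cell and relies more directly on addition and multiplication operations on the GRU output h, as detailed below: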
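S2: construct the GRU model by a preset rule and train it;
S2 includes the following steps:
S21: obtain the calculation equations of the reset gate and the update gate of the GRU model;
Specifically, the reset gate equation in S21 is:
Figure PCTCN2020130234-appb-000007
The update gate equation is:
Figure PCTCN2020130234-appb-000008
where σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
S22: obtain the output calculation equations of the GRU model from the reset gate and update gate equations.
Specifically, the output equations of the GRU model in S22 are:
Figure PCTCN2020130234-appb-000009
Figure PCTCN2020130234-appb-000010
where the symbol ⊙ denotes element-wise multiplication; σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices. As the above equations show, the outputs of the GRU at adjacent moments are closely linked: the output at the previous moment runs through the entire computation of the output at the current moment. For a classification problem of this kind, the output of the last layer of the GRU at the last moment is usually taken, namely
Figure PCTCN2020130234-appb-000011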
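As a rough illustration of taking the last layer's output at the last time step, the sketch below stacks GRU layers and feeds that final output into a logistic output unit. The GRU equations follow a common formulation rather than the patent's equation images, and the names `gru_layer`, `classify`, `zero_layer`, the output weights `w_out` and `b_out`, and the sigmoid read-out are all assumptions.

```python
import math


def _sig(a):
    return 1.0 / (1.0 + math.exp(-a))


def _mv(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def gru_layer(xs, p, hidden):
    """Run one GRU layer over a sequence and return its output at every time step."""
    h, outputs = [0.0] * hidden, []
    for x in xs:
        r = [_sig(a + b) for a, b in zip(_mv(p["W_r"], x), _mv(p["U_r"], h))]  # reset gate
        z = [_sig(a + b) for a, b in zip(_mv(p["W_z"], x), _mv(p["U_z"], h))]  # update gate
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(_mv(p["W_h"], x), _mv(p["U_h"], rh))]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
        outputs.append(h)
    return outputs


def classify(xs, layers, w_out, b_out, hidden):
    """Return a score in [0, 1] computed from the last layer's output at the last time step."""
    seq = xs
    for p in layers:  # each layer consumes the output sequence of the layer below
        seq = gru_layer(seq, p, hidden)
    h_last = seq[-1]  # last layer, last time step
    return _sig(sum(w * a for w, a in zip(w_out, h_last)) + b_out)


def zero_layer(n_in, hidden):
    """Untrained placeholder parameters, used only to show the shapes involved."""
    p = {k: [[0.0] * n_in for _ in range(hidden)] for k in ("W_r", "W_z", "W_h")}
    p.update({k: [[0.0] * hidden for _ in range(hidden)] for k in ("U_r", "U_z", "U_h")})
    return p


layers = [zero_layer(3, 2), zero_layer(2, 2)]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(classify(xs, layers, w_out=[0.0, 0.0], b_out=0.0, hidden=2))  # 0.5
```

With untrained all-zero parameters the score is exactly 0.5; after training, a threshold on this score would separate webshell files from normal ones.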
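In this embodiment, the information flow in the GRU structure is given by equations ①-④ and the structure is shown in FIG. 2, where the reset gate and the update gate both depend on the output at the previous moment
Figure PCTCN2020130234-appb-000012
and the input at the current moment
Figure PCTCN2020130234-appb-000013
and the two are combined additively. In output equations ③ and ④ above, the two gates use their range to limit how much of the information from the previous moment passes through, that is, the previous output
Figure PCTCN2020130234-appb-000014
Because of the gates, part of the information flows into the output at the current moment
Figure PCTCN2020130234-appb-000015
while the remaining information is discarded by the current output. In addition, equation ④ is a weighted average of the intermediate state
Figure PCTCN2020130234-appb-000016
and
Figure PCTCN2020130234-appb-000017
leaning toward one of the two depending on the value of the gate.
S3: discriminate the source file with the GRU model.
S3 includes the following steps:
S31: input the keywords into the trained GRU model;
S32: the GRU model determines from the keywords whether the source file is a webshell, that is, a command execution environment.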
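According to another embodiment of the present invention, an electronic device is also provided.
In this embodiment, the electronic device may be a computer or a server. The electronic device includes at least a memory, a processor, a communication bus, and a network interface.
The memory includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory may be an internal storage unit of the electronic device, such as its hard disk. In other embodiments, the memory may be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card fitted to the electronic device. Further, the memory may include both an internal storage unit of the electronic device and an external storage device. The memory can be used not only to store application software installed in the electronic device and various types of data, such as the code of the RNN-based Webshell detection program, but also to temporarily store data that has been output or is about to be output.
In some embodiments, the processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code or process data stored in the memory.
The communication bus provides the connection and communication between these components.
The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further include a user interface, which may include a display and an input unit such as a keyboard; the user interface may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display information processed in the electronic device and to display a visual user interface.
The electronic device includes a memory and a processor; the memory stores an RNN-based Webshell detection program that can run on the processor, and when the processor executes the program stored in the memory, the following steps are implemented:
The source file is preprocessed by a preset method to obtain keywords. This includes: first, segmenting the source file using preset delimiters to obtain a segmentation result; then, extracting keywords from the segmentation result using the TF-IDF algorithm.
The GRU model is constructed by a preset rule and trained. This includes: first, obtaining the calculation equations of the reset gate and the update gate of the GRU model; then, obtaining the output calculation equations of the GRU model from those equations.
The source file is discriminated with the GRU model. This includes: first, inputting the keywords into the trained GRU model; then, the GRU model determining from the keywords whether the source file is a webshell, that is, a command execution environment.
In summary, with the above technical solution, the present invention approximates each sample by its keyword set: by extracting keywords and representing the sample with the corresponding keyword set, useless noise in the sample is effectively excluded. Compared with traditional, commonly used machine learning algorithms, the present invention can extract deep-level features, which not only improves detection accuracy but also reduces the false positive rate and the false negative rate, so that webshells are detected more effectively.
It should be noted that the serial numbers of the above embodiments are for description only and do not represent the merits of the embodiments. The terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes it.
From the above description of the implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.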

Claims (10)

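  1. An RNN-based Webshell detection method, comprising the following steps:
    S1: preprocessing a source file by a preset method to obtain keywords;
    S2: constructing a gated recurrent unit (GRU) model by a preset rule and training it;
    S3: discriminating the source file with the GRU model.
  2. The RNN-based Webshell detection method according to claim 1, wherein preprocessing the source file by the preset method in S1 to obtain keywords comprises the following steps:
    S11: segmenting the source file using preset delimiters to obtain a segmentation result;
    S12: extracting keywords from the segmentation result using the term frequency-inverse document frequency (TF-IDF) algorithm.
  3. The RNN-based Webshell detection method according to claim 2, wherein, in S1, all information in the source file is retained before the source file is segmented.
  4. The RNN-based Webshell detection method according to claim 2, wherein the preset delimiters in S11 comprise non-letter characters and non-digit characters, and the string lengths of the non-letter characters and the non-digit characters are both between 3 and 15.
  5. The RNN-based Webshell detection method according to claim 2, wherein the calculation formula of the TF-IDF algorithm in S12 is:
    Figure PCTCN2020130234-appb-100001
    where i denotes the word, j the document, tf_{i,j} the frequency of word i in document j, df_i the number of documents containing word i, and N the total number of documents.
  6. The RNN-based Webshell detection method according to claim 1, wherein constructing and training the GRU model by the preset rule in S2 comprises the following steps:
    S21: obtaining the calculation equations of the reset gate and the update gate of the GRU model;
    S22: obtaining the output calculation equations of the GRU model from the reset gate and update gate equations.
  7. The RNN-based Webshell detection method according to claim 6, wherein the reset gate equation in S21 is:
    Figure PCTCN2020130234-appb-100002
    and the update gate equation is:
    Figure PCTCN2020130234-appb-100003
    where σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
  8. The RNN-based Webshell detection method according to claim 6, wherein the output equations of the GRU model in S22 are:
    Figure PCTCN2020130234-appb-100004
    Figure PCTCN2020130234-appb-100005
    where the symbol ⊙ denotes element-wise multiplication; σ denotes the sigmoid function, whose range is [0, 1], corresponding to each gate; x denotes the input; h denotes the output at a given moment; t denotes time, with range [1, T]; l denotes the layer, with range [1, L]; and W and U denote the corresponding weight matrices.
  9. The RNN-based Webshell detection method according to claim 1, wherein discriminating the source file with the GRU model in S3 comprises the following steps:
    S31: inputting the keywords into the trained GRU model;
    S32: determining, by the GRU model and from the keywords, whether the source file is a webshell, that is, a command execution environment.
  10. An electronic device, comprising a memory and a processor, wherein the memory stores an RNN-based Webshell detection program that can run on the processor, and the program is executed by the processor to implement the steps of the RNN-based Webshell detection method according to any one of claims 1 to 9.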
PCT/CN2020/130234 2020-08-13 2020-11-19 RNN-based Webshell detection method and device WO2022032917A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010809947.2A CN112118225B (zh) 2020-08-13 2020-08-13 RNN-based Webshell detection method and device
CN202010809947.2 2020-08-13

Publications (1)

Publication Number Publication Date
WO2022032917A1 true WO2022032917A1 (zh) 2022-02-17

Family

ID=73804912

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/130234 WO2022032917A1 (zh) 2020-08-13 2020-11-19 RNN-based Webshell detection method and device

Country Status (2)

Country Link
CN (1) CN112118225B (zh)
WO (1) WO2022032917A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117579385A (zh) * 2024-01-16 2024-02-20 山东星维九州安全技术有限公司 Method, system and device for rapidly screening novel WebShell traffic

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
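CN112733157B (zh) * 2021-04-01 2021-07-30 中国人民解放军国防科技大学 File upload method, system and medium based on a non-executable directory
CN113761534A (zh) * 2021-09-08 2021-12-07 广东电网有限责任公司江门供电局 Webshell file detection method and system
CN114499944B (zh) * 2021-12-22 2023-08-08 天翼云科技有限公司 Method, apparatus and device for detecting WebShell
CN114844698A (zh) * 2022-04-29 2022-08-02 深圳极联软件有限公司 Data security management and control system and method for distributed big data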

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309304A (zh) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 Text classification method, apparatus, device and storage medium
US20190334948A1 (en) * 2016-12-16 2019-10-31 Huawei Technologies Co., Ltd. Webshell detection method and apparatus
CN111062034A (zh) * 2018-10-16 2020-04-24 中移(杭州)信息技术有限公司 Webshell file detection method, apparatus, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
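CN107516041B (zh) * 2017-08-17 2020-04-03 北京安普诺信息技术有限公司 WebShell detection method and system based on deep neural network
CN109522716B (zh) * 2018-11-15 2021-02-23 中国人民解放军战略支援部队信息工程大学 Network intrusion detection method and device based on temporal neural network
CN110414219B (zh) * 2019-07-24 2021-07-23 长沙市智为信息技术有限公司 Injection attack detection method based on gated recurrent unit and attention mechanism
CN110855661B (zh) * 2019-11-11 2022-05-13 杭州安恒信息技术股份有限公司 WebShell detection method, apparatus, device and medium
CN111078838B (zh) * 2019-12-13 2023-08-18 北京小米智能科技有限公司 Keyword extraction method, keyword extraction apparatus and electronic device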

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190334948A1 (en) * 2016-12-16 2019-10-31 Huawei Technologies Co., Ltd. Webshell detection method and apparatus
CN111062034A (zh) * 2018-10-16 2020-04-24 中移(杭州)信息技术有限公司 Webshell file detection method, apparatus, electronic device and storage medium
CN110309304A (zh) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 Text classification method, apparatus, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU LONG, WANG CHEN, SHI YIN: "Research of Webshell Detection Based on RNN", COMPUTER ENGINEERING AND APPLICATIONS, vol. 56, no. 14, 26 August 2019 (2019-08-26), CN , pages 88 - 92, XP055900220, ISSN: 1002-8331, DOI: 10.3778/j.issn.1002-8331.1904-0420 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117579385A (zh) * 2024-01-16 2024-02-20 山东星维九州安全技术有限公司 Method, system and device for rapidly screening novel WebShell traffic
CN117579385B (zh) * 2024-01-16 2024-03-19 山东星维九州安全技术有限公司 Method, system and device for rapidly screening novel WebShell traffic

Also Published As

Publication number Publication date
CN112118225B (zh) 2021-09-03
CN112118225A (zh) 2020-12-22

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20949410

Country of ref document: EP

Kind code of ref document: A1