CN107659570B - Webshell detection method and system based on machine learning and dynamic and static analysis

Info

Publication number: CN107659570B
Application number: CN201710903110.2A
Authority: CN
Inventors: 唐佳莉; 范渊; 莫金友
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2020-09-15
Anticipated expiration: 2037-09-29
Also published as: CN107659570A

Abstract

The embodiments of the invention provide a Webshell detection method and system based on machine learning and dynamic and static analysis, and relate to the technical field of Webshell detection. The method comprises obtaining a sample file, extracting static features and dynamic features of the sample file, obtaining a classification model according to the static features, the dynamic features, and a machine learning algorithm, and analyzing a file to be detected with the classification model to obtain a detection result. Because the method combines dynamic and static analysis, the extracted features are more comprehensive. A machine learning algorithm that combines several classification algorithms learns from a large number of Webshell samples and normal webpage samples to form the classification model, which makes the model more stable and the classification more accurate. With the classification model, Webshells and their variants can be detected effectively, new types of Webshell can be predicted, and text obfuscation techniques can be handled well, which overcomes the shortcomings of the signature-matching detection used in the prior art.

Description

Webshell detection method and system based on machine learning and dynamic and static analysis

Technical Field

The present invention relates to the technical field of Webshell detection, and in particular to a Webshell detection method and system based on machine learning and dynamic and static analysis.

Background

With the rapid development of Internet applications and the explosive growth of Internet data, server security problems are becoming increasingly serious. Webshells, which are backdoor programs based on Web applications, pose a great danger to user information and even to the entire application system. It is therefore crucial to detect server vulnerabilities and backdoors in time and keep the server secure.

Most Webshells are written in scripting languages and are easy to modify and vary. Their features are not limited to signatures; they also include file operation functions, malicious execution functions, file comment size, single-line string length, degree of obfuscation, and so on. When a Webshell is simply varied or its signature is deliberately obfuscated, traditional methods fail to report it; that is, obfuscation makes it easy to bypass detection by firewalls and antivirus software. Current Webshell detection methods based on feature matching therefore have difficulty detecting and identifying Webshell variants quickly.

Therefore, how to overcome the narrowness and lag of traditional signature-matching Webshell detection, cope with the text obfuscation techniques used by Webshells, and detect Webshells and their variants quickly has long been a focus of those skilled in the art.

Summary of the Invention

An object of the present invention is to provide a Webshell detection method based on machine learning and dynamic and static analysis, so as to overcome the narrowness and lag of traditional signature-matching Webshell detection, improve the accuracy of Webshell detection, and quickly detect Webshells and their variants.

Another object of the present invention is to provide a Webshell detection system based on machine learning and dynamic and static analysis, so as to overcome the narrowness and lag of traditional signature-matching Webshell detection, improve the accuracy of Webshell detection, and quickly detect Webshells and their variants.

To achieve the above objects, the embodiments of the present invention adopt the following technical solutions:

In a first aspect, an embodiment of the present invention provides a Webshell detection method based on machine learning and dynamic and static analysis. The method includes: acquiring a sample file; extracting static features and dynamic features of the sample file; and obtaining a classification model according to the static features, the dynamic features, and a machine learning algorithm, where the classification model analyzes a file to be detected and produces a detection result.

Further, the step of extracting the static features and dynamic features of the sample file includes: performing static analysis on the sample file to obtain the static features, where the static features include document features, basic function features, and file behavior features of the sample file; and performing dynamic analysis on the sample file to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features.

Further, the step of obtaining the classification model according to the static features, the dynamic features, and the machine learning algorithm includes: learning from the static features and the dynamic features with the machine learning algorithm to obtain the classification model.

Further, the machine learning algorithm is an ensemble learning approach that combines multiple classification algorithms.

Further, the method also includes: when the file to be detected is confirmed to be a Webshell after detection, performing machine learning again on the file to be detected together with the sample file to update the classification model.

In a second aspect, an embodiment of the present invention also provides a Webshell detection system based on machine learning and dynamic and static analysis. The system includes a sample acquisition module, a feature extraction module, and a model building module. The sample acquisition module is configured to acquire a sample file; the feature extraction module is configured to extract static features and dynamic features of the sample file; and the model building module is configured to obtain a classification model according to the static features, the dynamic features, and a machine learning algorithm, where the classification model analyzes a file to be detected and produces a detection result.

Further, the feature extraction module includes a static analysis module and a dynamic analysis module. The static analysis module is configured to perform static analysis on the sample file to obtain the static features, where the static features include document features, basic function features, and file behavior features of the sample file. The dynamic analysis module is configured to perform dynamic analysis on the sample file to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features.

Further, the model building module is configured to learn from the static features and the dynamic features with the machine learning algorithm to obtain the classification model.

Further, the machine learning algorithm adopted by the model building module is an ensemble learning approach that combines multiple classification algorithms.

Further, the system also includes a model update module. The model update module is configured to, when the file to be detected is confirmed to be a Webshell after detection, perform machine learning again on the file to be detected together with the sample file to update the classification model.

Compared with the prior art, the present invention has the following beneficial effects. The Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments of the present invention acquire a sample file, extract static features and dynamic features of the sample file, and obtain a classification model according to the static features, the dynamic features, and a machine learning algorithm; the classification model analyzes a file to be detected and produces a detection result. Because the embodiments combine dynamic and static analysis, the extracted features are more comprehensive. A machine learning algorithm that combines multiple classification algorithms learns from a large number of Webshell samples and normal webpage samples to form the classification model, so the model is more stable and the classification more accurate. Machine learning can handle complex classification over many features, so the features used in detection are not limited to a single signature. With this classification model, users can effectively detect Webshells and their variants, predict new types of Webshell, and cope well with text obfuscation techniques, making up for the shortcomings of traditional signature-matching detection.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a schematic block diagram of a server provided by an embodiment of the present invention.

FIG. 2 is a functional module diagram of the Webshell detection system based on machine learning and dynamic and static analysis provided by the first embodiment of the present invention.

FIG. 3 is a functional module diagram of the feature extraction module in FIG. 2.

FIG. 4 is a schematic flowchart of Webshell detection performed by the Webshell detection system based on machine learning and dynamic and static analysis.

FIG. 5 is a schematic flowchart of the Webshell detection method based on machine learning and dynamic and static analysis provided by the second embodiment of the present invention.

FIG. 6 is a detailed schematic flowchart of step S202 in FIG. 5.

Reference numerals: 100-server; 400-Webshell detection system based on machine learning and dynamic and static analysis; 110-memory; 120-storage controller; 130-processor; 410-sample acquisition module; 420-feature extraction module; 430-model building module; 440-model update module; 421-static analysis module; 422-dynamic analysis module.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments described and illustrated in the drawings may generally be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings. Once an item is defined in one drawing, it need not be defined or explained again in subsequent drawings. In the description of the present invention, the terms "first", "second", and so on are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance.

The Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments of the present invention can be applied to the server 100 shown in FIG. 1. In this embodiment, the server 100 may be, but is not limited to, a network server, a database server, a cloud server, or the like. As shown in FIG. 1, the server 100 may include a memory 110, a storage controller 120, and a processor 130.

The memory 110, the storage controller 120, and the processor 130 are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected through one or more communication buses or signal lines. The Webshell detection system 400 based on machine learning and dynamic and static analysis includes at least one software function module that can be stored in the memory 110 in the form of software or firmware, or built into the operating system (OS) of the server 100. The processor 130 is configured to execute executable modules stored in the memory 110, for example the software function modules and computer programs included in the Webshell detection system 400.

The memory 110 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory 110 can store software programs and modules, and the processor 130 executes a program after receiving an execution instruction.

The processor 130 may be an integrated circuit chip with signal processing capability. The processor 130 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor 130 may be any conventional processor.

It can be understood that the structure shown in FIG. 1 is only illustrative. The server 100 may include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1. Each component shown in FIG. 1 may be implemented in hardware, software, or a combination thereof.

First Embodiment

FIG. 2 is a functional module diagram of the Webshell detection system 400 based on machine learning and dynamic and static analysis provided by the first embodiment of the present invention. The system 400 includes a sample acquisition module 410, a feature extraction module 420, and a model building module 430.

The sample acquisition module 410 is configured to acquire sample files. In this embodiment, the sample files include a large number of Webshell samples and normal website samples. The Webshell samples include Trojans written in various languages, such as ASP Trojans, PHP Trojans, and JSP Trojans; by kind they can also be divided into one-line Trojans, image Trojans, full-featured upload Trojans ("big horses"), and so on. The normal website samples may be various CMSs written in PHP, the original code of the website to be checked, and so on; this is not limited here. Preferably, the acquired sample files are stored in a database, and users can freely add Webshell samples and normal webpage code that they have collected themselves. Depending on the detection environment, supplying the original file code of the website as positive samples can usually improve the accuracy of the model and reduce the false positive rate.

The feature extraction module 420 is configured to extract static features and dynamic features of the sample files.

In this embodiment, the feature extraction module 420 performs dynamic and static analysis on a large number of sample files. As shown in FIG. 3, the feature extraction module 420 includes a static analysis module 421 and a dynamic analysis module 422.

The static analysis module 421 is configured to perform static analysis on the sample file to obtain the static features, where the static features include document features, basic function features, and file behavior features of the sample file. In this embodiment, the static analysis module 421 mainly analyzes the characters in the sample file and computes the file's values along multiple feature dimensions. Specifically, the document features may include, but are not limited to: word count, number of distinct words, line count, average number of words per line, number of null characters and spaces, maximum word length, number of comments, and so on. The basic function features may include, but are not limited to: character manipulation functions, sensitive function calls, number of system function calls, number of script blocks, maximum function argument length, encryption and decryption function calls, and so on. The file behavior features may include, but are not limited to: file operations, FTP operations, database operations, and so on.
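The static document and function features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `static_features`, the list `SENSITIVE_FUNCS`, and the regular expressions are hypothetical and do not come from the patent, which does not enumerate specific functions or tokenization rules.

```python
import re

# Hypothetical list of sensitive PHP functions; the patent does not enumerate them.
SENSITIVE_FUNCS = ("eval", "assert", "system", "exec", "shell_exec",
                   "passthru", "base64_decode")

def static_features(source: str) -> dict:
    """Compute a few document and function-call features from source text."""
    lines = source.splitlines()
    # Treat identifier-like tokens as "words" (an assumed tokenization).
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    # Rough comment matcher for //, # and /* ... */ styles; it does not
    # understand string literals, so URLs and similar text can be miscounted.
    comments = re.findall(r"//[^\n]*|#[^\n]*|/\*.*?\*/", source, flags=re.DOTALL)
    features = {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "line_count": len(lines),
        "avg_words_per_line": len(words) / len(lines) if lines else 0.0,
        "whitespace_count": sum(ch.isspace() for ch in source),
        "max_word_length": max((len(w) for w in words), default=0),
        "comment_count": len(comments),
    }
    # Count textual call sites of each sensitive function.
    for name in SENSITIVE_FUNCS:
        features[f"calls_{name}"] = len(re.findall(rf"\b{name}\s*\(", source))
    return features
```

A real static analyzer would more likely work on a token stream or syntax tree for each supported language; plain regular expressions are used here only to keep the sketch short.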

The dynamic analysis module 422 is configured to perform dynamic analysis on the sample file to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features. In this embodiment, the dynamic analysis module 422 sets up a compilation environment or hook extension for each programming language, monitors execution, and combines tag tracking of external input variables with blacklist and whitelist mechanisms to perform real-time dynamic detection of Webshells and summarize the dynamic features of the sample file. In this embodiment, the features used to cope with Webshell obfuscation include: the computed text entropy, the number of invalid characters in the text, and the sensitive strings and sensitive functions produced when the file is run under dynamic analysis.
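Two of the obfuscation-related features named above, text entropy and the number of invalid characters, lend themselves to a short worked example. The sketch below assumes that "text entropy" means Shannon entropy over characters and that "invalid" means anything other than printable ASCII or common whitespace; the patent defines neither term, so both are assumptions, as are the function names.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def invalid_char_count(text: str) -> int:
    """Count characters that are neither printable ASCII nor tab/newline/CR.

    This definition of "invalid" is an assumption made for illustration.
    """
    return sum(1 for ch in text
               if not (32 <= ord(ch) <= 126 or ch in "\t\n\r"))
```

Encoded or encrypted payloads tend to use their character set more uniformly than hand-written code, so a higher entropy value is one hint of obfuscation; it is a feature for the classifier, not a verdict on its own.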

The model building module 430 is configured to obtain a classification model according to the static features, the dynamic features, and a machine learning algorithm; the classification model analyzes a file to be detected and produces a detection result.

In this embodiment, the user uploads the file to be detected to the system. The classification model then performs Webshell detection on the file, produces a classification result, and generates a detection report for the user to view.

In this embodiment, the model building module 430 is configured to learn from the static features and the dynamic features with the machine learning algorithm to obtain the classification model. Specifically, the model building module 430 first normalizes the static features and the dynamic features to obtain a set of feature vectors, and then applies the machine learning algorithm to the feature vector set to compute the classification model. Preferably, in this embodiment, the machine learning algorithm is an ensemble learning approach that combines multiple classification algorithms, which may include a random forest algorithm, a decision tree algorithm, a logic algorithm, and so on. Combining multiple classification algorithms in an ensemble improves the stability and robustness of the model, and thereby the detection accuracy of the classification model.
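The two stages described above, normalization into feature vectors followed by ensemble learning, can be illustrated with a small pure-Python sketch. It assumes min-max scaling as the normalization and hard majority voting as the way classifiers are combined; the patent specifies neither. The one-feature decision stumps are simplified stand-ins for the random forest, decision tree, and logic classifiers it names, and the label encoding (1 for Webshell, 0 for normal) and toy data are hypothetical.

```python
from collections import Counter

def min_max_normalize(rows):
    """Scale every feature column into [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def train_stump(X, y, feature):
    """One-feature classifier: threshold midway between the two class means.

    Assumes both classes (1 = Webshell, 0 = normal) occur in y.
    """
    pos = [row[feature] for row, label in zip(X, y) if label == 1]
    neg = [row[feature] for row, label in zip(X, y) if label == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    direction = 1 if mean_pos >= mean_neg else -1

    def predict(row):
        return 1 if direction * (row[feature] - threshold) > 0 else 0
    return predict

def majority_vote(classifiers, row):
    """Hard voting: return the label predicted by most classifiers."""
    votes = Counter(clf(row) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy data: [entropy, sensitive calls, longest line]; values are made up.
raw = [[5.8, 4, 900], [5.5, 3, 700], [4.1, 0, 40], [4.3, 1, 60]]
labels = [1, 1, 0, 0]
X = min_max_normalize(raw)
ensemble = [train_stump(X, labels, f) for f in range(3)]
predictions = [majority_vote(ensemble, row) for row in X]
```

In practice the scaling bounds learned from the training set would have to be stored and reused when a new file is scored, and the base learners would be full classifiers rather than stumps.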

It should be noted that, in this embodiment, after the classification model has been built, a portion of the data that was not used for learning can be used to test the model's error-detection rate, false positive rate, false negative rate, and so on. The classification model is then adjusted according to the test results, for example by changing the proportion, number, and types of positive and negative samples in the sample files, which improves the accuracy of the classification model and optimizes it.
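The rates mentioned above can be computed from held-out labels and predictions as in the following sketch. It assumes label 1 means Webshell and 0 means normal file, and it interprets the three quantities as overall error rate, false positive rate, and false negative rate using their standard definitions; the patent does not give formulas, and the function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return error rate, false positive rate and false negative rate.

    Labels: 1 = Webshell, 0 = normal file (an assumed encoding).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as Webshell
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Webshell missed
    total = len(pairs)
    return {
        "error_rate": (fp + fn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }
```

A high false positive rate suggests adding more normal code from the target site to the samples, while a high false negative rate suggests adding more Webshell variants.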

Further, the Webshell detection system 400 based on machine learning and dynamic and static analysis also includes a model update module 440. The model update module 440 is configured to, when the file to be detected matches Webshell features, perform machine learning again on the file to be detected together with the sample files to update the classification model.

In this embodiment, the user can use the Webshell detection system 400 to perform Webshell detection on an unknown file (that is, the file to be detected). When the file to be detected is found to be a malicious Webshell file, it is added to the malicious sample database, and machine learning is performed again together with the previous sample files to optimize and update the classification model. The specific procedure by which a user performs Webshell detection with the system is shown in FIG. 4 and includes:

Step S101: acquire the file to be detected.

Specifically, the user connects to the system and uploads the file to be detected, and the system obtains the file through the sample acquisition module 410.

Step S102: extract the static features and dynamic features of the file to be detected.

In this embodiment, the system automatically extracts the dynamic and static features of the file to be detected through the feature extraction module 420.

Step S103: analyze the file to be detected with the classification model and obtain a detection result.

Specifically, once feature extraction for the file to be detected is complete, the established classification model determines whether the file is a malicious Webshell file and produces a detection result. The result is then combined with the dynamic and static features extracted by the feature extraction module 420 to form a detection report for the user to view. For example, the detection report may show the probability, as a percentage, that the file to be detected is a malicious Webshell file, and the extracted features (such as malicious functions, file operation behaviors, and blacklisted characters that appear).
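A detection report of the kind described in step S103 might be assembled as in the following sketch. The function name `build_report`, the field names, and the 0.5 decision threshold are hypothetical; the patent only says that the report may show a probability percentage and the extracted features.

```python
def build_report(filename, webshell_probability, features, threshold=0.5):
    """Combine the model output with the extracted features into a report.

    webshell_probability is assumed to be a value in [0, 1] produced by
    the classification model; threshold is an illustrative cut-off.
    """
    # Keep only features that actually fired, so the report stays readable.
    flagged = {name: value for name, value in features.items() if value}
    return {
        "file": filename,
        "webshell_probability_percent": round(webshell_probability * 100, 1),
        "is_webshell": webshell_probability >= threshold,
        "flagged_features": flagged,
    }
```

For instance, `build_report("upload.php", 0.875, {"calls_eval": 2, "ftp_operations": 0})` would list only `calls_eval` among the flagged features.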

Second Embodiment

FIG. 5 is a schematic flowchart of the Webshell detection method based on machine learning and dynamic and static analysis provided by the second embodiment of the present invention. It should be noted that the method described in this embodiment is not limited to the specific order shown in FIG. 5 and described below. Its basic principle and technical effects are the same as those of the first embodiment; for brevity, anything not mentioned in this embodiment can be found in the corresponding content of the first embodiment. It should be understood that, in other embodiments, the order of some steps of the method may be exchanged according to actual needs, and some steps may be omitted or deleted. The specific flow shown in FIG. 5 is described in detail below.

步骤S201，获取样本文件。Step S201, obtaining a sample file.

可以理解，该步骤S201可以由上述的样本获取模块410执行。It can be understood that this step S201 may be performed by the above-mentioned sample acquisition module 410 .

步骤S202，提取所述样本文件的静态特征和动态特征。Step S202, extracting static features and dynamic features of the sample file.

可以理解，该步骤S202可以由上述的特征提取模块420执行。It can be understood that this step S202 can be performed by the feature extraction module 420 described above.

如图6所示，在本实施例中，所述步骤S202具体包括如下子步骤：As shown in FIG. 6, in this embodiment, the step S202 specifically includes the following sub-steps:

子步骤S2021，对所述样本文件进行静态分析得到所述静态特征，其中，所述静态特征包括所述样本文件的文档特征、基本函数特征、文件行为特征。Sub-step S2021, performing static analysis on the sample file to obtain the static feature, wherein the static feature includes the document feature, basic function feature, and file behavior feature of the sample file.

可以理解，该步骤S2021可以由上述的静态分析模块421执行。It can be understood that this step S2021 may be performed by the static analysis module 421 described above.

Sub-step S2022: perform dynamic analysis on the sample file to obtain the dynamic features, wherein the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features.

It can be understood that sub-step S2022 may be performed by the dynamic analysis module 422 described above.
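One possible way to summarize the output of sub-step S2022 is sketched below. It assumes that a language-specific hook (as recited in claim 1) has already recorded runtime events, each marked as tainted when it involves an externally supplied variable. The event format, the `BLACKLIST` and `WHITELIST` contents, and the feature names are all hypothetical and serve only to show how tag tracking and a blacklist and whitelist mechanism might be combined.

```python
from collections import Counter

# Hypothetical lists; a real deployment would maintain these per language.
BLACKLIST = {"eval", "system", "exec", "passthru"}
WHITELIST = {"strlen", "htmlspecialchars"}


def dynamic_features(events):
    """Summarize recorded hook events into counts per dynamic feature group.

    Each event is a dict with keys "kind" ("include", "call" or "string"),
    "name", and "tainted" (True if external input reached the operation).
    """
    counts = Counter()
    for ev in events:
        if ev["name"] in WHITELIST:
            continue  # whitelisted operations are ignored
        if ev["kind"] == "include" and ev["tainted"]:
            counts["tainted_include"] += 1
        elif ev["kind"] == "call" and ev["name"] in BLACKLIST:
            counts["sensitive_call"] += 1
            if ev["tainted"]:
                counts["tainted_sensitive_call"] += 1
        elif ev["kind"] == "string" and ev["tainted"]:
            counts["sensitive_string"] += 1
    return dict(counts)
```

The sketch only aggregates events; building the hook or compilation environment that produces them is outside its scope.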

It should be noted that, in this embodiment, the order of sub-steps S2021 and S2022 is not limited; they may also be performed simultaneously.

Step S203: obtain a classification model according to the static features, the dynamic features, and a machine learning algorithm; the classification model analyzes the file to be detected and produces a detection result.

It can be understood that step S203 may be performed by the model building module 430 described above.
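The patent does not name the individual classification algorithms that are combined. Purely as an illustration of step S203, the sketch below combines three simple classifiers (a threshold stump, a nearest-centroid classifier, and a 1-nearest-neighbour classifier) by majority vote. The choice of these three algorithms and all function names are assumptions; label 1 denotes a Webshell and label 0 a normal page, and the training data is assumed to contain both labels.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_stump(X, y, j=0):
    """Pick the threshold on feature j that classifies the most samples correctly."""
    best_acc, best_t = -1, 0.0
    for t in sorted({row[j] for row in X}):
        acc = sum((row[j] >= t) == bool(label) for row, label in zip(X, y))
        if acc > best_acc:
            best_acc, best_t = acc, t
    return lambda row: int(row[j] >= best_t)


def train_centroid(X, y):
    """Classify by the nearer of the two class centroids."""
    def centroid(label):
        rows = [r for r, l in zip(X, y) if l == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    return lambda row: int(sq_dist(row, c1) < sq_dist(row, c0))


def train_1nn(X, y):
    """Classify by the label of the closest training sample."""
    pairs = list(zip(X, y))
    return lambda row: min(pairs, key=lambda p: sq_dist(row, p[0]))[1]


def train_ensemble(X, y):
    """Train all three classifiers and combine them by majority vote."""
    models = [train_stump(X, y), train_centroid(X, y), train_1nn(X, y)]
    def predict(row):
        votes = sum(m(row) for m in models)
        return int(votes * 2 > len(models))
    return predict
```

Here each row of `X` would be the concatenation of the static and dynamic feature values of one sample file. Voting across different algorithms is one common way to make the combined model less sensitive to the weaknesses of any single classifier.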

Step S204: when the file to be detected is confirmed to be a Webshell after detection, perform machine learning again on the file to be detected together with the sample files to update the classification model.

It can be understood that step S204 may be performed by the model updating module 440 described above.
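The update in step S204 can be sketched as a small wrapper that keeps the sample set and retrains whenever a detected file is confirmed as a Webshell. The class name `Detector`, its methods, and the stand-in 1-nearest-neighbour trainer are hypothetical; any training function that returns a callable model could be substituted.

```python
def train_1nn(X, y):
    """Stand-in trainer: label of the closest training sample (an assumption)."""
    pairs = list(zip(X, y))
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return lambda row: min(pairs, key=lambda p: dist(row, p[0]))[1]


class Detector:
    """Holds the sample set and the current classification model."""

    def __init__(self, train, samples, labels):
        self.train = train
        self.samples = list(samples)
        self.labels = list(labels)
        self.model = train(self.samples, self.labels)

    def detect(self, features):
        """Return 1 if the model classifies the file as a Webshell, else 0."""
        return self.model(features)

    def confirm_webshell(self, features):
        """Add a confirmed Webshell to the samples and retrain the model."""
        self.samples.append(features)
        self.labels.append(1)
        self.model = self.train(self.samples, self.labels)
```

Retraining on the full sample set after every confirmation is the simplest reading of the step; batching confirmations before retraining would be a reasonable variation.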

In summary, the Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments of the present invention acquire sample files, perform static analysis and dynamic analysis on the sample files to extract their static features and dynamic features respectively, and apply a machine learning algorithm to the static and dynamic features to obtain a classification model; the classification model analyzes the file to be detected and produces a detection result. Further, when the file to be detected is confirmed to be a Webshell after detection, it is added to the sample database, and machine learning is performed again together with the previous sample files to update the classification model. The embodiments combine dynamic and static analysis, so the extracted features are more comprehensive; a machine learning algorithm that combines multiple classification algorithms learns from a large number of Webshell samples and normal webpage samples to form the classification model, so the model is more stable and its classification more accurate. A machine learning algorithm can handle complex classification over many features, so the features involved in detection are not limited to a single signature. With this classification model, users can effectively detect Webshells and their variants, predict new types of Webshell, and cope better with text obfuscation, making up for the shortcomings of traditional signature-matching detection.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention. It should be noted that like reference numerals and letters denote like items in the accompanying figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

Claims

1. A Webshell detection method based on machine learning and dynamic and static analysis, characterized in that the Webshell detection method based on machine learning and dynamic and static analysis comprises:

acquiring a sample file;

analyzing characters in the sample file to obtain static features of the sample file, wherein the static features include document features, basic function features, and file behavior features of the sample file; the document features include the number of words, the number of different words, the number of lines, the average number of words per line, the number of empty characters and spaces, the maximum word length, and the number of comments; the basic function features include character operation functions, sensitive function calls, the number of system function calls, the number of script blocks, the maximum length of function parameters, and encryption and decryption function calls; and the file behavior features include file operations, ftp operations, and database operations;

for different programming languages, establishing a compilation environment or a hook extension, performing real-time dynamic detection of Webshells by monitoring combined with tag tracking of external input variables and a blacklist and whitelist mechanism, and summarizing the dynamic features of the sample file, wherein the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features;

obtaining a classification model according to the static features, the dynamic features, and a machine learning algorithm, wherein the classification model analyzes the file to be detected and obtains a detection result.

2. The Webshell detection method based on machine learning and dynamic and static analysis as claimed in claim 1, characterized in that the step of obtaining a classification model according to the static features, the dynamic features, and the machine learning algorithm comprises:

using the machine learning algorithm to learn from the static features and the dynamic features to obtain the classification model.

3. The Webshell detection method based on machine learning and dynamic and static analysis as claimed in claim 1, wherein the machine learning algorithm is an ensemble learning method combining multiple classification algorithms.

4. The Webshell detection method based on machine learning and dynamic and static analysis as claimed in claim 1, characterized in that the Webshell detection method based on machine learning and dynamic and static analysis further comprises:

when the file to be detected is confirmed to be a Webshell after detection, performing machine learning again according to the file to be detected and the sample file to update the classification model.

5. A Webshell detection system based on machine learning and dynamic and static analysis, characterized in that the Webshell detection system based on machine learning and dynamic and static analysis comprises:

a sample acquisition module, configured to acquire a sample file;

a feature extraction module, configured to extract static features and dynamic features of the sample file, the feature extraction module including: a static analysis module, configured to analyze characters in the sample file to obtain the static features of the sample file, wherein the static features include document features, basic function features, and file behavior features of the sample file; the document features include the number of words, the number of different words, the number of lines, the average number of words per line, the number of empty characters and spaces, the maximum word length, and the number of comments; the basic function features include character operation functions, sensitive function calls, the number of system function calls, the number of script blocks, the maximum length of function parameters, and encryption and decryption function calls; and the file behavior features include file operations, ftp operations, and database operations; and a dynamic analysis module, configured to establish, for different programming languages, a compilation environment or a hook extension, perform real-time dynamic detection of Webshells by monitoring combined with tag tracking of external input variables and a blacklist and whitelist mechanism, and summarize the dynamic features of the sample file, wherein the dynamic features include file inclusion operation features, sensitive function execution features, and sensitive string features;

a model building module, configured to obtain a classification model according to the static features, the dynamic features, and a machine learning algorithm, wherein the classification model analyzes the file to be detected and obtains a detection result.

6. The Webshell detection system based on machine learning and dynamic and static analysis as claimed in claim 5, characterized in that the model building module is configured to use the machine learning algorithm to learn from the static features and the dynamic features to obtain the classification model.

7. The Webshell detection system based on machine learning and dynamic and static analysis as claimed in claim 5, wherein the machine learning algorithm adopted by the model building module is an ensemble learning method combining multiple classification algorithms.

8. The Webshell detection system based on machine learning and dynamic and static analysis as claimed in claim 5, characterized in that the Webshell detection system based on machine learning and dynamic and static analysis further comprises:

a model updating module, configured to perform machine learning again according to the file to be detected and the sample file to update the classification model when the file to be detected is confirmed to be a Webshell after detection.