CN107659570B - Webshell detection method and system based on machine learning and dynamic and static analysis - Google Patents
- Publication number
- CN107659570B (application CN201710903110.2A)
- Authority
- CN
- China
- Prior art keywords
- dynamic
- machine learning
- file
- features
- webshell
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Hardware Design (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The invention relates to the technical field of Webshell detection, and in particular to a Webshell detection method and system based on machine learning and combined dynamic and static analysis.
Background
With the rapid development of Internet applications and the rapid growth of Internet data, server security problems are becoming increasingly serious. Web-application backdoor programs such as Webshells are extremely harmful to user information and even to the entire application system, so it is crucial to detect server vulnerabilities and backdoors promptly and keep the server secure.
Most Webshells are written in scripting languages and are easy to modify and mutate. Their characteristics are not limited to signature codes; they also include file operation functions, malicious execution functions, file comment size, single-line string length, degree of obfuscation, and so on. When a Webshell is simply mutated or its signature is deliberately obfuscated, traditional methods miss it; that is, obfuscation easily bypasses detection by firewalls and antivirus software. Current Webshell detection methods based on signature matching therefore struggle to detect and identify Webshell variants quickly.
Therefore, how to overcome the narrowness and lag of traditional signature-matching Webshell detection, cope with the text obfuscation used by Webshells, and detect Webshells and their variants quickly has long been a focus for those skilled in the art.
Summary of the Invention
One object of the invention is to provide a Webshell detection method based on machine learning and dynamic and static analysis that overcomes the narrowness and lag of traditional signature-matching Webshell detection, improves detection accuracy, and detects Webshells and their variants quickly.
Another object of the invention is to provide a Webshell detection system based on machine learning and dynamic and static analysis that overcomes the narrowness and lag of traditional signature-matching Webshell detection, improves detection accuracy, and detects Webshells and their variants quickly.
To achieve the above objects, the embodiments of the invention adopt the following technical solutions:
In a first aspect, an embodiment of the invention proposes a Webshell detection method based on machine learning and dynamic and static analysis. The method includes: acquiring sample files; extracting static features and dynamic features of the sample files; and obtaining a classification model from the static features, the dynamic features and a machine learning algorithm, where the classification model analyzes a file to be detected and produces a detection result.
Further, the step of extracting the static features and dynamic features of the sample files includes: performing static analysis on the sample files to obtain the static features, where the static features include document features, basic function features and file behavior features of the sample files; and performing dynamic analysis on the sample files to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features and sensitive string features.
Further, the step of obtaining the classification model from the static features, the dynamic features and the machine learning algorithm includes: learning from the static features and the dynamic features with the machine learning algorithm to obtain the classification model.
Further, the machine learning algorithm is an ensemble learning approach that combines multiple classification algorithms.
Further, the method also includes: when the file to be detected is confirmed to be a Webshell after detection, performing machine learning again on the file to be detected together with the sample files to update the classification model.
In a second aspect, an embodiment of the invention also proposes a Webshell detection system based on machine learning and dynamic and static analysis. The system includes a sample acquisition module, a feature extraction module and a model building module. The sample acquisition module acquires sample files; the feature extraction module extracts static features and dynamic features of the sample files; the model building module obtains a classification model from the static features, the dynamic features and a machine learning algorithm, and the classification model analyzes a file to be detected and produces a detection result.
Further, the feature extraction module includes a static analysis module and a dynamic analysis module. The static analysis module performs static analysis on the sample files to obtain the static features, where the static features include document features, basic function features and file behavior features of the sample files; the dynamic analysis module performs dynamic analysis on the sample files to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features and sensitive string features.
Further, the model building module learns from the static features and the dynamic features with the machine learning algorithm to obtain the classification model.
Further, the machine learning algorithm used by the model building module is an ensemble learning approach that combines multiple classification algorithms.
Further, the system also includes a model update module. When the file to be detected is confirmed to be a Webshell after detection, the model update module performs machine learning again on the file to be detected together with the sample files to update the classification model.
Compared with the prior art, the invention has the following beneficial effects. The Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments acquire sample files, extract their static and dynamic features, and obtain a classification model from the static features, the dynamic features and a machine learning algorithm; the classification model analyzes a file to be detected and produces a detection result. The embodiments combine dynamic and static analysis, so the extracted features are more comprehensive. A machine learning algorithm that combines multiple classification algorithms learns from a large number of Webshell samples and normal web page samples to form the classification model, which makes the model more stable and the classification more accurate. Machine learning can handle complex multi-feature classification, so the features used for detection are not limited to a single signature. With this classification model, users can effectively detect Webshells and their variants, predict new types of Webshell, and cope better with text obfuscation, which compensates for the shortcomings of traditional signature-matching detection.
To make the above objects, features and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of Drawings
To explain the technical solutions of the embodiments more clearly, the drawings used in the embodiments are briefly introduced below. The drawings show only some embodiments of the invention and should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from them without creative effort.
FIG. 1 shows a schematic block diagram of a server provided by an embodiment of the invention.
FIG. 2 shows a functional module diagram of the Webshell detection system based on machine learning and dynamic and static analysis provided by the first embodiment of the invention.
FIG. 3 shows a functional module diagram of the feature extraction module in FIG. 2.
FIG. 4 shows a schematic flowchart of Webshell detection performed by the Webshell detection system based on machine learning and dynamic and static analysis.
FIG. 5 shows a schematic flowchart of the Webshell detection method based on machine learning and dynamic and static analysis provided by the second embodiment of the invention.
FIG. 6 shows a detailed flowchart of step S202 in FIG. 5.
Reference numerals: 100-server; 400-Webshell detection system based on machine learning and dynamic and static analysis; 110-memory; 120-storage controller; 130-processor; 410-sample acquisition module; 420-feature extraction module; 430-model building module; 440-model update module; 421-static analysis module; 422-dynamic analysis module.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. The components of the embodiments described and illustrated in the drawings may be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
Note that similar numerals and letters refer to similar items in the following figures, so once an item is defined in one figure it needs no further definition or explanation in subsequent figures. In the description of the invention, the terms "first", "second" and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
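The Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments of the invention can be applied to the server 100 shown in FIG. 1. In this embodiment, the server 100 may be, but is not limited to, a web server, a database server, a cloud server, or the like. As shown in FIG. 1, the server 100 may include a memory 110, a storage controller 120 and a processor 130.
The memory 110, the storage controller 120 and the processor 130 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these components may be electrically connected through one or more communication buses or signal lines. The Webshell detection system 400 based on machine learning and dynamic and static analysis includes at least one software functional module that can be stored in the memory 110 in the form of software or firmware, or built into the operating system (OS) of the server 100. The processor 130 executes the executable modules stored in the memory 110, for example the software functional modules and computer programs included in the Webshell detection system 400.
The memory 110 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory 110 can store software programs and modules, and the processor 130 executes a program after receiving an execution instruction.
The processor 130 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor 130 may be any conventional processor.
The structure shown in FIG. 1 is only illustrative. The server 100 may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1. Each component shown in FIG. 1 may be implemented in hardware, software, or a combination of the two.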
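First Embodiment
FIG. 2 is a functional module diagram of the Webshell detection system 400 based on machine learning and dynamic and static analysis provided by the first embodiment of the invention. The system 400 includes a sample acquisition module 410, a feature extraction module 420 and a model building module 430.
The sample acquisition module 410 acquires sample files. In this embodiment, the sample files include a large number of Webshell samples and normal website samples. The Webshell samples include trojans written in multiple languages, such as ASP, PHP and JSP trojans, and by kind can also be divided into one-line trojans, image-embedded trojans, full-featured upload trojans, and so on. The normal website samples are various CMSs written in PHP, or the original code of the website to be checked, without limitation. Preferably, the acquired sample files are stored in a database, and users can freely add Webshell samples and normal web page code they have collected themselves. Depending on the detection environment, providing the website's original file code as positive samples usually improves the accuracy of the model and lowers the false positive rate.
The feature extraction module 420 extracts the static features and dynamic features of the sample files.
In this embodiment, the feature extraction module 420 performs dynamic and static analysis on the large number of sample files. As shown in FIG. 3, the feature extraction module 420 includes a static analysis module 421 and a dynamic analysis module 422.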
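The static analysis module 421 performs static analysis on the sample files to obtain the static features, where the static features include document features, basic function features and file behavior features of the sample files. In this embodiment, the static analysis module 421 mainly analyzes the characters in a sample file and computes the file's values along multiple feature dimensions. Specifically, the document features may include, but are not limited to: number of words, number of distinct words, number of lines, average number of words per line, number of null characters and spaces, maximum word length, number of comments, and so on. The basic function features may include, but are not limited to: character manipulation functions, sensitive function calls, number of system function calls, number of script blocks, maximum function parameter length, encryption and decryption function calls, and so on. The file behavior features may include, but are not limited to: file operations, FTP operations, database operations, and so on.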
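As a rough illustration of the document and function features listed above, the following minimal Python sketch counts a few of them for a single source file. It is not the patent's implementation: the function name `extract_static_features`, the feature names, and the `SENSITIVE_FUNCTIONS` watch list are all assumptions made for this example, and a real extractor would cover many more dimensions.

```python
import re

# Hypothetical watch list of sensitive PHP functions; the patent does not enumerate one.
SENSITIVE_FUNCTIONS = ("eval", "assert", "system", "exec", "shell_exec", "passthru")


def extract_static_features(source: str) -> dict:
    """Count a few of the document and function features named in the text."""
    lines = source.splitlines()
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    # Count call sites such as eval( ... ) for every function on the watch list.
    sensitive_calls = sum(
        len(re.findall(r"\b" + re.escape(name) + r"\s*\(", source))
        for name in SENSITIVE_FUNCTIONS
    )
    return {
        "word_count": len(words),
        "distinct_word_count": len(set(words)),
        "line_count": len(lines),
        "avg_words_per_line": len(words) / len(lines) if lines else 0.0,
        "whitespace_count": sum(ch.isspace() for ch in source),
        "max_word_length": max((len(w) for w in words), default=0),
        "sensitive_call_count": sensitive_calls,
    }


# A one-line trojan style sample, used only to exercise the function.
sample = "<?php\n$cmd = $_POST['cmd'];\neval($cmd);\n?>"
features = extract_static_features(sample)
```

Each file becomes one dictionary of numbers, which is the kind of per-dimension value the static analysis module is described as producing.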
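The dynamic analysis module 422 performs dynamic analysis on the sample files to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features and sensitive string features. In this embodiment, the dynamic analysis module 422 builds a compilation environment or hook extension for each programming language and performs real-time dynamic Webshell detection by monitoring execution, combined with tag tracking of external input variables and a blacklist/whitelist mechanism, and from this derives the dynamic features of the sample files. In this embodiment, the features used to counter Webshell obfuscation include: the computed text entropy, the number of invalid characters in the text, and the sensitive strings and sensitive functions produced when the file is run under dynamic analysis.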
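Two of the anti-obfuscation features mentioned above, text entropy and the number of invalid characters, are simple to compute. The sketch below assumes Shannon entropy over the character distribution and treats "invalid" as "non-printable"; the patent does not define either term precisely, so both choices, and the function names, are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def non_printable_count(text: str) -> int:
    """Count characters that are neither printable nor ordinary whitespace."""
    return sum(1 for ch in text if not ch.isprintable() and ch not in "\t\r\n")
```

Encoded or encrypted payloads tend to use characters more uniformly than hand-written code, so a high entropy value is a plausible signal of obfuscation, though not proof of it.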
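The model building module 430 obtains a classification model from the static features, the dynamic features and a machine learning algorithm, and the classification model analyzes a file to be detected and produces a detection result.
In this embodiment, the user uploads the file to be detected to the system; the classification model performs Webshell detection on the file, produces a classification result, and generates a detection report for the user to view.
In this embodiment, the model building module 430 learns from the static features and the dynamic features with the machine learning algorithm to obtain the classification model. Specifically, the model building module 430 first normalizes the static features and the dynamic features to obtain a set of feature vectors, then applies the machine learning algorithm to the feature vector set to compute the classification model. Preferably, in this embodiment, the machine learning algorithm is an ensemble learning approach that combines multiple classification algorithms, which may include the random forest algorithm, the decision tree algorithm, the logistic algorithm, and so on. Combining multiple classification algorithms improves the stability and robustness of the model and thereby the detection accuracy of the classification model.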
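The two stages described above, normalization followed by an ensemble of classifiers, can be sketched with the standard library alone. The example assumes min-max scaling and simple majority voting, neither of which the patent specifies, and the three lambda classifiers are toy stand-ins for trained models such as the random forest, decision tree and logistic algorithms named in the text.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Classifier = Callable[[Vector], int]  # returns 1 for Webshell, 0 for benign


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[Vector]:
    """Scale every feature column into [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def majority_vote(classifiers: Sequence[Classifier], vector: Vector) -> int:
    """Label a vector as Webshell when more than half of the classifiers say so."""
    votes = sum(clf(vector) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0


# Toy stand-ins for trained base models (features: [entropy, sensitive calls]).
toy_classifiers = [
    lambda v: int(v[0] > 0.5),
    lambda v: int(v[1] > 0.5),
    lambda v: int(v[0] + v[1] > 1.0),
]

raw = [[3.1, 0.0], [5.9, 4.0], [4.5, 1.0]]
normalized = min_max_normalize(raw)
labels = [majority_vote(toy_classifiers, v) for v in normalized]
```

Voting means a single base model that is fooled by one feature does not decide the outcome alone, which is one way to read the stability and robustness claim.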
Note that in this embodiment, after the classification model has been built, data that was not used for learning can be used to test the model's error rate, false positive rate, false negative rate, and so on. The classification model is then adjusted according to the test results, for example by adjusting the proportion, number and types of positive and negative samples in the sample files, which improves the accuracy of the classification model and optimizes it.
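The three rates mentioned above can be computed from held-out labels and predictions as follows. This is a minimal sketch: the function name `evaluate` and the convention that label 1 means Webshell are assumptions for the example.

```python
def evaluate(y_true, y_pred):
    """Return error, false positive and false negative rates (label 1 = Webshell)."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as Webshell
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Webshell missed
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return {
        "error_rate": (fp + fn) / len(pairs) if pairs else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


# Five held-out files: two Webshells and three benign pages.
metrics = evaluate([1, 1, 0, 0, 0], [1, 0, 0, 1, 0])
```

A high false negative rate would suggest adding more Webshell samples, while a high false positive rate would suggest adding more benign code from the target site.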
Further, the Webshell detection system 400 based on machine learning and dynamic and static analysis also includes a model update module 440. When the file to be detected matches the characteristics of a Webshell, the model update module 440 performs machine learning again on the file to be detected together with the sample files to update the classification model.
In this embodiment, the user can use the Webshell detection system 400 to perform Webshell detection on an unknown file (that is, the file to be detected). When the file is detected to be a malicious Webshell, it is added to the malicious sample database and machine learning is performed again together with the earlier sample files to optimize and update the classification model. The specific flow of Webshell detection with the system is shown in FIG. 4 and includes:
Step S101: acquire the file to be detected.
Specifically, the user connects to the system and uploads the file to be detected, and the system obtains the file through the sample acquisition module 410.
Step S102: extract the static features and dynamic features of the file to be detected.
In this embodiment, the system automatically extracts the dynamic and static features of the file to be detected through the feature extraction module 420.
Step S103: analyze the file to be detected with the classification model and obtain a detection result.
Specifically, once feature extraction for the file to be detected is complete, the established classification model determines whether the file is a malicious Webshell and produces the detection result, which is then combined with the dynamic and static features extracted by the feature extraction module 420 to form a detection report for the user to view. For example, the detection report may show the percentage likelihood that the file is a malicious Webshell and the extracted features (such as malicious functions, file operation behavior, and blacklisted characters that appear).
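The detection report described in step S103 could be represented as a small record that carries the likelihood, the extracted features and any blacklist hits. The sketch below is only one possible shape: the class name `DetectionReport`, its fields, and the 0.5 decision threshold are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DetectionReport:
    file_name: str
    webshell_probability: float  # value in [0, 1] produced by the classifier
    features: Dict[str, float] = field(default_factory=dict)
    blacklist_hits: List[str] = field(default_factory=list)

    def render(self, threshold: float = 0.5) -> str:
        """Format the report as plain text for the user."""
        verdict = "Webshell" if self.webshell_probability >= threshold else "benign"
        lines = [
            f"File: {self.file_name}",
            f"Verdict: {verdict} ({self.webshell_probability:.0%} likelihood)",
        ]
        lines += [f"  {name}: {value}" for name, value in sorted(self.features.items())]
        if self.blacklist_hits:
            lines.append("Blacklisted strings: " + ", ".join(self.blacklist_hits))
        return "\n".join(lines)


report = DetectionReport("upload.php", 0.93, {"sensitive_call_count": 2}, ["eval"])
text = report.render()
```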
Second Embodiment
FIG. 5 is a schematic flowchart of the Webshell detection method based on machine learning and dynamic and static analysis provided by the second embodiment of the invention. Note that the method described in this embodiment is not limited to the specific order shown in FIG. 5 and described below. Its basic principle and technical effects are the same as those of the first embodiment; for brevity, for anything not mentioned in this embodiment, refer to the corresponding content of the first embodiment. In other embodiments, the order of some steps of the method may be exchanged according to actual needs, and some steps may be omitted or deleted. The specific flow shown in FIG. 5 is described in detail below.
Step S201: acquire sample files.
Step S201 may be performed by the sample acquisition module 410 described above.
Step S202: extract the static features and dynamic features of the sample files.
Step S202 may be performed by the feature extraction module 420 described above.
As shown in FIG. 6, in this embodiment step S202 includes the following sub-steps:
Sub-step S2021: perform static analysis on the sample files to obtain the static features, where the static features include document features, basic function features and file behavior features of the sample files.
Sub-step S2021 may be performed by the static analysis module 421 described above.
Sub-step S2022: perform dynamic analysis on the sample files to obtain the dynamic features, where the dynamic features include file inclusion operation features, sensitive function execution features and sensitive string features.
Sub-step S2022 may be performed by the dynamic analysis module 422 described above.
Note that in this embodiment the order of sub-steps S2021 and S2022 is not limited, and they may also be performed simultaneously.
Step S203: obtain a classification model from the static features, the dynamic features and a machine learning algorithm; the classification model analyzes a file to be detected and produces a detection result.
Step S203 may be performed by the model building module 430 described above.
Step S204: when the file to be detected is confirmed to be a Webshell after detection, perform machine learning again on the file to be detected together with the sample files to update the classification model.
Step S204 may be performed by the model update module 440 described above.
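Step S204 amounts to a feedback loop: a file flagged as a Webshell joins the sample set and the model is retrained. The sketch below shows that loop with a deliberately trivial one-feature trainer. `train_threshold_model` and `update_on_detection` are hypothetical names, and the midpoint-threshold model is a toy that assumes malicious files score higher on the feature; it stands in for whatever learner is actually used.

```python
from statistics import mean
from typing import Callable, List, Optional

Model = Callable[[float], int]  # returns 1 for Webshell, 0 for benign


def train_threshold_model(values: List[float], labels: List[int]) -> Model:
    """Toy trainer: split a single feature at the midpoint of the two class means."""
    benign = mean(v for v, y in zip(values, labels) if y == 0)
    malicious = mean(v for v, y in zip(values, labels) if y == 1)
    cut = (benign + malicious) / 2
    return lambda value: int(value > cut)


def update_on_detection(values, labels, new_value, model) -> Optional[Model]:
    """Retrain only when the current model flags the new file as a Webshell."""
    if model(new_value) != 1:
        return None
    values.append(new_value)  # the detected file joins the sample set
    labels.append(1)
    return train_threshold_model(values, labels)


values, labels = [1.0, 2.0, 7.0], [0, 0, 1]
model = train_threshold_model(values, labels)  # threshold at 4.25
updated = update_on_detection(values, labels, 5.0, model)  # threshold moves to 3.75
```

In this toy run the retrained model flags a file with feature value 4.0 that the original model would have passed, which illustrates how newly confirmed samples can shift the decision boundary.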
In summary, the Webshell detection method and system based on machine learning and dynamic and static analysis provided by the embodiments of the invention acquire sample files, perform static analysis and dynamic analysis on them to extract their static features and dynamic features respectively, and learn from the static and dynamic features with a machine learning algorithm to obtain a classification model; the classification model analyzes a file to be detected and produces a detection result. Further, when the file to be detected is confirmed to be a Webshell after detection, it is added to the sample database and machine learning is performed again together with the earlier sample files to update the classification model. The embodiments combine dynamic and static analysis, so the extracted features are more comprehensive. A machine learning algorithm that combines multiple classification algorithms learns from a large number of Webshell samples and normal web page samples to form the classification model, which makes the model more stable and the classification more accurate. Machine learning can handle complex multi-feature classification, so the features used for detection are not limited to a single signature. With this classification model, users can effectively detect Webshells and their variants, predict new types of Webshell, and cope better with text obfuscation, which compensates for the shortcomings of traditional signature-matching detection.
Note that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprising", "including" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within its protection scope.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710903110.2A CN107659570B (en) | 2017-09-29 | 2017-09-29 | Webshell detection method and system based on machine learning and dynamic and static analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710903110.2A CN107659570B (en) | 2017-09-29 | 2017-09-29 | Webshell detection method and system based on machine learning and dynamic and static analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107659570A (en) | 2018-02-02 |
CN107659570B (en) | 2020-09-15 |
Family
ID=61116698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710903110.2A Active CN107659570B (en) | 2017-09-29 | 2017-09-29 | Webshell detection method and system based on machine learning and dynamic and static analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107659570B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334781B (en) * | 2018-03-07 | 2020-04-14 | 腾讯科技(深圳)有限公司 | Virus detection method, device, computer readable storage medium and computer equipment |
CN110198291B (en) * | 2018-03-15 | 2022-02-18 | 腾讯科技(深圳)有限公司 | Webpage backdoor detection method, device, terminal and storage medium |
CN108446561A (en) * | 2018-03-21 | 2018-08-24 | 河北师范大学 | A kind of malicious code behavioural characteristic extracting method |
CN108804921A (en) * | 2018-05-29 | 2018-11-13 | 中国科学院信息工程研究所 | The going of a kind of PowerShell codes obscures method and device |
CN110619211A (en) * | 2018-06-20 | 2019-12-27 | 深信服科技股份有限公司 | Malicious software identification method, system and related device based on dynamic characteristics |
CN108985061B (en) * | 2018-07-05 | 2021-10-01 | 北京大学 | A webshell detection method based on model fusion |
CN109598124A (en) * | 2018-12-11 | 2019-04-09 | 厦门服云信息科技有限公司 | A kind of webshell detection method and device |
CN109600382B (en) * | 2018-12-19 | 2021-07-13 | 北京知道创宇信息技术股份有限公司 | Webshell detection method and device and HMM model training method and device |
CN109933977A (en) * | 2019-03-12 | 2019-06-25 | 北京神州绿盟信息安全科技股份有限公司 | A kind of method and device detecting webshell data |
CN110086788A (en) * | 2019-04-17 | 2019-08-02 | 杭州安恒信息技术股份有限公司 | Deep learning WebShell means of defence based on cloud WAF |
CN110210225A (en) * | 2019-05-27 | 2019-09-06 | 四川大学 | A kind of intelligentized Docker container malicious file detection method and device |
CN110750789B (en) * | 2019-10-18 | 2021-07-20 | 杭州奇盾信息技术有限公司 | De-obfuscation method, de-obfuscation device, computer apparatus, and storage medium |
CN111163095B (en) * | 2019-12-31 | 2022-08-30 | 奇安信科技集团股份有限公司 | Network attack analysis method, network attack analysis device, computing device, and medium |
CN113110986A (en) * | 2020-01-13 | 2021-07-13 | 深信服科技股份有限公司 | WebShell script file detection method and system |
CN113111346A (en) * | 2020-01-13 | 2021-07-13 | 深信服科技股份有限公司 | Multi-engine WebShell script file detection method and system |
CN111385295B (en) * | 2020-03-04 | 2022-11-22 | 深信服科技股份有限公司 | WebShell detection method, device, equipment and storage medium |
CN111931187A (en) * | 2020-08-13 | 2020-11-13 | 深信服科技股份有限公司 | Component vulnerability detection method, device, equipment and readable storage medium |
CN112597498A (en) * | 2020-12-29 | 2021-04-02 | 天津睿邦安通技术有限公司 | Webshell detection method, system and device and readable storage medium |
CN112883373A (en) * | 2020-12-30 | 2021-06-01 | 国药集团基因科技有限公司 | PHP type WebShell detection method and detection system thereof |
CN112926054B (en) * | 2021-02-22 | 2023-10-03 | 亚信科技(成都)有限公司 | Malicious file detection method, device, equipment and storage medium |
CN112948834A (en) * | 2021-03-25 | 2021-06-11 | 国药(武汉)医学实验室有限公司 | Deep ensemble learning model construction method for malicious WebShell detection |
CN113239352B (en) * | 2021-04-06 | 2022-05-17 | 中国科学院信息工程研究所 | Webshell detection method and system |
CN116991978B (en) * | 2023-09-26 | 2024-01-02 | 杭州今元标矩科技有限公司 | CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663296B (en) * | 2012-03-31 | 2015-01-07 | 杭州安恒信息技术有限公司 | Intelligent detection method for Java script malicious code facing to the webpage |
CN102779249B (en) * | 2012-06-28 | 2015-07-29 | 北京奇虎科技有限公司 | Malware detection methods and scanning engine |
CN103532949B (en) * | 2013-10-14 | 2017-06-09 | 刘胜利 | Self adaptation wooden horse communication behavior detection method based on dynamical feedback |
CN107169351A (en) * | 2017-05-11 | 2017-09-15 | 北京理工大学 | With reference to the Android unknown malware detection methods of dynamic behaviour feature |
2017-09-29: CN application CN201710903110.2A, patent CN107659570B (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN107659570A (en) | 2018-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107659570B (en) | Webshell detection method and system based on machine learning and dynamic and static analysis | |
US10972495B2 (en) | Methods and apparatus for detecting and identifying malware by mapping feature data into a semantic space | |
CN110826064B (en) | A method, device, electronic device and storage medium for processing malicious files | |
US11381580B2 (en) | Machine learning classification using Markov modeling | |
US9083729B1 (en) | Systems and methods for determining that uniform resource locators are malicious | |
CN110532176B (en) | Formal verification method of intelligent contract, electronic device and storage medium | |
US20160261618A1 (en) | System and method for selectively evolving phishing detection rules | |
US20220156372A1 (en) | Cybersecurity system evaluation and configuration | |
CN108881294A (en) | Attack source IP portrait generation method and device based on attack | |
CN109862003B (en) | Method, device, system and storage medium for generating local threat intelligence library | |
US20190297092A1 (en) | Access classification device, access classification method, and recording medium | |
CN107948168A (en) | Page detection method and device | |
CN107395650B (en) | Method and device for identifying Trojan back connection based on sandbox detection file | |
CN107612908B (en) | Web page tampering monitoring method and device | |
RU2722692C1 (en) | Method and system for detecting malicious files in a non-isolated medium | |
EP3799367B1 (en) | Generation device, generation method, and generation program | |
CN112148305A (en) | Application detection method and device, computer equipment and readable storage medium | |
CN107766467B (en) | Information detection method and device, electronic equipment and storage medium | |
Deore et al. | MDFRCNN: Malware detection using faster region proposals convolution neural network | |
US20180341770A1 (en) | Anomaly detection method and anomaly detection apparatus | |
CN116015842A (en) | A network attack detection method based on user access behavior | |
US20220237289A1 (en) | Automated malware classification with human-readable explanations | |
US10754950B2 (en) | Entity resolution-based malicious file detection | |
CN114462040A (en) | Malicious software detection model training method, malicious software detection method and malicious software detection device | |
CN111382432A (en) | Malware detection and classification model generation method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 310000 No. 188 Lianhui Street, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province; Applicant after: Dbappsecurity Co.,Ltd.; Address before: Zhejiang Zhongcai Building No. 68 Binjiang District road Hangzhou City, Zhejiang Province, the 310051 and 15 layer; Applicant before: DBAPPSECURITY Co.,Ltd. |
| | GR01 | Patent grant | |
| | EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20180202; Assignee: Hangzhou Anheng Information Security Technology Co.,Ltd.; Assignor: Dbappsecurity Co.,Ltd.; Contract record no.: X2024980043369; Denomination of invention: Webshell detection method and system based on machine learning and dynamic and static analysis; Granted publication date: 20200915; License type: Common License; Record date: 20241231 |