CN108959930A - Malice PDF detection method, system, data storage device and detection program - Google Patents
Malice PDF detection method, system, data storage device and detection program
- Publication number
- CN108959930A (application CN201810832905.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/565—Static detection by checking file integrity
Abstract
The invention discloses a malicious PDF detection method, system, data storage device and detection program, belonging to the technical field of information security. The method converts each PDF file to be checked into a byte sequence and calculates its information entropy; sets a threshold α from the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value; compares the information entropy of each PDF file with α, treating files whose entropy is higher than α as normal files and files whose entropy is lower than α as suspicious files; uses Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks; and classifies the files with the C5.0 decision tree algorithm. The invention addresses the narrow detection scope and the high detection time cost of existing models.
Description
Technical field
The invention applies to the detection of malicious PDF files in the field of information security. In particular, it relates to a malicious PDF detection method, system, data storage device and detection program.
Background
The Portable Document Format (PDF) is an electronic document format released by Adobe Systems in 1993. Because PDF is popular, structurally flexible and feature-rich, more and more cybercriminals use PDF files for information theft, extortion and other cybercrime. In recent years, advanced persistent threat (APT) attacks on commercial organizations and government agencies have occurred repeatedly, and malicious PDF files are an important carrier of APT attacks: the attack is completed by executing malicious code embedded in the file. Despite the efforts of software vendors to prevent and fix vulnerabilities, PDF software is still frequently exposed to zero-day attacks, especially attacks that combine the PDF file format with third-party technologies such as JavaScript or Flash, which makes it increasingly difficult to produce temporary patches. In addition, because the PDF file structure is complex and attackers use various code obfuscation techniques, it is difficult for antivirus software to detect new malicious PDF files.
Analysis of malicious PDF files shows that attacks against existing PDF vulnerabilities fall into two main types: JavaScript-based attacks and non-JavaScript-based attacks. JavaScript-based attacks exploit vulnerabilities in the PDF reader to divert the execution flow to embedded malicious JavaScript code. Non-JavaScript-based attacks mainly abuse PDF features such as "/Launch", "/GoTo" and "/URI" to open remote resources automatically, which increases the threat the Internet poses to the client.
At present, most antivirus software detects viruses with heuristics or string matching, but these approaches cannot handle polymorphic attacks effectively. To address this problem, recent research has focused on two directions:
(1) Using the JavaScript embedded in PDF files: JavaScript features are extracted through static and dynamic analysis and then classified with machine learning. Such methods can handle attacks based on malicious JavaScript, but they are susceptible to code obfuscation.
(2) Using the structural information of PDF files to detect malicious PDF files. These methods do not analyze the attack code or exploit carried by the file. Their advantage over JavaScript analysis is that they can detect non-JavaScript attacks and are not affected by code obfuscation. However, improving the robustness of the model remains a major challenge for detection methods based on structural information.
Malicious PDF detection based on the methods above can usually detect only attacks of a single type, and the time cost of the model is high.
Summary of the invention
To solve the above problems, the object of the invention is to provide a malicious PDF detection method, system, data storage device and detection program.
To achieve this object, the technical solution of the invention is as follows:
A malicious PDF detection method, comprising at least the following steps:
Step 1: Convert the PDF files to be checked into byte sequences and calculate the information entropy of each PDF file.
Step 2: Set a threshold α according to the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value. Compare the information entropy of each PDF file with α; treat PDF files whose entropy is higher than α as normal files and PDF files whose entropy is lower than α as suspicious files.
Step 3: Use Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks.
Step 4: Classify using the C5.0 decision tree algorithm.
Further, Step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, then calculate the information entropy of each PDF file.
Further, Step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, then analyze the JavaScript code of the suspicious file and search for malicious features.
Further, Step 4 is specifically: first represent each PDF file as a vector composed of the general structural features, the dynamic structural features and the JavaScript features; then feed the vectors and their class labels into the C5.0 decision tree for classification.
Another object of the invention is to provide a malicious PDF detection system, comprising:
An information entropy calculation module, which converts the PDF files to be checked into byte sequences and calculates the information entropy of each PDF file;
A screening module, which sets the threshold α according to the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value, compares the information entropy of each PDF file with α, and treats files whose entropy is higher than α as normal files and files whose entropy is lower than α as suspicious files;
An analysis module, which uses Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks;
A classification module, which classifies using the C5.0 decision tree algorithm.
Another object of the invention is to provide a data storage device comprising instructions which, when run on a computer, cause the computer to execute the above malicious PDF detection method.
Another object of the invention is to provide a detection program that implements the above malicious PDF detection method.
The advantages and positive effects of the invention are:
The invention combines the information entropy, JavaScript features and structural features of PDF files and classifies them with the C5.0 decision tree algorithm. The method achieves high detection accuracy, greatly reduces detection time and is more practical.
Description of drawings
Fig. 1 is a flowchart of the preferred embodiment of the invention.
Detailed description
To further explain the content, features and effects of the invention, the following embodiment is described in detail with reference to the accompanying drawing:
As shown in Fig. 1, a malicious PDF detection method comprises the following steps:
Step 1: Convert the PDF files in the data set into byte sequences and calculate the information entropy of each PDF file.
The specific steps are as follows:
(1) First, use PDFParser to convert the PDF files in the data set into binary.
(2) Then calculate the information entropy of the file using Formula 1:
$$H(x) = -\sum_{i=1}^{N} p_i \log_2 p_i \tag{1}$$
where x is the file; N is the total number of distinct bytes after the file is converted into a byte sequence; i indexes the i-th distinct byte in the byte sequence; and p_i is the probability with which byte i occurs.
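The entropy calculation in Step 1 can be illustrated with a minimal Python sketch. It assumes that p_i is the relative frequency of each distinct byte value and that the logarithm is base 2, so the result lies between 0 and 8 bits per byte; the function name `byte_entropy` is hypothetical and is not part of PDFParser or any tool named in the patent.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each distinct byte value
    # p * log2(1/p) summed over distinct byte values, with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(byte_entropy(bytes(range(256))))   # every byte value once: maximal entropy
    print(byte_entropy(b"\x00" * 1024))      # one repeated byte value: minimal entropy
```

To apply it to a file, the bytes can be read with `open(path, "rb").read()` and passed to the function.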
Step 2: Compare the information entropy of each file with the threshold α. Files whose entropy is higher than α are treated as normal files; files whose entropy is lower than α are treated as suspicious files.
The specific steps are as follows:
(1) Based on repeated experimental simulations, the information entropy threshold α is set to 7.74.
(2) Substitute the information entropy H(x) obtained in Step 1 and the threshold α into Formula 2 to obtain their difference. If the difference is greater than 0, the PDF file is treated as a suspicious file and passed to Step 3; otherwise it is output as a normal file.
$$\Delta H = \alpha - H(x) \tag{2}$$
ΔH: the difference between the threshold α and the information entropy H(x) of the PDF file under test.
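As a hedged illustration of the screening rule, the sketch below applies Formula 2 with α = 7.74 to raw bytes. The helper names `byte_entropy` and `triage` are hypothetical, and the entropy definition (base-2 Shannon entropy over byte frequencies) is an assumption consistent with a threshold below 8.

```python
import math
from collections import Counter

ALPHA = 7.74  # entropy threshold given in the description


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())


def triage(data: bytes, alpha: float = ALPHA) -> str:
    """Apply Formula 2: delta_h = alpha - H(x); a positive difference means suspicious."""
    delta_h = alpha - byte_entropy(data)
    return "suspicious" if delta_h > 0 else "normal"


if __name__ == "__main__":
    print(triage(b"%PDF-1.4 short low-entropy sample"))
    print(triage(bytes(range(256)) * 4))
```

Only files labelled suspicious would go on to feature extraction, which is where the time saving claimed by the method comes from.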
Step 3: Use Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks.
The specific steps are as follows:
(1) First, use Origami to analyze the structure of the PDF file and search for malicious features and general structural features. The malicious features include '/JS', '/JavaScript', '/GoTo', '/GoToR', '/GoToE', '/OpenAction' and '/SubmitForm'; the general structural features include the file size and the number of indirect objects.
(2) Then analyze the JavaScript code of the PDF file and search for malicious features. The malicious features include substring, fromCharCode, stringcount, document.write, document.createElement, eval, setTimeout, eval_length and max_string.
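The feature search in Step 3 can be approximated with the following Python sketch. It does not use Origami: it is a simplified stand-in that scans the raw bytes with regular expressions, so it would miss anything hidden inside compressed or encoded streams. The names `count_token` and `extract_features` are hypothetical, and the derived features (stringcount, eval_length, max_string) are left out because the description does not define how they are computed.

```python
import re

STRUCTURAL_NAMES = ["/JS", "/JavaScript", "/GoTo", "/GoToR", "/GoToE",
                    "/OpenAction", "/SubmitForm"]
JS_TOKENS = ["substring", "fromCharCode", "document.write",
             "document.createElement", "eval", "setTimeout"]
# "<number> <generation> obj" starts an indirect object in the PDF syntax
INDIRECT_OBJECT = re.compile(rb"\d+\s+\d+\s+obj\b")


def count_token(data: bytes, token: str) -> int:
    """Count occurrences of token that are not followed by a letter or digit."""
    pattern = re.compile(re.escape(token.encode("ascii")) + rb"(?![A-Za-z0-9])")
    return len(pattern.findall(data))


def extract_features(data: bytes) -> dict:
    """Build a flat feature dictionary from raw PDF bytes."""
    features = {
        "file_size": len(data),
        "indirect_objects": len(INDIRECT_OBJECT.findall(data)),
    }
    for token in STRUCTURAL_NAMES + JS_TOKENS:
        features[token] = count_token(data, token)
    return features


if __name__ == "__main__":
    sample = (b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
              b"2 0 obj\n<< /S /JavaScript /JS (eval(String.fromCharCode(97))) >>\nendobj\n")
    print(extract_features(sample))
```

The negative lookahead keeps '/GoTo' from also matching '/GoToR' or '/GoToE', so each name is counted separately.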
Step 4: Select the C5.0 decision tree algorithm for classification.
The specific steps are as follows:
(1) Let S be the set of feature samples, comprising the structural feature set S_1 and the JavaScript feature set S_2. Taking the structural features as an example, the metadata type variable C has K classes, and the number of samples belonging to class C_i is freq(C_i, S_1). The information entropy Info(S_1) of the structural feature set S_1 is calculated with Formula 3:
$$\mathrm{Info}(S_1) = -\sum_{i=1}^{K} \frac{\mathrm{freq}(C_i, S_1)}{|S_1|} \log_2 \frac{\mathrm{freq}(C_i, S_1)}{|S_1|} \tag{3}$$
where |S_1| is the number of elements in the structural feature set S_1.
(2) For a feature attribute T with N classes, the conditional entropy Info(T) of attribute T is calculated with Formula 4:
$$\mathrm{Info}(T) = \sum_{i=1}^{N} \frac{|T_i|}{|S_1|} \, \mathrm{Info}(T_i) \tag{4}$$
where T_i is the i-th class of the feature attribute, that is, the subset of S_1 that falls in that class.
(3) The information gain Gain(T) of the attribute variable T is calculated with Formula 5:
$$\mathrm{Gain}(T) = \mathrm{Info}(S_1) - \mathrm{Info}(T) \tag{5}$$
(4) Nodes are generated using the information gain ratio, Formula 6:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{Info}(A)} \tag{6}$$
where Gain(A) is the information gain of the child nodes produced by split A, and Info(A) is an indicator of the number of child nodes produced by split A: the more child nodes after the split, the larger Info(A).
(5) After the tree is generated, it is pruned with a tree-rule-based method.
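The split criterion in Step 4 can be sketched in Python for a single categorical attribute. This is a C4.5-style illustration under stated assumptions, not the C5.0 implementation itself: the denominator Info(A) of Formula 6 is taken to be the entropy of the attribute's own value distribution (the usual split information), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items) -> float:
    """Entropy (base 2) of the empirical distribution of items, as in Formula 3."""
    total = len(items)
    return sum((c / total) * math.log2(total / c) for c in Counter(items).values())


def gain_ratio(values, labels) -> float:
    """Gain ratio of one categorical attribute with respect to the class labels."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Formula 4: weighted entropy of the labels inside each partition
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - conditional   # Formula 5
    split_info = entropy(values)           # assumed meaning of Info(A) in Formula 6
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    has_js = ["yes", "yes", "no", "no"]
    labels = ["malicious", "malicious", "benign", "benign"]
    print(gain_ratio(has_js, labels))
```

A tree builder would evaluate this ratio for every candidate attribute at a node and split on the attribute with the highest value.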
A malicious PDF detection system, comprising:
An information entropy calculation module, which converts the PDF files to be checked into byte sequences and calculates the information entropy of each PDF file;
A screening module, which sets the threshold α according to the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value, compares the information entropy of each PDF file with α, and treats files whose entropy is higher than α as normal files and files whose entropy is lower than α as suspicious files;
An analysis module, which uses Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks;
A classification module, which classifies using the C5.0 decision tree algorithm.
A data storage device comprising instructions which, when run on a computer, cause the computer to execute the following malicious PDF detection method:
Step 1: Convert the PDF files to be checked into byte sequences and calculate the information entropy of each PDF file.
Step 2: Set a threshold α according to the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value. Compare the information entropy of each PDF file with α; treat PDF files whose entropy is higher than α as normal files and PDF files whose entropy is lower than α as suspicious files.
Step 3: Use Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks.
Step 4: Classify using the C5.0 decision tree algorithm.
Preferably, Step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, then calculate the information entropy of each PDF file.
Preferably, Step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, then analyze the JavaScript code of the suspicious file and search for malicious features.
Preferably, Step 4 is specifically: first represent each PDF file as a vector composed of the general structural features, the dynamic structural features and the JavaScript features; then feed the vectors and their class labels into the C5.0 decision tree for classification.
A detection program that implements the following malicious PDF detection method:
Step 1: Convert the PDF files to be checked into byte sequences and calculate the information entropy of each PDF file.
Step 2: Set a threshold α according to the maximum, minimum and mean information entropy values measured on malicious and benign PDF files, together with an empirical value. Compare the information entropy of each PDF file with α; treat PDF files whose entropy is higher than α as normal files and PDF files whose entropy is lower than α as suspicious files.
Step 3: Use Origami to analyze the suspicious files and extract the JavaScript and structural features commonly used in malicious attacks.
Step 4: Classify using the C5.0 decision tree algorithm.
Preferably, Step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, then calculate the information entropy of each PDF file.
Preferably, Step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, then analyze the JavaScript code of the suspicious file and search for malicious features.
Preferably, Step 4 is specifically: first represent each PDF file as a vector composed of the general structural features, the dynamic structural features and the JavaScript features; then feed the vectors and their class labels into the C5.0 decision tree for classification.
Example:
To verify the effect of the method, the inventors designed a corresponding experiment. On the one hand, the experiment examines how different parameters affect the detection performance of the model; on the other hand, the method is compared with two widely used malicious PDF detection models: a JavaScript-based detection model (PJScan) and a detection model based on structural features (PDFMS).
The test data set is taken from Contagiodump and contains 11207 malicious files and 9745 normal files. JavaScript is embedded in 10310 of the malicious samples, about 92% of the malicious samples. The formal evaluation uses 10-fold cross-validation repeated 10 times.
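The evaluation protocol, 10-fold cross-validation repeated 10 times, can be sketched as an index generator in plain Python. The function name `repeated_kfold` and the seed are hypothetical, and the folds here are not stratified by class; the patent does not say whether stratification was used.

```python
import random


def repeated_kfold(n_samples: int, k: int = 10, repeats: int = 10, seed: int = 0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                        # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]   # k disjoint folds of near-equal size
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx


if __name__ == "__main__":
    n = 11207 + 9745  # malicious plus normal files in the data set
    r, f, train_idx, test_idx = next(repeated_kfold(n))
    print(r, f, len(train_idx), len(test_idx))
```

Each of the 100 resulting splits would train a classifier on `train_idx` and score it on `test_idx`, with the reported rates averaged over all splits.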
Comparison 1: This comparison examines detection performance under different attack types, in order to evaluate whether the malicious PDF detection method based on JavaScript and structural features under information entropy improves detection accuracy. The experimental results are shown in Table 1. The proposed method reaches a detection rate (TPR) of 98.73% with a false detection rate (FPR) of 1.8%. PJScan has a detection rate of 71.94% and a false detection rate of 1.1%. PDFMS has a detection rate of 99.55% and a false detection rate of 2.5%. Although the false detection rate of the proposed method is higher than that of PJScan (1.8% versus 1.1%), its detection rate is 26.79 percentage points higher than that of PJScan. The proposed method is therefore reasonable and effective: it detects malicious PDF files that rely on malicious JavaScript attacks and also detects malicious PDF files that rely on non-JavaScript attacks.
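Table 1. Comparison of detection accuracy and detection time of different algorithms
Method | TPR (%) | FPR (%) | T (s) |
---|---|---|---|
Proposed method | 98.73 | 1.8 | 1857 |
PJScan | 71.94 | 1.1 | 2247 |
PDFMS | 99.55 | 2.5 | 2330 |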
Comparison 2: This comparison examines detection time under different attack types, in order to evaluate whether detecting malicious files with the JavaScript and structural features of PDF reduces the time cost of detection. Table 1 compares the detection rate (TPR), false detection rate (FPR) and detection time (T(s)) of the proposed method with those of PDFMS and PJScan. PDFMS takes the longest detection time, 2330 s; PJScan is in the middle at 2247 s; the proposed method takes the least, 1857 s, which is 473 s less than PDFMS and 390 s less than PJScan. In terms of detection time, the proposed method therefore outperforms both PDFMS and PJScan.
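For reference, the two rates reported in Table 1 follow from confusion-matrix counts as in the sketch below. The counts in the usage example are purely illustrative placeholders, not figures from the patent, and the function name `detection_rates` is hypothetical.

```python
def detection_rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (TPR, FPR) from confusion-matrix counts.

    TPR = malicious files flagged / all malicious files
    FPR = benign files flagged / all benign files
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


if __name__ == "__main__":
    # Illustrative counts only: 100 malicious and 100 benign files.
    tpr, fpr = detection_rates(tp=90, fn=10, fp=5, tn=95)
    print(f"TPR = {tpr:.2%}, FPR = {fpr:.2%}")
```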
The basic principle of the malicious PDF detection method based on JavaScript and structural features under information entropy is as follows. To reduce time cost, information entropy is first used to separate suspicious files from normal files, and only the suspicious files are examined further. To widen the detection scope, both structural features and JavaScript features are extracted during detection. Finally, the C5.0 decision tree algorithm is used for classification.
The embodiments of the invention have been described in detail above, but this description covers only preferred embodiments and must not be regarded as limiting the scope of the invention. All equivalent changes and improvements made within the scope of the application remain covered by this patent.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810832905.3A CN108959930A (en) | 2018-07-26 | 2018-07-26 | Malice PDF detection method, system, data storage device and detection program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810832905.3A CN108959930A (en) | 2018-07-26 | 2018-07-26 | Malice PDF detection method, system, data storage device and detection program |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108959930A (en) | 2018-12-07 |
Family
ID=64464972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810832905.3A Pending CN108959930A (en) | 2018-07-26 | 2018-07-26 | Malice PDF detection method, system, data storage device and detection program |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959930A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287992A (en) * | 2017-01-07 | 2018-07-17 | 长沙有干货网络技术有限公司 | A kind of malicious program detection system of the computer learning based on Android |
CN107180192A (en) * | 2017-05-09 | 2017-09-19 | 北京理工大学 | Android malicious application detection method and system based on multi-feature fusion |
Non-Patent Citations (3)
Title |
---|
DAVIDE MAIORCA等: "A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files", 《2015 INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS SECURITY AND PRIVACY (ICISSP)》 * |
HIMANSHU PAREEK 等: "Entropy and n-gram Analysis of Malicious PDF Documents", 《INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY》 * |
武雪峰: "恶意PDF文档的分析", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110069927A (en) * | 2019-04-22 | 2019-07-30 | 中国民航大学 | Malice APK detection method, system, data storage device and detection program |
CN110784561A (en) * | 2019-09-30 | 2020-02-11 | 奇安信科技集团股份有限公司 | IPv6 address segmentation method and similar site or link address set searching method |
CN112231701A (en) * | 2020-09-29 | 2021-01-15 | 广州威尔森信息科技有限公司 | PDF file processing method and device |
CN116578536A (en) * | 2023-07-12 | 2023-08-11 | 北京安天网络安全技术有限公司 | File detection method, storage medium and electronic device |
CN116578536B (en) * | 2023-07-12 | 2023-09-22 | 北京安天网络安全技术有限公司 | File detection method, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181207 |