CN108959930A - Malicious PDF detection method, system, data storage device and detection program - Google Patents

Malicious PDF detection method, system, data storage device and detection program

Info

Publication number
CN108959930A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810832905.3A
Other languages
Chinese (zh)
Inventor
李国�
黄永健
王静
徐俊洁
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China
Priority to CN201810832905.3A
Publication of CN108959930A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/565Static detection by checking file integrity

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a malicious PDF detection method, system, data storage device and detection program, belonging to the technical field of information security. The malicious PDF detection method is as follows: convert each PDF file to be checked into a byte sequence and calculate the information entropy of each PDF file; set a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value; compare the information entropy of each PDF file with the threshold α, treating PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files; use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks; and classify using the C5.0 decision tree algorithm. The invention addresses problems such as a narrow detection range and the high time consumption of model-based detection.

Description

Malicious PDF detection method, system, data storage device and detection program

Technical Field

The invention is applied to the field of detecting malicious PDF files in information security. In particular, it relates to a malicious PDF detection method, system, data storage device and detection program.

Background Art

The Portable Document Format (PDF) is an electronic document format released by Adobe Systems in 1993. Because PDF is highly popular, structurally flexible and functionally versatile, a growing number of cybercriminals use PDF files to carry out cybercrimes such as information theft and extortion. In recent years, advanced persistent threat (APT) attacks against commercial organizations and government agencies have occurred from time to time, and malicious PDF files are an important carrier of such attacks: the attack is completed by executing malicious code embedded in the file. Despite the efforts of software vendors to prevent and fix such problems, PDF software remains frequently vulnerable to zero-day attacks, especially attacks that exploit the PDF file format together with third-party technologies such as JavaScript or Flash, which makes creating temporary patches increasingly difficult. In addition, because the architecture of PDF files is complex and attackers use a variety of code obfuscation techniques, it is difficult for antivirus software to detect new kinds of malicious PDF files.

Analysis of malicious PDF files shows that, for existing PDF vulnerabilities, the main attack methods are JavaScript-based attacks and non-JavaScript-based attacks. JavaScript-based attacks exploit vulnerabilities in the PDF reader to divert the execution flow to embedded malicious JavaScript code. Non-JavaScript-based attacks mainly abuse PDF features such as "/Launch", "/GoTo" and "/URI" to open remote resources automatically, which increases the threat the Internet poses to the client.

At present, most antivirus software relies on heuristics or string matching to detect and remove malware, but these approaches cannot effectively handle polymorphic attacks. To address this problem, recent research has focused on two approaches:

(1) Extract JavaScript features from the JavaScript embedded in PDF files through static and dynamic analysis, and then classify the files with machine learning. Such methods can cope with malicious JavaScript-based attacks but are susceptible to code obfuscation.

(2) Detect malicious PDF files from the structural information of the file, without analyzing the attack code or vulnerabilities it carries. Compared with JavaScript analysis, such methods have the advantage that they can detect non-JavaScript attacks and are not affected by code obfuscation. However, improving the robustness of the model remains a major challenge for structure-based malicious file detection.

Malicious PDF detection based on the above methods can usually detect only attacks of a single type, and the time consumption of the models is relatively high.

Summary of the Invention

To solve the above problems, the object of the invention is to provide a malicious PDF detection method, system, data storage device and detection program.

To achieve the above object, the technical solution of the invention is as follows:

A malicious PDF detection method, comprising at least the following steps:

Step 1: convert each PDF file to be checked into a byte sequence and calculate the information entropy of each PDF file;

Step 2: set a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value; compare the information entropy of each PDF file with the threshold α, treating PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;

Step 3: use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;

Step 4: classify using the C5.0 decision tree algorithm.

Further, step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, and then calculate the information entropy of each PDF file.

Further, step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, and then analyze the JavaScript code of the suspicious file and search for malicious features.

Further, step 4 is specifically: first represent each PDF file as a vector composed of general structural features, dynamic structural features and JavaScript features; then input the vectors and their class labels into the C5.0 decision tree for classification.

Another object of the invention is to provide a malicious PDF detection system, comprising:

an information entropy calculation module, which converts each PDF file to be checked into a byte sequence and calculates the information entropy of each PDF file;

a screening module, which sets a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value, compares the information entropy of each PDF file with the threshold α, and treats PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;

an analysis module, which uses Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;

a classification module, which classifies using the C5.0 decision tree algorithm.

Another object of the invention is to provide a data storage device comprising instructions which, when run on a computer, cause the computer to execute the above malicious PDF detection method.

Another object of the invention is to provide a detection program that implements the above malicious PDF detection method.

The advantages and positive effects of the invention are as follows:

The invention combines the information entropy, JavaScript features and structural features of PDF files and classifies them with the C5.0 decision tree algorithm. The method achieves high detection accuracy, greatly reduces detection time and improves practicality.

Description of the Drawings

Fig. 1 is a flowchart of a preferred embodiment of the invention.

Detailed Description

To provide a further understanding of the content, features and effects of the invention, the following embodiments are given and described in detail with reference to the accompanying drawing:

As shown in Fig. 1, a malicious PDF detection method comprises the following steps:

Step 1: convert the PDF files in the data set into byte sequences and calculate the information entropy of each PDF file.

The specific steps are as follows:

(1) First, use PDFParser to convert the PDF files in the data set into binary form.

(2) Then calculate the information entropy of each file using Formula 1:

H(x) = -\sum_{i=1}^{N} p_i \log_2 p_i    (1)

where x denotes the file; N denotes the total number of distinct byte values in the file after conversion into a byte sequence; i indexes the i-th distinct byte value; and p_i denotes the probability that byte value i occurs. The logarithm is taken to base 2 here (an assumption), so that H(x) lies between 0 and 8 bits per byte.
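The entropy calculation in step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it computes the Shannon entropy directly over the raw bytes of a file rather than over the output of PDFParser, it uses base-2 logarithms, and the function name `byte_entropy` is hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each distinct byte value
    # p_i = count_i / total; sum only over byte values that actually occur
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For a file on disk, a call such as `byte_entropy(open(path, "rb").read())` would apply the same calculation to the whole file, where `path` is whatever file is being checked.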

Step 2: compare the information entropy of each file with the threshold α, treating files whose information entropy is higher than α as normal files and files whose information entropy is lower than α as suspicious files.

The specific steps are as follows:

(1) Based on repeated experimental simulations, the information entropy threshold α is set to 7.74.

(2) Substitute the information entropy H(x) obtained in step 1 and the threshold α into Formula 2 to obtain their difference. If the difference is greater than 0, the PDF file is treated as a suspicious file and passed to step 3; otherwise it is output as a normal file.

ΔH = α - H(x)    (2)

ΔH: the difference between the threshold α and the information entropy H(x) of the PDF file under test.
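The screening rule of step 2 amounts to a single comparison. The sketch below assumes the entropy value has already been computed; the function name `screen` and the string labels it returns are hypothetical, while the threshold 7.74 and the sign convention of Formula 2 come from the text.

```python
ALPHA = 7.74  # information entropy threshold reported in the text


def screen(entropy: float, alpha: float = ALPHA) -> str:
    """Label a file from its byte entropy using Formula 2."""
    delta_h = alpha - entropy  # Formula 2: difference between threshold and entropy
    # A positive difference means the entropy is below the threshold.
    return "suspicious" if delta_h > 0 else "normal"
```

Only files labeled suspicious go on to feature extraction and classification, which is where the reduction in detection time described in the text comes from.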

Step 3: use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks.

The specific steps are as follows:

(1) First, use Origami to analyze the structure of the PDF file and search for malicious features and general structural features. The malicious features include '/JS', '/JavaScript', '/GoTo', '/GoToR', '/GoToE', '/OpenAction' and '/SubmitForm'; the general structural features include the file size and the number of indirect objects.

(2) Then analyze the JavaScript code of the PDF file and search for malicious features. The malicious features include substring, fromCharCode, stringcount, document.write, document.createElement, eval, setTimeout, eval_length and max_string.
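As a rough illustration of the kind of features step 3 collects, the following Python sketch counts the listed PDF names and JavaScript identifiers with regular expressions. It is a simplified stand-in and not Origami: it scans raw, undecoded bytes, so it would miss names hidden in compressed streams or written with hex escapes. The function and constant names are hypothetical, and the derived features stringcount, eval_length and max_string are omitted because the text does not define them.

```python
import re

STRUCT_KEYS = ["/JS", "/JavaScript", "/GoTo", "/GoToR", "/GoToE",
               "/OpenAction", "/SubmitForm"]
JS_KEYS = ["substring", "fromCharCode", "document.write",
           "document.createElement", "eval", "setTimeout"]


def structural_features(data: bytes) -> dict:
    """Count suspicious PDF names plus two general structural features."""
    feats = {
        "file_size": len(data),
        # Indirect objects start with "<number> <generation> obj".
        "indirect_objects": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
    }
    for key in STRUCT_KEYS:
        # The lookahead stops /GoTo from also matching /GoToR or /GoToE.
        pattern = re.escape(key.encode("ascii")) + rb"(?![A-Za-z0-9])"
        feats[key] = len(re.findall(pattern, data))
    return feats


def javascript_features(script: str) -> dict:
    """Count whole-word occurrences of the listed JavaScript identifiers."""
    return {key: len(re.findall(r"\b" + re.escape(key) + r"\b", script))
            for key in JS_KEYS}
```

The two dictionaries can be concatenated in a fixed key order to form the per-file feature vector that step 4 expects.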

Step 4: classify using the C5.0 decision tree algorithm.

The specific steps are as follows:

(1) S is the set of feature samples, comprising the structural feature set S_1 and the JavaScript feature set S_2. Taking the structural features as an example, the class variable C has K classes and the number of samples belonging to class C_i is freq(C_i, S_1). The information entropy Info(S_1) of the structural feature set S_1 is calculated using Formula 3:

Info(S_1) = -\sum_{i=1}^{K} \frac{freq(C_i, S_1)}{|S_1|} \log_2 \frac{freq(C_i, S_1)}{|S_1|}    (3)

where |S_1| is the number of elements in the structural feature set S_1.

(2) For a feature attribute T with N classes, the conditional entropy Info(T) of attribute T is calculated using Formula 4:

Info(T) = \sum_{i=1}^{N} \frac{|T_i|}{|S_1|} Info(T_i)    (4)

where T_i is the i-th class of the feature attribute, that is, the subset of S_1 in which T takes its i-th value.

(3) The information gain Gain(T) of the attribute variable T is calculated using Formula 5:

Gain(T) = Info(S_1) - Info(T)    (5)

(4) Nodes are generated using the information gain ratio, given by Formula 6:

GainRatio(A) = Gain(A) / Info(A)    (6)

where Gain(A) denotes the information gain of the child nodes produced by split A, and Info(A) is an index of the number of child nodes produced by split A: the more child nodes after the split, the larger Info(A).

(5) After the tree is generated, it is pruned using a tree-rule-based method.
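The split criterion in Formulas 3 to 6 can be worked through with a short Python sketch. It illustrates only the gain-ratio calculation for one discrete attribute and is not an implementation of C5.0. It assumes that Info(A) in Formula 6 is the entropy of the attribute's own value distribution (the split information used in C4.5), which matches the statement that it grows with the number of child nodes. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items) -> float:
    """Entropy in bits of the empirical distribution of items (Formula 3)."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels) -> float:
    """Gain ratio of a discrete attribute (Formulas 4 to 6)."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Formula 4: entropy of the labels after the split, weighted by subset size
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional   # Formula 5
    split_info = entropy(values)           # Info(A) in Formula 6 (assumed)
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that separates malicious from benign samples perfectly into two equal halves has a gain ratio of 1, while an attribute that takes the same value for every sample has zero split information and is scored 0 here to avoid division by zero.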

A malicious PDF detection system, comprising:

an information entropy calculation module, which converts each PDF file to be checked into a byte sequence and calculates the information entropy of each PDF file;

a screening module, which sets a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value, compares the information entropy of each PDF file with the threshold α, and treats PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;

an analysis module, which uses Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;

a classification module, which classifies using the C5.0 decision tree algorithm.

A data storage device comprising instructions which, when run on a computer, cause the computer to execute the following malicious PDF detection method:

Step 1: convert each PDF file to be checked into a byte sequence and calculate the information entropy of each PDF file;

Step 2: set a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value; compare the information entropy of each PDF file with the threshold α, treating PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;

Step 3: use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;

Step 4: classify using the C5.0 decision tree algorithm.

Preferably, step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, and then calculate the information entropy of each PDF file.

Preferably, step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, and then analyze the JavaScript code of the suspicious file and search for malicious features.

Preferably, step 4 is specifically: first represent each PDF file as a vector composed of general structural features, dynamic structural features and JavaScript features; then input the vectors and their class labels into the C5.0 decision tree for classification.

A detection program implementing the following malicious PDF detection method:

Step 1: convert each PDF file to be checked into a byte sequence and calculate the information entropy of each PDF file;

Step 2: set a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value; compare the information entropy of each PDF file with the threshold α, treating PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;

Step 3: use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;

Step 4: classify using the C5.0 decision tree algorithm.

Preferably, step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, and then calculate the information entropy of each PDF file.

Preferably, step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, and then analyze the JavaScript code of the suspicious file and search for malicious features.

Preferably, step 4 is specifically: first represent each PDF file as a vector composed of general structural features, dynamic structural features and JavaScript features; then input the vectors and their class labels into the C5.0 decision tree for classification.

Example:

To verify the effect of the method, the inventors designed corresponding experiments. On the one hand, the experiments examine the influence of different parameters on the detection performance of the model; on the other hand, the method is compared with two widely used malicious PDF detection models: the JavaScript-based detection model (PJScan) and the structural-feature-based detection model (PDFMS).

The test data set is taken from Contagiodump and contains 11,207 malicious files and 9,745 normal files; 10,310 of the malicious samples have embedded JavaScript, accounting for 92% of the malicious samples. The formal evaluation uses 10-fold cross-validation, repeated 10 times.

Comparison 1: to verify the detection performance of the invention under different attack methods, and thereby evaluate whether the malicious PDF detection method based on JavaScript and structural features under information entropy screening improves detection accuracy. The experimental results are shown in Table 1. The proposed method achieves a detection rate (TPR) of 98.73% with a false-positive rate (FPR) of 1.8%. PJScan has a detection rate of 71.94% and a false-positive rate of 1.1%. PDFMS has a detection rate of 99.55% and a false-positive rate of 2.5%. Although the false-positive rate of the proposed method is higher than that of PJScan, its detection rate is 26.79 percentage points higher than that of PJScan. The proposed method is therefore reasonable and effective: it detects malicious PDF files based on malicious JavaScript attacks and also detects malicious PDF files based on non-JavaScript attacks.

Table 1. Comparison of detection accuracy and detection time of different algorithms

Method            TPR (%)   FPR (%)   T (s)
Proposed method   98.73     1.8       1857
PJScan            71.94     1.1       2247
PDFMS             99.55     2.5       2330

Comparison 2: to verify the detection time of the invention under different attack methods, and thereby evaluate whether detecting malicious files using the JavaScript and structural features of PDF files helps reduce the time consumed by detection. Table 1 compares the detection rate (TPR), false-positive rate (FPR) and detection time (T(s)) of the proposed method with those of PDFMS and PJScan. As Table 1 shows, PDFMS takes the longest detection time, 2330 s; PJScan is in the middle at 2247 s; and the proposed method takes the least, 1857 s, which is 473 s less than PDFMS and 390 s less than PJScan. In terms of detection time, the proposed method therefore outperforms both PDFMS and PJScan.
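The detection rate (TPR) and false-positive rate (FPR) used in the comparisons follow the usual confusion-matrix definitions, sketched below for reference. The function name `rates` is hypothetical and the counts in the usage comment are made-up illustration values, not results from the experiments.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Return (TPR, FPR) from confusion-matrix counts.

    tp: malicious files flagged as malicious
    fn: malicious files missed
    fp: benign files flagged as malicious
    tn: benign files passed as benign
    """
    tpr = tp / (tp + fn)  # share of malicious files that are detected
    fpr = fp / (fp + tn)  # share of benign files that are wrongly flagged
    return tpr, fpr


# Hypothetical counts for illustration only:
# tpr, fpr = rates(tp=90, fn=10, fp=5, tn=95)
```

A higher TPR at a similar FPR is what the comparison with PJScan rests on, while the comparison with PDFMS trades a slightly lower TPR for a lower FPR and a shorter detection time.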

The basic principle of the malicious PDF detection method provided by the invention, based on JavaScript and structural features under information entropy screening, is as follows. To reduce time consumption, information entropy is first used to separate suspicious files from normal files, and only the suspicious files are examined further. To widen the detection range, both structural features and JavaScript features are extracted during detection. Finally, the C5.0 decision tree algorithm is used for classification.

The embodiments of the invention have been described in detail above, but the described content is only a preferred embodiment of the invention and shall not be regarded as limiting its scope of implementation. All equivalent changes and improvements made within the scope of the application shall remain within the scope covered by this patent.

Claims (7)

1. A malicious PDF detection method, characterized by comprising at least the following steps:
Step 1: convert each PDF file to be checked into a byte sequence and calculate the information entropy of each PDF file;
Step 2: set a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value; compare the information entropy of each PDF file with the threshold α, treating PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;
Step 3: use Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;
Step 4: classify using the C5.0 decision tree algorithm.
2. The malicious PDF detection method according to claim 1, characterized in that step 1 is specifically: first use PDFParser to convert the PDF file to be checked into a binary byte file, and then calculate the information entropy of each PDF file.
3. The malicious PDF detection method according to claim 1, characterized in that step 3 is specifically: first use Origami to analyze the structure of the suspicious file and search for malicious features and general structural features, and then analyze the JavaScript code of the suspicious file and search for malicious features.
4. The malicious PDF detection method according to claim 1, characterized in that step 4 is specifically: first represent each PDF file as a vector composed of general structural features, dynamic structural features and JavaScript features; then input the vectors and their class labels into the C5.0 decision tree for classification.
5. A malicious PDF detection system, characterized by comprising:
an information entropy calculation module, which converts each PDF file to be checked into a byte sequence and calculates the information entropy of each PDF file;
a screening module, which sets a threshold α according to the maximum, minimum and average information entropy values observed for malicious PDF files and benign PDF files, together with an empirical value, compares the information entropy of each PDF file with the threshold α, and treats PDF files whose information entropy is higher than α as normal files and PDF files whose information entropy is lower than α as suspicious files;
an analysis module, which uses Origami to analyze the suspicious files and extract the JavaScript features and structural features commonly used in malicious attacks;
a classification module, which classifies using the C5.0 decision tree algorithm.
6. A data storage device, characterized by comprising instructions which, when run on a computer, cause the computer to execute the malicious PDF detection method of any one of claims 1-4.
7. A detection program implementing the malicious PDF detection method of any one of claims 1-4.
CN201810832905.3A 2018-07-26 2018-07-26 Malice PDF detection method, system, data storage device and detection program Pending CN108959930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810832905.3A CN108959930A (en) 2018-07-26 2018-07-26 Malice PDF detection method, system, data storage device and detection program

Publications (1)

Publication Number Publication Date
CN108959930A true CN108959930A (en) 2018-12-07

Family

ID=64464972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810832905.3A Pending CN108959930A (en) 2018-07-26 2018-07-26 Malice PDF detection method, system, data storage device and detection program

Country Status (1)

Country Link
CN (1) CN108959930A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069927A (en) * 2019-04-22 2019-07-30 中国民航大学 Malice APK detection method, system, data storage device and detection program
CN110784561A (en) * 2019-09-30 2020-02-11 奇安信科技集团股份有限公司 IPv6 address segmentation method and similar site or link address set searching method
CN112231701A (en) * 2020-09-29 2021-01-15 广州威尔森信息科技有限公司 PDF file processing method and device
CN116578536A (en) * 2023-07-12 2023-08-11 北京安天网络安全技术有限公司 File detection method, storage medium and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
CN108287992A (en) * 2017-01-07 2018-07-17 长沙有干货网络技术有限公司 A kind of malicious program detection system of the computer learning based on Android

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108287992A (en) * 2017-01-07 2018-07-17 长沙有干货网络技术有限公司 A kind of malicious program detection system of the computer learning based on Android
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAVIDE MAIORCA et al.: "A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files", 2015 INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS SECURITY AND PRIVACY (ICISSP) *
HIMANSHU PAREEK et al.: "Entropy and n-gram Analysis of Malicious PDF Documents", INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY *
WU XUEFENG: "Analysis of Malicious PDF Documents", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069927A (en) * 2019-04-22 2019-07-30 中国民航大学 Malice APK detection method, system, data storage device and detection program
CN110784561A (en) * 2019-09-30 2020-02-11 奇安信科技集团股份有限公司 IPv6 address segmentation method and similar site or link address set searching method
CN112231701A (en) * 2020-09-29 2021-01-15 广州威尔森信息科技有限公司 PDF file processing method and device
CN116578536A (en) * 2023-07-12 2023-08-11 北京安天网络安全技术有限公司 File detection method, storage medium and electronic device
CN116578536B (en) * 2023-07-12 2023-09-22 北京安天网络安全技术有限公司 File detection method, storage medium and electronic device

Similar Documents

Publication Publication Date Title
Darem et al. Visualization and deep-learning-based malware variant detection using OpCode-level features
Ni et al. Malware identification using visualization images and deep learning
Fan et al. Malicious sequential pattern mining for automatic malware detection
CN109145600B (en) System and method for detecting malicious files using static analysis elements
Liu et al. ATMPA: attacking machine learning-based malware visualization detection methods via adversarial examples
Gao et al. Malware classification for the cloud via semi-supervised transfer learning
Yan et al. A survey of adversarial attack and defense methods for malware classification in cyber security
CN113935033B (en) Feature fusion malicious code family classification method, device and storage medium
Jung et al. Android malware detection using convolutional neural networks and data section images
Sun et al. An opcode sequences analysis method for unknown malware detection
Saxe et al. A deep learning approach to fast, format-agnostic detection of malicious web content
CN110633570A (en) A defense method for black-box attack oriented to the detection model of malware assembly format
Meng et al. MCSMGS: malware classification model based on deep learning
CN108959930A (en) Malice PDF detection method, system, data storage device and detection program
CN110647745A (en) Detection method of malware assembly format based on deep learning
Kakisim et al. Sequential opcode embedding-based malware detection method
Li et al. An adversarial machine learning method based on opcode n-grams feature in malware detection
CN114003910B (en) Malicious variety real-time detection method based on dynamic graph comparison learning
Manavi et al. A new method for malware detection using opcode visualization
CN112580044B (en) System and method for detecting malicious files
Lu et al. Malicious word document detection based on multi-view features learning
Nahhas et al. Android Malware Detection Using ResNet-50 Stacking
Fu et al. A hybrid approach for Android malware detection using improved multi-scale convolutional neural networks and residual networks
Liu et al. A2-CLM: Few-Shot Malware Detection Based on Adversarial Heterogeneous Graph Augmentation
Hoang et al. Detecting Malware Based on Statistics and Machine Learning Using Opcode N-Grams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181207)