CN112241530A - Malicious PDF document detection method and electronic equipment

Info

Publication number: CN112241530A
Application number: CN201910655086.4A
Authority: CN
Inventors: 祝跃飞; 芦斌; 何康; 刘龙; 林伟; 陈岩; 费金龙; 舒辉; 李红帅
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2021-01-19
Anticipated expiration: 2039-07-19
Also published as: CN112241530B

Abstract

The invention provides a detection method of a malicious PDF document and electronic equipment, wherein the detection method comprises the following steps: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure; performing feature extraction on object contents of nodes of the tree structure to obtain feature data; inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result; combining the classification result and the structural matrix into an extended matrix and inputting the extended matrix into a convolutional neural network; and the convolutional neural network outputs the detection result of the PDF document. According to the detection method of the malicious PDF document, disclosed by the embodiment of the invention, two-stage detection is carried out on the basis of the structure and the content of the PDF document, so that the accuracy and the reliability of malicious PDF detection are effectively improved.

Description

Malicious PDF document detection method and electronic equipment

Technical Field

The invention relates to the technical field of computers, in particular to a method for detecting a malicious PDF document and electronic equipment.

Background

Because the Portable Document Format (PDF) is efficient and stable, it is widely used for document exchange, and PDF files have become an important carrier for network attacks. A typical case is a phishing attack delivered by e-mail and directed at governments and large enterprises. PDF files play an increasingly important role in recent network attacks, as most mail servers block executable files attached to e-mails for security reasons. Average users consider non-executable files more secure than executable files, which reduces their suspicion of files received by e-mail. However, PDF files are as dangerous as executable files, and an attacker can gain illegal access rights to a host by exploiting vulnerabilities in the document format.

One important reason why PDF files are not secure is the rich functionality allowed by Adobe Reader (the most widely used PDF reader), especially its support for JavaScript. This support extends the capabilities of PDF documents, enabling a PDF to perform complex tasks such as form verification and calculation. However, it also gives an attacker the ability to run arbitrary code by exploiting vulnerabilities in the Adobe JavaScript engine.

Traditional malicious PDF detection algorithms are relatively subjective in feature extraction. Although the selected features correlate strongly with the classification results and the classification accuracy on a test set is high, this result rests on the premise that the probability density distribution of the test samples over the selected features is the same as that of the training set. Once attackers learn which features the classifier uses, they can modify samples and forge feature values; the premise is then broken and the classification accuracy drops rapidly.

The goal is therefore to improve the robustness of the classifier while maintaining high PDF classification accuracy. Even if an attacker knows some of the design details of the classifier, the difficulty of crafting an evasive sample is greatly increased, and the influence of adversarial attacks on the classifier is reduced.

In the related art, most detection techniques are based on signatures and rigid heuristics, so they cannot detect files that make minor modifications to existing malicious files. Machine learning methods are popular for detecting spam, malware, and network intrusions, and they can also be used to classify PDF files. Existing machine learning algorithms use static and dynamic features to train PDF classification models. The difference is that static feature vectors can be obtained directly by processing documents, whereas dynamic feature vectors are obtained by monitoring the behavior of samples running in a purpose-built virtual environment. In general, static features have difficulty coping with obfuscation and encryption, which can hide malicious code, while acquiring dynamic features requires constructing a large number of heterogeneous operating environments, which incurs a large resource overhead and is easily circumvented by time delays, interactive operations, and other techniques. These models appear excellent because they achieve high accuracy on test data sets; for example, a model using path structure features achieves over 99% accuracy in the PDF malware classification task. However, AutoEvader shows that such a detection system can be evaded 100% of the time by making small structural modifications to a malicious PDF file without damaging its malicious function. Mimicry attacks and reverse mimicry attacks against machine-learning-based classifiers are very effective.

Disclosure of Invention

The invention provides a method for detecting a malicious PDF document and electronic equipment, and aims to solve the technical problem of how to improve the accuracy of malicious PDF document detection.

The method for detecting the malicious PDF document comprises the following steps:

extracting a tree structure of a PDF document, and generating a structure matrix based on the tree structure;

performing feature extraction on the object content of the nodes of the tree structure to obtain feature data;

inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result;

combining the classification result and the structural matrix into an extended matrix and inputting the extended matrix into a convolutional neural network;

and the convolutional neural network outputs the detection result of the PDF document.

According to the detection method of the malicious PDF document, disclosed by the embodiment of the invention, two-stage detection is carried out on the basis of the structure and the content of the PDF document, so that the accuracy and the reliability of malicious PDF detection are effectively improved. The characteristic information of the PDF document in the aspect of the structure can be obtained by extracting the PDF tree structure and generating the structure matrix. By extracting the characteristics of the node contents, the characteristic information of the PDF document in the aspect of the contents can be obtained. Moreover, the feature information of the content aspect can be processed by a detection model to obtain a classification result. And finally, combining the classification result and the structural characteristics into an expansion matrix to be input into a convolutional neural network to obtain a detection result of the PDF document.

According to some embodiments of the invention, the extracting a tree structure of the PDF document and generating a structure matrix based on the tree structure comprises:

extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;

and classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into the structural matrix based on the classification result.

In some embodiments of the present invention, the extracting the feature of the object content of the node of the tree structure to obtain the feature data includes:

replacing the object content of the node according to a preset rule to obtain replacement data;

extracting the characteristics of the replacement data by adopting a language model;

the features are selected based on their frequency of occurrence to generate feature data.

According to some embodiments of the present invention, the replacing the object content of the node according to a preset rule to obtain the replacement data includes:

classifying a character set of the contents of the node;

and establishing mapping characters corresponding to each type of character set, and replacing characters in each type of character set by the mapping characters.

In some embodiments of the invention, the number of mapping characters is less than 30.

According to some embodiments of the invention, the language model is an n-gram model.

In some embodiments of the present invention, the feature data is input into a pre-constructed detection model to be processed to obtain a classification result, including:

determining the clustering number and the clustering center by adopting a clustering method;

calculating the distance between the characteristic data and the clustering center of the corresponding category;

and obtaining the classification result according to the distance.

According to some embodiments of the invention, the method for generating the pre-constructed detection model comprises:

classifying the node based on an object type of the node;

and training each type of node based on a multi-center clustering method to obtain the detection model.

According to the computer-readable storage medium of the embodiment of the present invention, the computer-readable storage medium stores an implementation program of information transfer, and the program, when executed by a processor, implements the steps of the above-mentioned method for detecting a malicious PDF document.

According to the computer-readable storage medium of the embodiment of the invention, by executing the detection method of the malicious PDF document, two-stage detection is carried out based on the structure and the content of the PDF document, and the accuracy and the reliability of malicious PDF detection are effectively improved.

An electronic device according to an embodiment of the present invention includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of detecting a malicious PDF document as described above.

According to the electronic equipment provided by the embodiment of the invention, by executing the detection method of the malicious PDF document, two-stage detection is carried out based on the structure and the content of the PDF document, so that the accuracy and the reliability of malicious PDF detection are effectively improved.

Drawings

FIG. 1 is a flowchart of a method of detecting a malicious PDF document according to an embodiment of the present invention;

FIG. 2 is a flow diagram of a method of generating a structure matrix according to an embodiment of the invention;

FIG. 3 is a flow diagram of a method of generating characterization data according to an embodiment of the present invention;

FIG. 4 is a flow diagram of a method of node character set replacement according to an embodiment of the present invention;

FIG. 5 is a flow diagram of a method of feature classification according to an embodiment of the invention;

FIG. 6 is a flow diagram of a detection model generation method according to an embodiment of the invention;

FIG. 7 is a schematic structural diagram of a malicious PDF document detection model according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a convolutional neural network algorithm according to an embodiment of the present invention.

Detailed Description

To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.

As shown in fig. 1, the method for detecting a malicious PDF document according to an embodiment of the present invention includes:

S101: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure;

It should be noted that, by extracting the tree structure of the PDF document, a structure matrix can be generated from the tree structure. Thereby, the characteristic information on the structure of the PDF document can be acquired.

S102: performing feature extraction on object contents of nodes of the tree structure to obtain feature data;

Feature data is obtained by extracting features of object contents of nodes of the tree structure, and thereby feature information on contents of the PDF document can be acquired.

S103: inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result;

By inputting the feature data into a pre-constructed detection model, the feature information of the PDF content can be classified, and a feature analysis result of the PDF content can be obtained.

S104: combining the classification result and the structural matrix into an extended matrix and inputting the extended matrix into a convolutional neural network;

That is, after the structural features and the content features of the PDF document are acquired, the structural features and the content features are combined as an input to the convolutional neural network. Therefore, the accuracy and reliability of detection of the malicious PDF document can be improved.

S105: and the convolutional neural network outputs the detection result of the PDF document.

In the present application, the execution sequence of the steps S101 to S105 is not limited. That is, in the present application, it is not necessary to execute the operations in order of S101 to S105.

As shown in fig. 2, according to some embodiments of the present invention, extracting a tree structure of a PDF document and generating a structure matrix based on the tree structure includes:

S201: extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;

S202: classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into a structural matrix based on the classification result.

Note that a PDF file has a tree-like logical structure, which is constituted by the relationships between various basic objects. The related art focuses on structural path features, i.e., the vertical relationship from the root node to the leaf nodes, and ignores the horizontal relationship, so the parallel relationship between nodes sharing the same parent or ancestor node is lost. The invention considers both vertical and horizontal connections. The structure of a PDF file can be described by an adjacency matrix, which is a classical tool for describing graphs.

To extract local features of the structural matrix in the horizontal and vertical directions, we apply a convolutional neural network (CNN) to the classification of PDF files. CNNs achieve state-of-the-art performance in image classification and can extract local features of an image using convolution kernels. Furthermore, many techniques can be applied to enhance the robustness of the classifier, and the nature of the PDF format limits the degrees of freedom of the elements in the structural matrix, which makes the classifier more robust. Because the structural matrix and an image are both two-dimensional arrays, the convolutional neural network can capture the relations of the file structure in the horizontal and vertical directions.

The node objects of the PDF document include a plurality of types such as a font object, a page object, and the like. After the adjacency matrix is generated based on the tree structure, classification and combination can be performed according to the object types of the nodes, and the same and similar objects are combined into the same class. Similar objects as described herein may be understood as functionally similar objects, e.g., objects of different fonts may all be grouped into the same class. After the classification and merging processes, the dimension reduction can be performed on the adjacency matrix to obtain a structural matrix.

The extraction of the structural matrix is shown by the following Algorithm 1:

Algorithm 1: Structural feature extraction

The PDF structure is first represented as an adjacency matrix, then types with similar functionality are merged into one, and types with low frequencies are filtered out. Finally, the structural matrix is formed from the Cartesian product of the selected types. Therefore, the data processing amount of PDF document detection can be reduced, and the detection efficiency is improved.
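The steps described for Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the listing from the original filing: the merge table `TYPE_MERGE`, the function name `structure_matrix`, the `min_count` filter, and the choice to store parent-to-child edge counts in each cell are all hypothetical.

```python
from collections import Counter

# Hypothetical merge table: raw object type -> merged class (an assumption for illustration).
TYPE_MERGE = {
    "Catalog": "Catalog",
    "Pages": "Page",
    "Page": "Page",
    "Font": "Font",
    "FontDescriptor": "Font",
    "Action": "Action",
    "JavaScript": "Action",
}


def structure_matrix(node_types, edges, min_count=1):
    """Collapse a node-level tree into a type-by-type structure matrix.

    node_types: dict mapping node id -> raw object type
    edges: iterable of (parent_id, child_id) pairs of the tree
    min_count: merged types seen on fewer nodes than this are filtered out
    """
    # Merge functionally similar types into one class.
    merged = {node: TYPE_MERGE.get(raw, "Other") for node, raw in node_types.items()}
    # Filter out low-frequency types.
    counts = Counter(merged.values())
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    index = {t: i for i, t in enumerate(kept)}
    # The matrix is indexed by the Cartesian product of the kept types.
    matrix = [[0] * len(kept) for _ in kept]
    for parent, child in edges:
        tp, tc = merged[parent], merged[child]
        if tp in index and tc in index:
            matrix[index[tp]][index[tc]] += 1
    return kept, matrix


node_types = {1: "Catalog", 2: "Pages", 3: "Page", 4: "Font", 5: "FontDescriptor", 6: "JavaScript"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 6)]
types, matrix = structure_matrix(node_types, edges)
```

Because the matrix size depends only on the number of kept types, documents with very different object counts map to inputs of the same shape.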

In some embodiments of the present invention, as shown in fig. 3, performing feature extraction on object contents of nodes in a tree structure to obtain feature data includes:

S301: replacing the object content of the node according to a preset rule to obtain replacement data.

It should be noted that, after the contents of the object are decrypted and decompressed, the contents of the node object can be obtained, and the contents of the object are replaced, so that the dimension of the data can be effectively reduced, which is beneficial to reducing the calculation amount of the PDF detection method and improving the detection efficiency of the PDF detection method.

S302: extracting the characteristics of the replacement data by adopting a language model;

For example, in some embodiments of the present invention, an n-gram model may be employed to perform feature extraction on the replacement data. The n-gram model is a mature language model in the field, and its specific implementation is not described herein again.

S303: the features are selected based on their frequency of occurrence to generate feature data.

It should be noted that the features may be selected according to the frequency of occurrence of the features. For example, a feature whose frequency of occurrence exceeds a threshold may be selected as the feature data. Thus, the feature data can be reduced in dimension.

For example, the following algorithm 2 can be adopted for feature extraction of JavaScript codes in PDF documents:

Algorithm 2: Feature extraction algorithm for JavaScript

The JavaScript code is extracted from the sample and first processed according to the replacement rules. Next, an n-gram method is applied to the replacement sequence, and features for classification are generated. A threshold is then set to filter out features that occur infrequently in the training dataset. The specific method is as shown in the above algorithm. The selected features are then used to train a random deep forest and a support vector machine model to obtain a classifier model with high classification accuracy.
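The n-gram extraction and frequency filtering described above can be illustrated with a short sketch. It assumes the input sequences have already been through character replacement, and it interprets "frequency" as the number of training sequences containing an n-gram; that interpretation, and the names `ngrams`, `select_features`, `vectorize`, and `min_docs`, are assumptions made for illustration.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def select_features(training_sequences, n=3, min_docs=2):
    """Keep n-grams that appear in at least `min_docs` training sequences."""
    doc_freq = Counter()
    for seq in training_sequences:
        doc_freq.update(set(ngrams(seq, n)))
    return sorted(g for g, c in doc_freq.items() if c >= min_docs)


def vectorize(sequence, vocabulary, n=3):
    """Count how often each selected n-gram occurs in one sequence."""
    counts = Counter(ngrams(sequence, n))
    return [counts[g] for g in vocabulary]


# Toy replacement sequences (letters follow the replacement table in this document).
training = ["BBDLBLD", "BBDCD", "IDBGCDF"]
vocab = select_features(training, n=2, min_docs=2)
vector = vectorize("BBDBBD", vocab, n=2)
```

The resulting count vectors could then be passed to any supervised classifier; the choice of classifier is outside this sketch.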

According to some embodiments of the present invention, as shown in fig. 4, replacing the object content of the node according to a preset rule to obtain replacement data includes:

S401: classifying a character set of contents of the nodes;

S402: establishing mapping characters corresponding to each type of character set, and replacing the characters in each type of character set with the mapping characters.

In some embodiments of the invention, the number of mapping characters is less than 30. Thus, through substitution, the number of character types can be made smaller than 30, and the features become less sensitive to changes caused by code obfuscation, which not only improves robustness but also reduces the difficulty of feature dimension reduction.

It should be noted that, in the related art, n-gram analysis is directly performed on the byte sequence of the malware, but the application of the n-gram is meaningless due to the complex file format and coding. As n increases, the feature size explodes rapidly. For example, when n is 3, there are over two million features, which brings difficulties to feature selection and dimensionality reduction.

In addition, if the above method is applied, modifying one character is likely to change many features, which increases the sensitivity of the feature vector values and reduces feature stability and robustness; the trained classifier model is therefore easily evaded by simple code obfuscation.

In the present invention, in order to reduce the character space and the influence of code obfuscation, the character set is classified and each character is replaced with a character representing its type. The visible ASCII character set is classified, and a mapping is established that reduces the feature space of the visible character set from 128 to below 30.

For example, in the present application, the following replacement rules may be adopted to replace the character set of the content of the JavaScript code in the node:

Type	Example	Replaced with
Blank character	\n \r \t	none
Capital letter	A-Z	A
Lowercase letter	a-z	B
Number	0-9	C
Parenthesis	( )	D
Square bracket	[ ]	E
Brace	{ }	F
Comparison	> < <= >= ==	G
Separator	, . : ;	H
Keyword	if else while for	I
Operation	+ - += -= = …	J
Logical operation	&& || and or	K
Quotation mark	' "	L

It can be understood that, by adopting the above replacement rule to replace the contents of the node, the purposes of reducing dimension and reducing the sensitivity of the change of the characteristic value caused by the change of a single character can be achieved.
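A minimal sketch of how such a replacement rule could be applied is given below. The rule order (keywords and multi-character operators before single characters), the decision to leave characters outside the table unchanged, and the names `RULES`, `TOKEN`, and `replace_chars` are assumptions made for illustration; only the mapping letters come from the table.

```python
import re

# (pattern, mapping character) pairs following the replacement table.
# Longer tokens are listed before the single characters they contain.
RULES = [
    (r"\s+", ""),                          # blank characters are dropped
    (r"\b(?:if|else|while|for)\b", "I"),   # keywords
    (r"&&|\|\||\b(?:and|or)\b", "K"),      # logical operations
    (r"<=|>=|==|<|>", "G"),                # comparison
    (r"\+=|-=|\+|-|=", "J"),               # operations
    (r"[A-Z]", "A"),                       # capital letters
    (r"[a-z]", "B"),                       # lowercase letters
    (r"[0-9]", "C"),                       # numbers
    (r"[()]", "D"),                        # parentheses
    (r"[\[\]]", "E"),                      # square brackets
    (r"[{}]", "F"),                        # braces
    (r"[,.:;]", "H"),                      # separators
    (r"['\"]", "L"),                       # quotation marks
]

# One alternation with a named group per rule, so a single pass decides each token.
TOKEN = re.compile("|".join(f"(?P<r{i}>{p})" for i, (p, _) in enumerate(RULES)))


def replace_chars(code):
    """Map each token to its class character; characters outside the table are kept."""
    return TOKEN.sub(lambda m: RULES[int(m.lastgroup[1:])][1], code)


print(replace_chars('if (x >= 10) { y += "a"; }'))  # prints IDBGCCDFBJLBLHF
```

Renaming a variable or changing a literal, as in `var a=1;` versus `var b=2;`, yields the same replacement sequence, which illustrates why the features are less sensitive to single-character changes.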

In some embodiments of the present invention, as shown in fig. 5, inputting the feature data into a pre-constructed detection model for processing to obtain a classification result includes:

S501: determining the clustering number and the clustering center by adopting a clustering method;

It should be noted that the clustering method may adopt a multi-center clustering method mature in the art, and the clustering number and the clustering center may be determined by the clustering method. The specific implementation process is a conventional technical means in the art, and is not described herein again.

S502: calculating the distance between the characteristic data and the clustering center of the corresponding category;

S503: obtaining a classification result according to the distance.

That is, after the number of clusters and the cluster centers are acquired, the classification result of the feature data can be obtained by calculating the distance between the feature data and the cluster center of the category to which it belongs. Therefore, the robustness of the classifier can be improved on the premise of high PDF classification accuracy. Even if an attacker knows some of the design details of the classifier, the difficulty of crafting an evasive sample is greatly increased, and the influence of adversarial attacks on the classifier is reduced.

According to some embodiments of the invention, as shown in fig. 6, a method for generating a pre-constructed detection model includes:

S601: classifying the nodes based on their object types;

S602: training each type of node based on a multi-center clustering method to obtain a detection model.

It is noted that the objects are extracted after decoding and decryption and are first organized individually according to their type (e.g., /Catalog, /Action, etc.). Unlike the case of JavaScript code, it is difficult to determine whether malicious code is hidden in these objects, because the number of malicious objects is very small compared to the number of objects in the entire file, and a large amount of manual identification would be required to locate the malicious objects. Therefore, feature extraction is only applied to benign datasets. The feature extraction process for each type of object is shown in the following Algorithm 3:

Algorithm 3: Feature extraction algorithm for different types of objects

The basic steps are essentially the same as in algorithm 1 above, the difference being that it runs only on benign sample sets. Further, in accordance with the guiding principles of feature selection, features such as entropy are added for each type as redundant features for verification.

Using the extracted features, an anomaly detection model based on a multi-center clustering algorithm is trained for each type of object. Since object content varies according to its function, a multi-center clustering model must be trained instead of a one-class support vector machine (OSVM). The feature vectors are clustered using an algorithm such as K-means, and the distance from each vector to the center of the cluster to which it belongs is calculated. A quantile of these distances is then chosen as the threshold for detecting outliers.
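The clustering-based anomaly detection described above can be sketched in plain Python. This is an illustrative sketch only: the deterministic farthest-point initialisation, the default quantile of 0.95, and the names `kmeans`, `fit_detector`, and `is_anomalous` are assumptions, and a real system would train one such detector per object type.

```python
import math


def dist(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Plain K-means with farthest-point initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def fit_detector(benign_vectors, k=2, quantile=0.95):
    """Train on benign data only; the threshold is a quantile of the distances."""
    centers = kmeans(benign_vectors, k)
    distances = sorted(min(dist(v, c) for c in centers) for v in benign_vectors)
    idx = min(len(distances) - 1, int(quantile * len(distances)))
    return centers, distances[idx]


def is_anomalous(vector, centers, threshold):
    """Flag a vector whose distance to its nearest center exceeds the threshold."""
    return min(dist(vector, c) for c in centers) > threshold


benign = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
centers, threshold = fit_detector(benign, k=2)
```

With two well-separated benign groups, a point lying between them such as `[2.5, 2.5]` is far from both centers and is flagged, whereas a single-center model would place it near the overall mean.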

It should be noted that classification models for traditional tasks, such as image recognition and credit evaluation, make the default assumption that the training data and the actual data share approximately the same probability density distribution over the selected features. This assumption is easily satisfied because the training data is collected from the real world and the concept drift is small. However, because of the arms race between attackers and defenders, the situation changes when machine learning is applied to network security tasks such as malicious document classification. An attacker will manipulate a malicious sample so that it approximates a benign sample under the selected features without affecting its malicious functionality. As a result, the classifier is less robust, and its accuracy drops when the data set changes.

The invention proposes three guiding principles of feature selection:

Causality: Causality measures the relationship between class labels and selected features. In general, features with high correlation are preferred in the training process. This is not a problem in ordinary tasks, since in the absence of adversarial attacks highly correlated features help the classifier build a high-precision classification model. However, some features are highly correlated with class labels yet only weakly causal. For example, the number of people drowning in swimming pools is highly correlated with average ice cream consumption, but the causality between them is low because both are caused by high temperatures. In the field of network security, features such as structural paths and metadata selected by classification systems like PDFrate and Hidost are highly correlated with, but only weakly causal to, class labels. Manual analysis finds that these features are not necessarily related to the maliciousness of the PDF. These methods achieve accuracy above 99 percent, but under the EvadeML attack their accuracy rapidly drops to nearly 0 percent. Features such as shellcode, heap spray, and JavaScript obfuscation are highly causal to the maliciousness of the sample, as they are essential to implementing the malicious function. Finding features that are highly causal to class labels can be difficult, but deleting features with low causality is relatively simple.

Collision resistance: An attacker attempts to modify a malicious sample so that its features are close to those of a benign sample in order to evade the PDF malware classifier. To increase the attacker's cost, we tend to choose features that are difficult to forge, a property called collision resistance. In cryptography, collision resistance means that, given a one-way hash function f(x) and a message m, it is difficult to find another message n that satisfies f(m) = f(n). This concept is incorporated into feature selection here: given a neighborhood δ, a feature extraction function f(x), and a benign sample b, it is difficult to find a malicious sample m that satisfies d(f(m), f(b)) < δ, where d(·, ·) is a distance metric between vectors, such as L1, L2, or L∞. High collision resistance requires a one-way feature extraction function: feature vectors are easily obtained from PDF samples, but the corresponding content is difficult to recover from the feature vectors.

Redundancy: When facing high-dimensional feature data, conventional machine learning algorithms tend to eliminate redundant features and retain relatively independent features through PCA (principal component analysis) and other dimension reduction methods prior to training. High-dimensional data leads to data sparsity and increases overfitting during training, which negatively impacts generalization ability. However, an attacker always tries to obtain information about the classifier and modify the malicious sample accordingly. Assuming that an attacker cannot obtain all feature information, the present invention adds additional features, referred to as redundant features, to detect whether there is a potential attack on the classifier. The intuition behind redundant features comes from the cyclic redundancy check (CRC) in the field of data communications. The CRC is a data transmission error detection function that performs a polynomial calculation on the data and appends the result to the frame. The receiving device performs the same calculation to check whether the data has been modified, ensuring integrity. In the field of feature selection, the present invention proposes the concept of feature redundancy: a polynomial calculation over other (some of the) feature values is treated as an additional feature of the existing feature set. When only some of the feature values are disguised, the redundant features show a large difference. Feature redundancy provides verification of the original features, thus increasing the computational complexity for an attacker who cannot know all of the features.
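The redundancy principle can be illustrated with a small sketch in the spirit of a CRC. The specific polynomial (Horner evaluation modulo a prime) and the names `poly_check`, `extend`, `base`, and `modulus` are assumptions chosen for illustration; the source does not specify which polynomial calculation is used.

```python
def poly_check(values, base=31, modulus=1_000_003):
    """Polynomial combination of feature values, evaluated with Horner's rule.

    `base` and `modulus` are illustrative constants, not values from the source.
    """
    acc = 0
    for v in values:
        acc = (acc * base + int(v)) % modulus
    return acc


def extend(features):
    """Append the redundant feature to the original feature vector."""
    return list(features) + [poly_check(features)]


benign = extend([12, 4, 0, 7])
# An attacker matches the feature values it knows about, but one value still differs.
forged = extend([12, 4, 3, 7])
```

Here the two vectors agree on the features the attacker controlled, yet their redundant features differ, so a model trained on the extended vectors has an extra signal that the attacker must also reproduce.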

In addition, in the present application, the JavaScript code may be directly detected, including:

replacing the code content according to rules;

extracting features by using an n-gram method, and selecting and generating the features according to the guiding principles;

training a classification model based on a supervised machine learning algorithm. For malicious JavaScript code that can be effectively extracted, the accuracy can reach almost 100%. A classifier using content-based features can also effectively distinguish the types of vulnerabilities exploited.

Therefore, by providing three guiding principles of feature selection, making full use of structural features and content features and of the local correlation of the horizontal and vertical relations of different types of features, and training classifiers based on a two-stage machine learning algorithm and on JavaScript code, the robustness of the classifier is enhanced while higher accuracy is maintained.

An electronic device according to an embodiment of the present invention includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method for detecting a malicious PDF document as described above.

According to the electronic device of the embodiment of the invention, by adopting the method for detecting the malicious PDF document, the method comprises the following steps:

as shown in fig. 7, in the first stage, features are extracted by using an n-gram method, feature selection and feature generation are performed according to the guiding principles, and the structural features and the clustering features are used for first-stage training;

training an anomaly detection model based on a multi-center clustering method for each type of object;

merging the tree structures according to the similarity of the types and generating a structure matrix;

in the second stage, the trained first-stage model is first applied to the input data set, and its output is then combined with the structure matrix to form an extended structure matrix, which serves as the input to the CNN. The CNN is able to capture local features and to preserve the connections between different types encoded in the extended matrix; the specific CNN structure used is shown in fig. 8. For training, the classification results produced by the first-stage model on the training data are combined with the structure matrix into the extended matrix, which is then used as the input for training the convolutional neural network model.
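The data flow of the two stages can be sketched as below. Several details are assumptions, since this section does not fix them: the first-stage decision is taken to be a threshold on the distance to the nearest cluster center, the structure matrix is taken to be a type-by-type count of parent-child edges, and the extended matrix is formed by appending the first-stage label of each type as an extra column. All names and the toy object types are hypothetical, and the CNN itself is not shown.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_center_distance(x: Vector, centers: Sequence[Vector]) -> float:
    """Distance from a feature vector to the closest cluster center."""
    return min(math.dist(x, c) for c in centers)


def first_stage_label(x: Vector, centers: Sequence[Vector], threshold: float) -> int:
    """Return 1 (anomalous) if x is far from every center, otherwise 0."""
    return 1 if nearest_center_distance(x, centers) > threshold else 0


def structure_matrix(edges: Sequence[Tuple[str, str]], types: Sequence[str]) -> List[List[int]]:
    """Count parent-to-child edges between object types (assumed encoding)."""
    index = {t: i for i, t in enumerate(types)}
    matrix = [[0] * len(types) for _ in types]
    for parent, child in edges:
        matrix[index[parent]][index[child]] += 1
    return matrix


def extended_matrix(structure: List[List[int]], labels: Sequence[int]) -> List[List[int]]:
    """Append each type's first-stage label as an extra column (assumed layout)."""
    return [row + [label] for row, label in zip(structure, labels)]


if __name__ == "__main__":
    types = ["Catalog", "Pages", "JavaScript"]
    edges = [("Catalog", "Pages"), ("Catalog", "JavaScript")]
    centers = [(0.0, 0.0), (1.0, 1.0)]       # toy cluster centers
    vectors = [(0.1, 0.0), (0.9, 1.0), (5.0, 5.0)]
    labels = [first_stage_label(v, centers, threshold=0.5) for v in vectors]
    for row in extended_matrix(structure_matrix(edges, types), labels):
        print(row)
```

Keeping the first-stage output next to the structural counts in one matrix is what lets a convolution see content and structure evidence for neighboring types together.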

This method improves the accuracy of identifying PDF documents in general. To make further use of the content feature information, a classifier is trained on a dataset labeled with vulnerability numbers to classify the vulnerabilities exploited by malicious files. Only a small fraction of the samples is used for training, and the classifier is tested on the entire data set. The test results show that the identification accuracy of the classifier exceeds 97%.

According to the computer-readable storage medium of the embodiment of the invention, the computer-readable storage medium stores an implementation program of information transmission, and the program realizes the steps of the above-mentioned malicious PDF document detection method when executed by the processor.

While the invention has been described in connection with specific embodiments thereof, it is to be understood from the appended drawings and description that the invention may be embodied in other specific forms without departing from the spirit or scope of the invention.

Claims

1. A method for detecting a malicious PDF document is characterized by comprising the following steps:

2. The method for detecting the malicious PDF document according to claim 1, wherein the extracting a tree structure of the PDF document and generating a structure matrix based on the tree structure comprises:

3. The method for detecting a malicious PDF document according to claim 1, wherein said extracting the features of the object contents of the nodes of the tree structure to obtain feature data comprises:

4. The method for detecting the malicious PDF document according to claim 3, wherein the step of replacing the object content of the node according to a preset rule to obtain the replacement data comprises:

classifying a character set of the contents of the node;

5. The method for detecting the malicious PDF document according to claim 4, wherein the number of the mapping characters is less than 30.

6. The method according to claim 3, wherein the language model is an n-gram model.

7. The method for detecting the malicious PDF document according to claim 1, wherein the feature data is input into a detection model which is constructed in advance and processed to obtain a classification result, and the method comprises the following steps:

and obtaining the classification result according to the distance.

8. The method for detecting the malicious PDF document according to claim 1, wherein the method for generating the pre-constructed detection model comprises the following steps:

classifying the node based on an object type of the node;

9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an information transfer implementing program, which when executed by a processor implements the steps of the method for detecting a malicious PDF document according to any one of claims 1 to 8.

10. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of detecting a malicious PDF document according to any one of claims 1 to 8.