CN112241530B

CN112241530B - Malicious PDF document detection method and electronic equipment

Info

Publication number: CN112241530B
Application number: CN201910655086.4A
Authority: CN
Inventors: 祝跃飞; 芦斌; 何康; 刘龙; 林伟; 陈岩; 费金龙; 舒辉; 李红帅
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2023-05-30
Anticipated expiration: 2039-07-19
Also published as: CN112241530A

Abstract

The invention provides a detection method of a malicious PDF document and electronic equipment, wherein the detection method comprises the following steps: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure; extracting the characteristics of the object content of the nodes of the tree structure to obtain characteristic data; inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result; merging the classification result and the structure matrix into an expansion matrix and inputting the expansion matrix into a convolutional neural network; the convolutional neural network outputs the detection result of the PDF document. According to the detection method of the malicious PDF document, disclosed by the embodiment of the invention, the detection of two stages is performed based on the structure and the content of the PDF document, so that the accuracy and the reliability of the detection of the malicious PDF are effectively improved.

Description

Malicious PDF document detection method and electronic equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method for detecting a malicious PDF document and an electronic device.

Background

Because Portable Document Format (PDF) is widely used for document exchange owing to its efficiency and stability, PDF files have become an important carrier for network attacks. A typical scenario is an email phishing attack against governments and large businesses. PDF files play an increasingly important role in recent network attacks because most mail servers block executable files attached to email for security reasons. The average user considers non-executable files safer than executable files, which reduces suspicion toward files received by email. However, PDF files are as dangerous as executable files: an attacker can gain unauthorized access to a host by exploiting vulnerabilities in the document format.

An important reason PDF files are unsafe is the rich functionality allowed by Adobe Reader (the most widely used PDF reader), especially its support for JavaScript. This support extends what a PDF document can do, enabling complex tasks such as form validation and computation. However, it also gives an attacker the ability to run arbitrary code by exploiting vulnerabilities in the Adobe JavaScript engine.

Traditional malicious PDF detection algorithms are relatively subjective in feature extraction. The selected features correlate strongly with the classification result, so classification accuracy on the test set is high, but this result presupposes that test samples follow the same probability density distribution over the selected features as the training set. If an attacker knows the features used by the classifier, the attacker can modify samples and falsify feature values; once the assumption is broken, classification accuracy drops rapidly.

The goal is therefore to improve the robustness of the classifier while maintaining high PDF classification accuracy. Even if an attacker learns part of the classifier's design details, the difficulty of crafting evasive samples is greatly increased, and the influence of adversarial attacks on the classifier is reduced.

In the related art, most detection techniques are based on signatures and rigid heuristics. Therefore, they cannot detect files that make minor modifications to existing malicious files. Machine learning methods are popular in detecting spam, malware, and network intrusions, and can also be used to classify PDF files. Existing machine learning algorithms employ static and dynamic features to train PDF classification models. The difference is that static feature vectors can be obtained directly by processing the document, while dynamic feature vectors are obtained by monitoring the behavior of samples running in a constructed virtual environment. In general, static features have the disadvantage that they have difficulty detecting obfuscated, encrypted, or deeply hidden malicious code, while acquiring dynamic features requires constructing a large number of heterogeneous operating environments, which incurs a large resource overhead and is easily circumvented by time delays, interactive operations, and other techniques. These models appear excellent because they achieve high accuracy on the test dataset. A model using structural path features achieves greater than 99% accuracy in PDF malware classification tasks. However, autoEvader shows that such a detection system can be evaded 100% of the time by making minor structural modifications to a malicious PDF file without damaging its malicious function. Mimicry and reverse mimicry attacks against machine-learning-based classifiers are very effective.

Disclosure of Invention

The invention provides a malicious PDF document detection method and electronic equipment, and aims to solve the technical problem of improving the accuracy of malicious PDF document detection.

The detection method for the malicious PDF document comprises the following steps:

extracting a tree structure of a PDF document, and generating a structure matrix based on the tree structure;

extracting the characteristics of the object content of the nodes of the tree structure to obtain characteristic data;

inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result;

combining the classification result and the structure matrix into an expansion matrix and inputting the expansion matrix into a convolutional neural network;

and the convolutional neural network outputs the detection result of the PDF document.

According to the detection method of the malicious PDF document of the embodiment of the invention, two-stage detection is performed based on the structure and the content of the PDF document, so that the accuracy and the reliability of malicious PDF detection are effectively improved. Structural feature information of the PDF document can be obtained by extracting the PDF tree structure and generating a structure matrix. Content feature information of the PDF document can be obtained by extracting features of the node content. Moreover, the content feature information can be processed by a detection model to obtain a classification result. Finally, the classification result and the structural features are merged into an expansion matrix, which is input into a convolutional neural network to obtain the detection result of the PDF document.

According to some embodiments of the invention, the extracting the tree structure of the PDF document and generating the structure matrix based on the tree structure includes:

extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;

classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into the structure matrix based on classification results.

In some embodiments of the present invention, performing feature extraction on the object content of the nodes of the tree structure to obtain feature data includes:

replacing the object content of the node according to a preset rule to obtain replacement data;

extracting features of the replacement data by adopting a language model;

and selecting the features based on the occurrence frequency of the features to generate feature data.

According to some embodiments of the invention, the replacing the object content of the node according to a preset rule to obtain replacement data includes:

classifying a character set of the content of the node;

and establishing mapping characters corresponding to each type of character set, and replacing the characters in each type of character set by adopting the mapping characters.

In some embodiments of the invention, the number of mapping characters is less than 30.

According to some embodiments of the invention, the language model is an n-gram model.

In some embodiments of the present invention, the feature data is input into a pre-constructed detection model for processing to obtain a classification result, including:

determining the clustering number and the clustering center by adopting a clustering method;

calculating the distance between the characteristic data and the clustering center of the corresponding category;

and obtaining the classification result according to the distance.

According to some embodiments of the invention, the method for generating the pre-constructed detection model includes:

classifying the nodes based on their object types;

and training each type of node based on a multi-center clustering method to obtain the detection model.

According to the computer readable storage medium of the embodiment of the invention, an information transmission implementation program is stored on the computer readable storage medium, and the steps of the malicious PDF document detection method are implemented when the program is executed by a processor.

According to the computer readable storage medium, the detection method of the malicious PDF document is executed, and two-stage detection is carried out based on the structure and the content of the PDF document, so that the accuracy and the reliability of the detection of the malicious PDF are effectively improved.

An electronic device according to an embodiment of the present invention includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for detecting the malicious PDF document.

According to the electronic equipment provided by the embodiment of the invention, the detection method of the malicious PDF document is executed, and the detection of two stages is carried out based on the structure and the content of the PDF document, so that the accuracy and the reliability of the detection of the malicious PDF are effectively improved.

Drawings

FIG. 1 is a flow chart of a method of detecting a malicious PDF document according to an embodiment of the invention;

FIG. 2 is a flow chart of a method of generating a structural matrix according to an embodiment of the present invention;

FIG. 3 is a flow chart of a method of generating feature data according to an embodiment of the invention;

FIG. 4 is a flow chart of a method of node character set replacement according to an embodiment of the invention;

FIG. 5 is a flow chart of a method of feature classification according to an embodiment of the invention;

FIG. 6 is a flow chart of a detection model generation method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a malicious PDF document detection model according to an embodiment of the invention;

FIG. 8 is a schematic diagram of a convolutional neural network algorithm according to an embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects adopted by the present invention for achieving the intended purpose, the following detailed description of the present invention is given with reference to the accompanying drawings and preferred embodiments.

As shown in fig. 1, a method for detecting a malicious PDF document according to an embodiment of the present invention includes:

S101: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure;

It should be noted that, by extracting the tree structure of the PDF document, a structure matrix may be generated from the tree structure. Thus, feature information on the structural aspects of the PDF document can be acquired.

S102: extracting the features of the object content of the nodes of the tree structure to obtain feature data;

The feature data is obtained by extracting the features of the object content of the nodes of the tree structure; thus, feature information on the content of the PDF document can be obtained.

S103: inputting the feature data into a pre-constructed detection model for processing to obtain a classification result;

The feature data is input into a pre-constructed detection model for processing, so that the feature information on the PDF content can be classified and a feature analysis result for the PDF content can be obtained.

S104: merging the classification result and the structure matrix into an expansion matrix and inputting the expansion matrix into a convolutional neural network;

That is, after the structural features and the content features of the PDF document are acquired, they are combined as the input to the convolutional neural network. Thus, the accuracy and reliability of detection of malicious PDF documents can be improved.

S105: the convolutional neural network outputs the detection result of the PDF document.

The execution order of steps S101 to S105 is not limited in this application; that is, the steps need not be performed sequentially in the order S101 to S105.

As shown in fig. 2, according to some embodiments of the present invention, extracting a tree structure of a PDF document and generating a structure matrix based on the tree structure includes:

S201: extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;

S202: classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into a structure matrix based on the classification result.

A PDF file has a logical tree structure composed of the relationships between its basic objects. The related art attends to structural path features, i.e., the vertical relationship from the root node to the leaf nodes, but not to horizontal relationships, so the parallel relationship between nodes sharing the same parent or ancestor node is lost. The present invention takes both vertical and horizontal connections into account. The structure of a PDF file can be described by an adjacency matrix, which is a classical tool for describing a graph.
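As a minimal sketch of the adjacency-matrix representation described above, the following Python builds a 0/1 matrix from a parent-to-children mapping. The `tree` dictionary and the object names in it are hypothetical; in practice they would come from a PDF parser, which is not shown here.

```python
def adjacency_matrix(tree):
    """Build an adjacency matrix from a {parent: [children]} mapping.

    Returns (nodes, matrix) where matrix[i][j] == 1 if nodes[i] refers to nodes[j].
    """
    nodes = sorted(set(tree) | {c for kids in tree.values() for c in kids})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for parent, kids in tree.items():
        for child in kids:
            matrix[index[parent]][index[child]] = 1
    return nodes, matrix


# Hypothetical toy document tree (names are illustrative only).
tree = {
    "Catalog": ["Pages", "OpenAction"],
    "Pages": ["Page1", "Page2"],
    "OpenAction": ["JavaScript"],
}
nodes, matrix = adjacency_matrix(tree)
```

In this layout, sibling nodes such as `Page1` and `Page2` appear as neighbouring entries in the row of their parent, so the horizontal relationship is kept alongside the vertical one.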

To extract local features of the structural matrix in the horizontal and vertical directions, we apply Convolutional Neural Networks (CNNs) to the classification of PDF files. CNNs achieve state-of-the-art performance in image classification, and convolution kernels can be used to extract local features of images. In addition, many techniques are applied to enhance the robustness of the classifier, and the nature of the PDF format limits the degree of freedom of the elements in the structural matrix, which makes the classifier more robust. Both the structural matrix and an image are two-dimensional arrays, so, owing to this similarity, convolutional neural networks can also capture the relationships of the file structure in the horizontal and vertical directions.

The node objects of a PDF document are of many types, such as font objects and page objects. After the adjacency matrix is generated based on the tree structure, its entries can be classified and combined according to the object types of the nodes, with identical and similar objects merged into the same class. Similar objects here may be understood as functionally similar objects; e.g., objects of different fonts may all be placed in the same class. After classification and merging, the adjacency matrix is reduced in dimension to obtain the structural matrix.

The extraction of the structural matrix is shown in the following Algorithm 1:

Algorithm 1: Extraction of structural features

The PDF structure is first represented as an adjacency matrix, then types with similar functions are merged into one, and then types with low frequencies are filtered out. Finally, the Cartesian product of the selected types forms the structural matrix. Thus, the amount of data processed in PDF document detection can be reduced, and the detection efficiency can be improved.
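The merge, filter, and Cartesian-product steps described above might look like the following Python sketch. It is not the patent's Algorithm 1: the function name, the `merge_map` contents, the `min_count` cutoff, and the choice to store edge counts (rather than 0/1 flags) in each cell are all assumptions made for illustration.

```python
from collections import Counter


def structure_matrix(node_types, edges, merge_map, min_count=1):
    """Collapse an object-level graph into a type-by-type structure matrix.

    node_types: {node_id: raw_type}
    edges:      iterable of (parent_id, child_id)
    merge_map:  {raw_type: merged_type} for functionally similar types
    min_count:  merged types with fewer nodes than this are dropped
    """
    merged = {n: merge_map.get(t, t) for n, t in node_types.items()}
    counts = Counter(merged.values())
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    index = {t: i for i, t in enumerate(kept)}
    # One cell per (parent type, child type) pair: the Cartesian product.
    matrix = [[0] * len(kept) for _ in kept]
    for parent, child in edges:
        p, c = merged[parent], merged[child]
        if p in index and c in index:
            matrix[index[p]][index[c]] += 1
    return kept, matrix


# Hypothetical object graph: two font subtypes are merged into "Font".
node_types = {1: "Catalog", 2: "Pages", 3: "Page", 4: "Page",
              5: "TrueTypeFont", 6: "Type1Font", 7: "Metadata"}
edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 6), (1, 7)]
merge_map = {"TrueTypeFont": "Font", "Type1Font": "Font"}
types, matrix = structure_matrix(node_types, edges, merge_map, min_count=2)
```

With `min_count=2`, only the `Font` and `Page` types survive the frequency filter, so seven objects collapse to a 2 x 2 matrix.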

In some embodiments of the present invention, as shown in fig. 3, feature extraction is performed on object content of a node of a tree structure to obtain feature data, including:

S301: replacing the object content of the node according to a preset rule to obtain replacement data.

After decrypting and decompressing the object content, the node object content can be obtained, and the data can be effectively reduced in dimension by replacing the object content, so that the calculation amount of the PDF detection method is reduced, and the detection efficiency of the PDF detection method is improved.

S302: extracting features of the replacement data by adopting a language model;

For example, in some embodiments of the invention, an n-gram model may be employed to perform feature extraction on the replacement data. The n-gram model is a mature language model in the art, and its specific implementation is not described herein.

S303: the features are selected based on the frequency of occurrence of the features to generate feature data.

It should be noted that the feature may be selected according to the frequency of occurrence of the feature. For example, a feature whose frequency of occurrence exceeds a threshold value may be selected as the feature data. Thus, the feature data can be reduced in dimension.

For example, feature extraction of JavaScript code in a PDF document may employ algorithm 2 as follows:

Algorithm 2: Feature extraction for JavaScript

First, the JavaScript code is extracted from the sample and processed according to the algorithm. Second, an n-gram method is applied to the replacement sequence to generate features for classification. A threshold is then set to filter out features that occur infrequently in the training dataset. The specific method is shown in the algorithm. Finally, the selected features are used to train random deep forest and support vector machine models to obtain a classifier model with high classification accuracy.
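The n-gram and threshold steps can be sketched in Python as follows. This is an illustration rather than the patent's Algorithm 2: the function names are hypothetical, the input strings are assumed to be already mapped to class symbols, and "frequency" is interpreted here as the number of training sequences containing an n-gram, which the source does not specify.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of a string."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def select_features(training_sequences, n=3, min_docs=2):
    """Keep n-grams that occur in at least min_docs training sequences."""
    doc_freq = Counter()
    for seq in training_sequences:
        doc_freq.update(set(ngrams(seq, n)))
    return sorted(g for g, c in doc_freq.items() if c >= min_docs)


def vectorize(sequence, vocabulary, n=3):
    """Count how often each selected n-gram occurs in one sequence."""
    counts = Counter(ngrams(sequence, n))
    return [counts[g] for g in vocabulary]


# Hypothetical replacement sequences used as a tiny training set.
train = ["BDBFBJCH", "BDBFBJBH", "IDBGCF"]
vocab = select_features(train, n=3, min_docs=2)
vector = vectorize("BDBFBDBF", vocab, n=3)
```

The resulting count vectors are what a downstream classifier would be trained on; the classifier itself is left out of this sketch.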

According to some embodiments of the present invention, as shown in fig. 4, replacing object contents of a node according to a preset rule to obtain replacement data includes:

S401: classifying the character set of the content of the node;

S402: establishing a mapping character for each class of characters, and replacing the characters in each class with the corresponding mapping character.

In some embodiments of the invention, the number of mapping characters is less than 30. Thus, after substitution, the number of character types is smaller than 30 and the features are less sensitive to variations introduced by code obfuscation, which not only improves robustness but also reduces the difficulty of feature dimension reduction.

In the related art, n-gram analysis is performed directly on the byte sequence of malicious software, but because of complex file formats and encodings, n-grams applied this way are not very effective. As n increases, the feature space rapidly explodes. For example, when n=3, there are more than two million features, which makes feature selection and dimension reduction difficult.

In addition, if the above method is applied, modifying a single character is likely to change many features, which increases the sensitivity of the feature vector values and decreases feature stability and robustness, so the trained classifier model is easily evaded by simple code obfuscation.

In the present invention, to reduce the character space and the influence of code obfuscation, characters are grouped into classes and each character is replaced with its class. The visible ASCII character set is classified, and a mapping is established that reduces the feature space of the visible character set from 128 to below 30.

For example, the substitution rules in the following table may be used in the present application to replace the character set of the content of JavaScript code in the node:

Type	Example	Replaced by
Blank character	\n \r \t	none
Capital letter	A-Z	A
Lowercase letter	a-z	B
Digit	0-9	C
Parenthesis	( )	D
Square bracket	[ ]	E
Curly brace	{ }	F
Comparison	> < <= >= ==	G
Separator	, . : ;	H
Keyword	if else while for	I
Calculation	+ - += -= = …	J
Logic operation	&& || and or	K
Quotation mark	' "	L

It will be appreciated that replacing the content of the node using the replacement rules described above reduces the sensitivity of the feature values to individual character variations.
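A minimal Python sketch of the replacement rules in the table above is given below. It uses a single-pass tokenizer so that multi-character tokens (keywords, `>=`, `+=`, `&&`) are matched before single characters. The operator set for the "Calculation" row is limited to the examples listed, and characters not covered by any row are simply dropped; both choices are assumptions, as are the names `TOKEN_CLASSES` and `replace_characters`.

```python
import re

# (replacement symbol, regular expression); earlier entries take priority.
TOKEN_CLASSES = [
    ("",  r"\s+"),                            # blank characters are removed
    ("I", r"\b(?:if|else|while|for)\b"),      # keywords
    ("K", r"&&|\|\||\b(?:and|or)\b"),         # logic operations
    ("G", r"<=|>=|==|<|>"),                   # comparison
    ("J", r"\+=|-=|[+\-=]"),                  # calculation
    ("A", r"[A-Z]"),                          # capital letters
    ("B", r"[a-z]"),                          # lowercase letters
    ("C", r"[0-9]"),                          # digits
    ("D", r"[()]"),                           # parentheses
    ("E", r"[\[\]]"),                         # square brackets
    ("F", r"[{}]"),                           # curly braces
    ("H", r"[,.:;]"),                         # separators
    ("L", r"['\"]"),                          # quotation marks
]
PATTERN = re.compile("|".join(f"({p})" for _, p in TOKEN_CLASSES))


def replace_characters(code):
    """Map source text to a sequence of class symbols in one pass."""
    out = []
    for match in PATTERN.finditer(code):
        # lastindex is the 1-based number of the outer group that matched.
        out.append(TOKEN_CLASSES[match.lastindex - 1][0])
    return "".join(out)


replaced = replace_characters("if (x >= 10) { y += 1; }")
```

Because renaming a variable from `x` to `q` leaves the output unchanged, small edits of this kind no longer alter the downstream n-gram features.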

In some embodiments of the present invention, as shown in fig. 5, the feature data is input into a pre-constructed detection model to be processed to obtain a classification result, including:

S501: determining the number of clusters and the cluster centers by adopting a clustering method;

It should be noted that the clustering method may be a multi-center clustering method that is mature in the art, by which the number of clusters and the cluster centers may be determined. The specific implementation is a conventional technical means in the field and is not described in detail herein.

S502: calculating the distance between the feature data and the cluster center of the corresponding category;

S503: obtaining a classification result according to the distance.

That is, after the number of clusters and the cluster centers are acquired, the classification result of the feature data can be obtained by calculating the distance from the feature data to the cluster center of the category to which it belongs. Therefore, the robustness of the classifier can be improved while maintaining high PDF classification accuracy. Even if an attacker learns part of the classifier's design details, the difficulty of crafting evasive samples is greatly increased, and the influence of adversarial attacks on the classifier is reduced.
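Steps S502 and S503 can be illustrated with a short Python sketch. It assumes Euclidean distance, treats the nearest center as "the center of the corresponding category", and uses a fixed cutoff; the centers, the threshold value, and the function names are hypothetical.

```python
import math


def nearest_center_distance(vector, centers):
    """Euclidean distance from a feature vector to its closest cluster center."""
    return min(math.dist(vector, c) for c in centers)


def classify(vector, centers, threshold):
    """Return 1 (anomalous) if the vector is farther than threshold from every center."""
    return 1 if nearest_center_distance(vector, centers) > threshold else 0


centers = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical centers for one object type
threshold = 2.0                         # hypothetical distance cutoff
near = classify((0.5, 0.5), centers, threshold)
far = classify((5.0, 5.0), centers, threshold)
```

A point close to either center is labelled normal, while a point lying between the two clusters is flagged, which is the behaviour a single-center model could not express.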

According to some embodiments of the present invention, as shown in fig. 6, a method for generating a pre-constructed detection model includes:

S601: classifying the nodes based on the object types of the nodes;

S602: training each type of node based on a multi-center clustering method to obtain a detection model.

It should be noted that the objects are extracted after decoding and decryption and are first organized individually according to their type (e.g., /Catalog, /Action, etc.). Unlike with JavaScript code, it is difficult to determine whether malicious code is hidden in these objects, because the number of malicious objects is very small compared with the number of objects in the entire file, and a large amount of manual identification would be required to locate them. Thus, feature extraction is applied only to benign datasets. The feature extraction process for each type of object is shown in the following Algorithm 3:

Algorithm 3: Feature extraction for different types of objects

The basic steps are essentially the same as in Algorithm 1 above, the difference being that the algorithm runs only on benign sample sets. Furthermore, according to the guiding principles, a feature such as entropy is added for each type as a redundant feature for verification.
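The source names entropy as one possible redundant feature without defining it. A common choice is Shannon entropy over the symbols of the object content, sketched here in Python under that assumption.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy, in bits per symbol, of a str or bytes sequence."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy("aaaa")   # a single repeated symbol
mixed = shannon_entropy("abab")     # two equally likely symbols
```

Compressed or encrypted payloads tend to have entropy close to the maximum for their alphabet, which is why such a value can serve as a cross-check on the other content features.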

An anomaly detection model based on a multi-center clustering algorithm is trained for each type of object using the extracted features. Since object content varies according to its function, a multi-center clustering model must be trained instead of a one-class support vector machine (OSVM). The feature vectors are clustered using an algorithm such as K-means, and the distance from each vector to the center of the cluster it belongs to is calculated. Then, a quantile of the distances is taken as the threshold for detecting outliers.
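The training step just described can be sketched in Python with a plain Lloyd-style K-means followed by a quantile cutoff. The farthest-point initialization (chosen to keep the example deterministic), the fixed iteration count, the 0.95 quantile, and the toy points are all assumptions; the source only says "an algorithm like K-means" and "the quantile of the distance".

```python
import math


def init_centers(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns k cluster centers as tuples."""
    centers = init_centers(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers


def distance_threshold(points, centers, quantile=0.95):
    """Quantile of the benign points' distances to their nearest center."""
    distances = sorted(min(math.dist(p, c) for c in centers) for p in points)
    position = min(len(distances) - 1, int(quantile * len(distances)))
    return distances[position]


# Hypothetical benign feature vectors forming two groups.
benign = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = kmeans(benign, k=2)
threshold = distance_threshold(benign, centers)
```

A new vector whose distance to every center exceeds `threshold` would then be reported as an outlier for that object type.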

It should be noted that classification models for conventional tasks, such as image recognition and credit assessment, rest on a default assumption that training data and actual data share approximately the same probability density distribution over the selected features. This assumption is easily satisfied because training data is collected from the real world with little concept drift. However, the situation changes when machine learning is applied to network security fields such as malicious document classification, because of the arms race between attackers and defenders. An attacker will manipulate malicious samples so that they approach benign samples under the selected features without affecting malicious functionality. This results in poor robustness of the classifier and a rapid decrease in classification accuracy after the dataset changes.

The invention provides three guiding principles of feature selection:

Causality: causality is used to measure the relationship between class labels and selected features. In general, features with high correlation are preferred during training. This is not a problem in ordinary tasks, where there are no adversarial attacks and highly correlated features help the classifier build a high-precision classification model. However, some highly correlated features have little causal relationship with the class labels. For example, the number of drownings in swimming pools is highly correlated with average ice cream consumption, but the causality between them is low, since both are caused by high temperatures. In the field of network security, features such as structural paths and metadata selected by classification systems such as PDFRate and Hidost have high correlation with class labels but low causality. Manual analysis has found that these features are not necessarily related to the maliciousness of a PDF. These methods achieve accuracy above 99%, but their accuracy drops rapidly to nearly 0% under the EvadeML attack. Features such as shellcode, heap spraying, and JavaScript obfuscation are highly causal to sample maliciousness, as they are essential to implementing the malicious function. Finding features with high causality to class labels may be difficult, but deleting features with low causality is relatively simple.

Collision resistance: an attacker tries to modify malicious samples so that their features are close to those of benign samples in order to evade the PDF malware classifier. To increase the attacker's cost, we tend to choose features that are difficult to imitate, a property called collision resistance. In cryptography, collision resistance means that, given a one-way hash function f(x) and a message m, it is difficult to find another message n that satisfies f(m) = f(n). This concept is carried over to feature selection here. Given a neighborhood δ, a feature extraction function f(x), and a benign sample b, it is difficult to find a malicious sample m that satisfies d(f(m), f(b)) < δ, where d(·, ·) is a distance metric between vectors, such as L1, L2, or L∞. High collision resistance requires a one-way feature extraction function: it is easy to obtain feature vectors from PDF samples, but difficult to recover the corresponding content from the feature vectors.

Redundancy: when facing high-dimensional feature data, conventional machine learning algorithms tend to eliminate redundant features and preserve relatively independent features by PCA (principal component analysis) and other dimension reduction methods prior to training. High dimensionality results in data sparsity and increases overfitting during training, which has a negative impact on generalization ability. However, an attacker always tries to obtain information about the classifier and modify the malicious samples accordingly. Assuming that an attacker cannot obtain all feature information, the present invention adds additional features, called redundant features, to detect whether there is a potential attack on the classifier. The intuition behind redundant features comes from the Cyclic Redundancy Check (CRC) in the field of data communications. The CRC is a data transmission error detection function that performs a polynomial computation on the data and appends the result to the frame. The receiving device performs the same computation to check whether the data has been modified, ensuring integrity. In the field of feature selection, the invention proposes the concept of feature redundancy: a polynomial computation over other (partial) feature values is treated as an additional feature of the existing feature set. When only part of the feature values are tampered with, the redundant features show a large variation. Feature redundancy provides verification of the original features, thus increasing the computational complexity for an attacker who does not know all the features.
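The source does not give the polynomial used for redundant features. Purely as an illustration of the idea, the Python sketch below appends one extra value computed as a weighted sum of powers of the existing integer feature values; the coefficients, the choice of powers, and the function names are hypothetical.

```python
def redundant_feature(features, coefficients):
    """Weighted polynomial over feature values: sum of coef_i * value_i ** i."""
    total = 0
    for degree, (value, coef) in enumerate(zip(features, coefficients), start=1):
        total += coef * value ** degree
    return total


def extend(features, coefficients):
    """Return the feature vector with the redundant feature appended."""
    return list(features) + [redundant_feature(features, coefficients)]


coefficients = [2, 5, 7]                       # hypothetical, kept secret from the attacker
original = extend([3, 1, 4], coefficients)     # untouched sample
tampered = redundant_feature([3, 1, 2], coefficients)  # one feature altered
```

Changing only the third feature from 4 to 2 moves the redundant value from 459 to 67, so an attacker who adjusts a few visible features without knowing the polynomial shifts the extra feature in a way the benign training data does not.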

In addition, in the present application, JavaScript code may be detected directly, including:

replacing the code content according to rules;

extracting features by using an n-gram method, and carrying out feature selection and feature generation according to a guiding principle;

training the classification model with a supervised machine learning algorithm. For malicious JavaScript code that can be effectively extracted, the detection accuracy can reach almost 100%. A classifier based on content features can also effectively distinguish the types of vulnerabilities used.

Therefore, three guiding principles of feature selection are provided. Structural features, content features, and the local correlation of the horizontal and vertical connections between different types of features are fully utilized. A two-stage machine-learning classifier and a classifier targeting JavaScript code are trained, and the robustness of the classifier is enhanced while high accuracy is maintained.

An electronic device according to an embodiment of the present invention includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for detecting the malicious PDF document.

The electronic equipment of the embodiment of the invention adopts the malicious PDF document detection method as follows:

As shown in FIG. 7, the first stage uses an n-gram method for feature extraction, performs feature selection and feature generation according to the guiding principles, and uses structural features and cluster features for first-stage training.

Training an anomaly detection model based on a multi-center clustering method for each type of object;

merging the tree structures according to the similarity of the types and generating a structure matrix;

In the second stage, the model trained above is first applied to the input dataset, and the model output is then combined with the structure matrix to construct an extended structure matrix as the input to the CNN algorithm. The structure of the CNN is able to capture local features and maintain the connections between different types according to the expansion matrix; the specific CNN structure used is shown in FIG. 8. For training, the classification results of the first-stage model and the structure are merged into an expansion matrix, which is used as the input to the convolutional neural network model.
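The source does not specify how the first-stage output and the structure matrix are laid out in the expansion matrix. One possible arrangement, shown here only as an assumption, is to append each object type's first-stage score as an extra column of that type's row. The function name and the sample values are hypothetical.

```python
def extend_structure_matrix(structure, type_scores):
    """Append one per-type content score to each row of the structure matrix."""
    if len(structure) != len(type_scores):
        raise ValueError("need exactly one score per object type")
    return [list(row) + [score] for row, score in zip(structure, type_scores)]


structure = [[0, 2], [1, 0]]    # hypothetical type-by-type structure matrix
type_scores = [0.1, 0.9]        # hypothetical first-stage outputs, one per type
expanded = extend_structure_matrix(structure, type_scores)
```

The result is still a two-dimensional array, so it can be passed to a convolutional network in the same way as the structure matrix alone.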

In this way, the accuracy of identifying general PDF documents can be improved. To make fuller use of content feature information, a classifier is trained on a dataset labeled with vulnerability numbers to classify the vulnerabilities used in malicious files. Only a small portion of the samples is used for training, and the classifier is tested over the entire dataset. Test results show that the identification accuracy of the classifier exceeds 97%.

According to the computer readable storage medium of the embodiment of the invention, an implementation program for information transmission is stored on the computer readable storage medium, and when the program is executed by a processor, the steps of the method for detecting the malicious PDF document are implemented.

While the invention has been described in connection with specific embodiments thereof, it is to be understood that the drawings and embodiments are illustrative only, and that the invention is not limited thereto.

Claims

1. A method for detecting a malicious PDF document, comprising:

the convolutional neural network outputs the detection result of the PDF document;

the extracting the tree structure of the PDF document and generating a structure matrix based on the tree structure comprises the following steps:

classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into the structure matrix based on classification results;

the method for generating the pre-constructed detection model comprises the following steps:

classifying the nodes based on their object types;

2. The method for detecting a malicious PDF document according to claim 1, wherein the feature extracting the object content of the node of the tree structure to obtain feature data includes:

extracting features of the replacement data by adopting a language model;

3. The method for detecting a malicious PDF document according to claim 2, wherein the replacing the object content of the node according to a preset rule to obtain replacement data includes:

classifying a character set of the content of the node;

4. A method of detecting a malicious PDF document according to claim 3, wherein the number of mapped characters is less than 30.

5. The method for detecting a malicious PDF document according to claim 2, wherein the language model is an n-gram model.

6. The method for detecting a malicious PDF document according to claim 1, wherein the feature data is input into a detection model constructed in advance to be processed to obtain a classification result, comprising:

and obtaining the classification result according to the distance.

7. A computer-readable storage medium, wherein a program for realizing information transfer is stored on the computer-readable storage medium, which when executed by a processor, realizes the steps of the method for detecting a malicious PDF document according to any one of claims 1 to 6.

8. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method of detecting a malicious PDF document according to any one of claims 1 to 6.