CN112241530A - Malicious PDF document detection method and electronic equipment - Google Patents

Malicious PDF document detection method and electronic equipment

Info

Publication number
CN112241530A
Authority
CN
China
Prior art keywords
pdf document
malicious
tree structure
detecting
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910655086.4A
Other languages
Chinese (zh)
Other versions
CN112241530B (en)
Inventor
祝跃飞
芦斌
何康
刘龙
林伟
陈岩
费金龙
舒辉
李红帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201910655086.4A priority Critical patent/CN112241530B/en
Publication of CN112241530A publication Critical patent/CN112241530A/en
Application granted granted Critical
Publication of CN112241530B publication Critical patent/CN112241530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Hardware Design (AREA)
  • Virology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer And Data Communications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for detecting a malicious PDF document and electronic equipment. The detection method comprises the following steps: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure; performing feature extraction on the object contents of the nodes of the tree structure to obtain feature data; inputting the feature data into a pre-constructed detection model for processing to obtain a classification result; combining the classification result and the structure matrix into an extended matrix and inputting the extended matrix into a convolutional neural network; and outputting, by the convolutional neural network, the detection result of the PDF document. The detection method according to the embodiments of the invention performs two-stage detection based on both the structure and the content of the PDF document, thereby effectively improving the accuracy and reliability of malicious PDF detection.

Description

Malicious PDF document detection method and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a method for detecting a malicious PDF document and electronic equipment.
Background
Because the Portable Document Format (PDF) is efficient and stable, it is widely used for document exchange, and PDF files have become an important carrier for network attacks. A typical case is a phishing attack delivered by e-mail against governments and large enterprises. Since most mail servers block executable files attached to e-mails for security reasons, PDF files play an increasingly important role in recent network attacks. Average users consider non-executable files safer than executable files, which lowers the recipient's suspicion of attached files. However, PDF files can be as dangerous as executable files: an attacker can exploit vulnerabilities in the document format to gain unauthorized access to a host.
One important reason why PDF files are insecure is the rich functionality supported by Adobe Reader (the most widely used PDF reader), especially its support for JavaScript. JavaScript extends what PDF documents can do, enabling complex tasks such as form validation and calculation. However, it also gives an attacker the ability to run arbitrary code by exploiting vulnerabilities in the Adobe JavaScript engine.
Traditional malicious PDF detection algorithms are relatively subjective in feature extraction. Although the selected features correlate strongly with the classification results and the classification accuracy on a test set is high, this result rests on the premise that test samples follow the same probability density distribution over the selected features as the training set. Once attackers learn which features the classifier uses, they modify samples and forge feature values; the premise then no longer holds and the classification accuracy drops rapidly.
The robustness of the classifier therefore needs to be improved while high PDF classification accuracy is maintained, so that even if an attacker knows some design details of the classifier, crafting an evasive sample becomes much more difficult and the impact of adversarial attacks on the classifier is reduced.
In the related art, most detection techniques are based on signatures and rigid heuristics, so they cannot detect files that differ only slightly from existing malicious files. Machine learning methods are popular for detecting spam, malware, and network intrusions, and they can also be used to classify PDF files. Existing machine learning algorithms train PDF classification models on static and dynamic features. Static feature vectors are obtained directly by processing the document, whereas dynamic feature vectors are obtained by monitoring the behavior of samples running in a purpose-built virtual environment. In general, static features have difficulty coping with obfuscation, encryption, and deeply hidden malicious code, while acquiring dynamic features requires building a large number of heterogeneous operating environments, which consumes considerable resources and is easily circumvented by time delays, required user interaction, and other techniques. These models appear excellent because they achieve high accuracy on test data sets; for example, a model using structural path features achieves over 99% accuracy in the PDF malware classification task. However, AutoEvader shows that such a detection system can be evaded with a 100% success rate by making small structural modifications to a malicious PDF file without damaging its malicious function. Mimicry attacks and reverse mimicry attacks against machine learning-based classifiers are very effective.
Disclosure of Invention
The invention provides a method for detecting a malicious PDF document and electronic equipment, and aims to solve the technical problem of how to improve the accuracy of malicious PDF document detection.
The method for detecting the malicious PDF document comprises the following steps:
extracting a tree structure of a PDF document, and generating a structure matrix based on the tree structure;
performing feature extraction on the object content of the nodes of the tree structure to obtain feature data;
inputting the characteristic data into a pre-constructed detection model for processing to obtain a classification result;
combining the classification result and the structural matrix into an extended matrix and inputting the extended matrix into a convolutional neural network;
and the convolutional neural network outputs the detection result of the PDF document.
The detection method according to the embodiments of the invention performs two-stage detection based on both the structure and the content of the PDF document, thereby effectively improving the accuracy and reliability of malicious PDF detection. Extracting the PDF tree structure and generating the structure matrix yields the structural feature information of the PDF document. Extracting features from the node contents yields the content feature information of the PDF document, which the detection model processes to obtain a classification result. Finally, the classification result and the structural features are combined into an extended matrix, which is input into the convolutional neural network to obtain the detection result of the PDF document.
According to some embodiments of the invention, the extracting a tree structure of the PDF document and generating a structure matrix based on the tree structure comprises:
extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;
and classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into the structural matrix based on the classification result.
In some embodiments of the present invention, the extracting the feature of the object content of the node of the tree structure to obtain the feature data includes:
replacing the object content of the node according to a preset rule to obtain replacement data;
extracting the characteristics of the replacement data by adopting a language model;
the features are selected based on their frequency of occurrence to generate feature data.
According to some embodiments of the present invention, the replacing the object content of the node according to a preset rule to obtain the replacement data includes:
classifying a character set of the contents of the node;
and establishing mapping characters corresponding to each type of character set, and replacing characters in each type of character set by the mapping characters.
In some embodiments of the invention, the number of mapping characters is less than 30.
According to some embodiments of the invention, the language model is an n-gram model.
In some embodiments of the present invention, the feature data is input into a pre-constructed detection model to be processed to obtain a classification result, including:
determining the clustering number and the clustering center by adopting a clustering method;
calculating the distance between the characteristic data and the clustering center of the corresponding category;
and obtaining the classification result according to the distance.
According to some embodiments of the invention, the method for generating the pre-constructed detection model comprises:
classifying the node based on an object type of the node;
and training each type of node based on a multi-center clustering method to obtain the detection model.
According to the computer-readable storage medium of the embodiment of the present invention, the computer-readable storage medium stores an implementation program of information transfer, and the program, when executed by a processor, implements the steps of the above-mentioned method for detecting a malicious PDF document.
According to the computer-readable storage medium of the embodiment of the invention, by executing the detection method of the malicious PDF document, two-stage detection is carried out based on the structure and the content of the PDF document, and the accuracy and the reliability of malicious PDF detection are effectively improved.
An electronic device according to an embodiment of the present invention includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of detecting a malicious PDF document as described above.
According to the electronic equipment provided by the embodiment of the invention, by executing the detection method of the malicious PDF document, two-stage detection is carried out based on the structure and the content of the PDF document, so that the accuracy and the reliability of malicious PDF detection are effectively improved.
Drawings
FIG. 1 is a flowchart of a method for detecting a malicious PDF document according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method of generating a structure matrix according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method of generating feature data according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method of node character set replacement according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method of feature classification according to an embodiment of the present invention;
FIG. 6 is a flowchart of a detection model generation method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a malicious PDF document detection model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a convolutional neural network algorithm according to an embodiment of the present invention.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.
As shown in fig. 1, the method for detecting a malicious PDF document according to an embodiment of the present invention includes:
S101: extracting a tree structure of the PDF document, and generating a structure matrix based on the tree structure.
Extracting the tree structure of the PDF document allows a structure matrix to be generated from it, which captures the structural feature information of the PDF document.
S102: performing feature extraction on the object contents of the nodes of the tree structure to obtain feature data.
Extracting features from the object contents of the nodes captures the content feature information of the PDF document.
S103: inputting the feature data into a pre-constructed detection model for processing to obtain a classification result.
Inputting the feature data into the pre-constructed detection model classifies the content feature information and yields a feature analysis result for the PDF content.
S104: combining the classification result and the structure matrix into an extended matrix and inputting the extended matrix into a convolutional neural network.
That is, once the structural features and the content features of the PDF document have been acquired, they are combined as the input to the convolutional neural network, which improves the accuracy and reliability of malicious PDF document detection.
S105: outputting, by the convolutional neural network, the detection result of the PDF document.
In the present application, the execution order of steps S101 to S105 is not limited; that is, the steps need not be executed in the order S101 to S105.
The detection method according to the embodiments of the invention performs two-stage detection based on both the structure and the content of the PDF document, thereby effectively improving the accuracy and reliability of malicious PDF detection. Extracting the PDF tree structure and generating the structure matrix yields the structural feature information of the PDF document. Extracting features from the node contents yields the content feature information of the PDF document, which the detection model processes to obtain a classification result. Finally, the classification result and the structural features are combined into an extended matrix, which is input into the convolutional neural network to obtain the detection result of the PDF document.
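The description does not specify how the classification result and the structure matrix are laid out in the extended matrix. The following is a minimal sketch under one assumed layout, in which one content score per object type is appended as an extra column and an extra row; the function name `build_extended_matrix` and the zero padding of the corner cell are illustrative assumptions, not part of the disclosure.

```python
from typing import List


def build_extended_matrix(structure: List[List[float]],
                          type_scores: List[float]) -> List[List[float]]:
    """Append per-type content scores to a square structure matrix.

    Assumption (not from the source): the scores are attached as one extra
    column and one extra row, so an n x n matrix becomes (n+1) x (n+1).
    """
    n = len(structure)
    if any(len(row) != n for row in structure):
        raise ValueError("structure matrix must be square")
    if len(type_scores) != n:
        raise ValueError("need exactly one score per object type")
    # Extra column: the score of the row's object type.
    extended = [list(row) + [score] for row, score in zip(structure, type_scores)]
    # Extra row: the same scores again; the corner cell is padding.
    extended.append(list(type_scores) + [0.0])
    return extended


structure = [[0, 1], [0, 0]]   # toy 2 x 2 structure matrix
scores = [0.0, 1.0]            # toy first-stage results, one per type
extended = build_extended_matrix(structure, scores)
```

The resulting two-dimensional array could then be passed to a convolutional neural network in the same way as a single-channel image.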
As shown in fig. 2, according to some embodiments of the present invention, extracting a tree structure of a PDF document and generating a structure matrix based on the tree structure includes:
S201: extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;
S202: classifying the nodes according to their object types, and converting the adjacency matrix into the structure matrix based on the classification result.
Note that a PDF file has a tree-like logical structure formed by the relationships between its basic objects. The related art focuses on structural path features, i.e., the vertical relationships from the root node to the leaf nodes, and ignores the horizontal relationships, so the parallel relationships between nodes that share the same parent or ancestor node are lost. The invention considers both vertical and horizontal connections. The structure of a PDF file can be described by an adjacency matrix, a classical tool for describing graphs.
To extract local features of the structure matrix in the horizontal and vertical directions, a convolutional neural network (CNN) is applied to the classification of PDF files. CNNs achieve state-of-the-art performance in image classification and extract local features of an image using convolution kernels. Because the structure matrix and an image are both two-dimensional arrays, a convolutional neural network can likewise capture the relationships of the file structure in the horizontal and vertical directions. Furthermore, many techniques are available to enhance the robustness of the classifier, and the nature of the PDF format limits the degrees of freedom of the elements in the structure matrix, which makes the classifier more robust.
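To illustrate how a convolution kernel picks up local horizontal and vertical patterns in a two-dimensional array, the following sketch applies two small hand-written kernels to a toy matrix. It is only an illustration of the convolution operation itself; the kernels, the toy matrix, and the function name `conv2d_valid` are assumptions, and a trained network would learn its own kernels.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` without padding.

    This is cross-correlation, which is what CNN libraries usually call
    convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 3 x 3 structure matrix (values are illustrative only).
structure = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
horizontal = [[1, 1]]    # sums two neighbouring entries in a row
vertical = [[1], [1]]    # sums two neighbouring entries in a column

row_response = conv2d_valid(structure, horizontal)
col_response = conv2d_valid(structure, vertical)
```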
The node objects of a PDF document come in many types, such as font objects and page objects. After the adjacency matrix is generated from the tree structure, nodes can be classified and merged according to their object types, so that identical and similar objects fall into the same class. Similar objects here means functionally similar objects; for example, objects of different fonts may all be grouped into one class. This classification and merging reduces the dimension of the adjacency matrix and yields the structure matrix.
The extraction of the structure matrix is shown in Algorithm 1:
Algorithm 1: structural feature extraction
[Figure BDA0002136625800000071: listing of Algorithm 1]
The PDF structure is first represented as an adjacency matrix; types with similar functionality are then merged into one, and types with low frequencies are filtered out. Finally, the structure matrix is formed over the Cartesian product of the selected types. This reduces the amount of data processed during PDF document detection and improves detection efficiency.
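The steps just described (merge similar types, filter rare types, build a type-by-type matrix) can be sketched as follows. This is not the listing of Algorithm 1, which is only available as an image; the merge table `MERGE`, the `min_count` parameter, and the choice to store reference counts rather than 0/1 flags are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical merge table: functionally similar object types share one label.
MERGE = {"/Type1": "/Font", "/TrueType": "/Font"}


def structure_matrix(node_types: Dict[int, str],
                     edges: List[Tuple[int, int]],
                     min_count: int = 1) -> Tuple[List[str], List[List[int]]]:
    """Return (kept_types, matrix).

    matrix[i][j] counts parent -> child references from kept_types[i]
    to kept_types[j]. Types seen fewer than min_count times are dropped.
    """
    merged = {node: MERGE.get(t, t) for node, t in node_types.items()}
    freq = Counter(merged.values())
    kept = sorted(t for t, c in freq.items() if c >= min_count)
    index = {t: i for i, t in enumerate(kept)}
    matrix = [[0] * len(kept) for _ in kept]
    for parent, child in edges:
        p, c = merged[parent], merged[child]
        if p in index and c in index:
            matrix[index[p]][index[c]] += 1
    return kept, matrix


# Toy document tree: object id -> type, plus parent -> child references.
node_types = {1: "/Catalog", 2: "/Pages", 3: "/Page", 4: "/Type1", 5: "/TrueType"}
edges = [(1, 2), (2, 3), (3, 4), (3, 5)]
kept, matrix = structure_matrix(node_types, edges)
```

Because the matrix is indexed by object type rather than by individual object, its size depends only on the number of retained types and not on the number of objects in the file.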
In some embodiments of the present invention, as shown in fig. 3, performing feature extraction on object contents of nodes in a tree structure to obtain feature data includes:
S301: replacing the object content of the node according to a preset rule to obtain replacement data.
It should be noted that the contents of a node object are obtained after the object is decrypted and decompressed. Replacing the contents effectively reduces the dimension of the data, which reduces the computation required by the PDF detection method and improves its detection efficiency.
S302: extracting features of the replacement data by adopting a language model.
For example, in some embodiments of the present invention, an n-gram model may be used to perform feature extraction on the replacement data. The n-gram model is a mature language model in the field, and its specific implementation is not described here.
S303: selecting features based on their frequency of occurrence to generate the feature data.
For example, a feature whose frequency of occurrence exceeds a threshold may be selected as feature data, which reduces the dimension of the feature data.
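As a minimal sketch of steps S302 and S303, the code below counts character n-grams over already-replaced sequences and keeps those that occur at least a given number of times in the training set. The function names, the toy sequences, and the use of a raw count as the frequency threshold are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List


def ngrams(sequence: str, n: int) -> List[str]:
    """All contiguous substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def select_features(samples: Iterable[str], n: int, min_count: int) -> List[str]:
    """Keep n-grams whose total count across the samples reaches min_count."""
    counts = Counter()
    for s in samples:
        counts.update(ngrams(s, n))
    return sorted(g for g, c in counts.items() if c >= min_count)


def vectorize(sequence: str, features: List[str], n: int) -> List[int]:
    """Count how often each selected n-gram occurs in one sequence."""
    counts = Counter(ngrams(sequence, n))
    return [counts[f] for f in features]


training = ["BDBDF", "BDBH", "ACB"]   # toy replaced sequences
features = select_features(training, n=2, min_count=2)
vector = vectorize("BDBD", features, n=2)
```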
For example, Algorithm 2 below can be used for feature extraction from JavaScript code in PDF documents:
Algorithm 2: feature extraction algorithm for JavaScript
[Figure BDA0002136625800000081: listing of Algorithm 2]
The JavaScript code is extracted from the sample and first processed according to the algorithm above. Next, the n-gram method is applied to the replaced sequence to generate features for classification. A threshold is then set to filter out features that occur infrequently in the training data set. The selected features are used to train a random deep forest and a support vector machine model, yielding a classifier model with high classification accuracy.
According to some embodiments of the present invention, as shown in fig. 4, replacing the object content of the node according to a preset rule to obtain replacement data includes:
S401: classifying the character set of the contents of the node;
S402: establishing a mapping character for each class of characters, and replacing the characters in each class with the corresponding mapping character.
In some embodiments of the invention, the number of mapping characters is less than 30. Thus, after substitution, the number of character types is smaller than 30 and the features are less sensitive to changes introduced by code obfuscation, which both improves robustness and makes feature dimension reduction easier.
It should be noted that, in the related art, n-gram analysis is performed directly on the byte sequence of the malware, but the complex file format and encoding make this application of n-grams largely meaningless. As n increases, the number of features explodes. For example, when n is 3 there are over two million features, which makes feature selection and dimensionality reduction difficult.
In addition, with that method the modification of a single character is likely to change many features, which makes the feature vector values more sensitive and reduces feature stability and robustness; the trained classifier model is therefore easily evaded by simple code obfuscation.
In the present invention, to reduce the character space and the influence of code obfuscation, the characters are classified and each character is replaced with the symbol of its class. The visible ASCII character set is classified and a mapping is established, reducing the feature space of the visible character set from 128 to below 30.
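The figure of over two million features is consistent with counting all 3-grams over a 128-symbol alphabet. The short calculation below compares that with the upper bound for an alphabet of 30 mapping characters; reading the figure this way is an interpretation, and `ngram_space` is an illustrative helper name.

```python
def ngram_space(alphabet_size: int, n: int) -> int:
    """Number of distinct n-grams over an alphabet of the given size."""
    return alphabet_size ** n


raw = ngram_space(128, 3)     # 2,097,152 possible 3-grams
mapped = ngram_space(30, 3)   # 27,000 possible 3-grams (upper bound)
reduction = raw / mapped      # roughly a 78-fold reduction
```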
For example, in the present application, the following replacement rules may be adopted to replace the character set of the content of the JavaScript code in the node:
Type                 Examples              Replaced with
Whitespace           \n \r \t              (none)
Capital letters      A-Z                   A
Lowercase letters    a-z                   B
Digits               0-9                   C
Parentheses          ( )                   D
Square brackets      [ ]                   E
Braces               { }                   F
Comparison           > < <= >= ==          G
Separators           , . : ;               H
Keywords             if else while for     I
Operations           + - += -= = …         J
Logical operations   && || and or          K
Quotation marks      ' "                   L
It can be understood that replacing the contents of the node according to the above rules both reduces the dimension and makes the feature values less sensitive to changes in individual characters.
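The replacement rules above can be sketched with a single regular expression whose alternatives are tried in order, so that keywords and multi-character operators are matched before single characters. Several details are assumptions: the operator list covers only the operators shown before the ellipsis, characters outside the listed classes are left unchanged, and the names `RULES` and `replace_characters` are illustrative.

```python
import re

# Order matters: multi-character tokens are tried before single characters.
RULES = [
    ("",  r"\s+"),                          # whitespace is dropped
    ("I", r"\b(?:if|else|while|for)\b"),    # keywords
    ("K", r"&&|\|\||\b(?:and|or)\b"),       # logical operations
    ("G", r"<=|>=|==|<|>"),                 # comparison
    ("J", r"\+=|-=|\+|-|="),                # operations (partial list, assumed)
    ("A", r"[A-Z]"),                        # capital letters
    ("B", r"[a-z]"),                        # lowercase letters
    ("C", r"[0-9]"),                        # digits
    ("D", r"[()]"),                         # parentheses
    ("E", r"[\[\]]"),                       # square brackets
    ("F", r"[{}]"),                         # braces
    ("H", r"[,.:;]"),                       # separators
    ("L", r"['\"]"),                        # quotation marks
]
PATTERN = re.compile("|".join(f"(?P<g{i}>{p})" for i, (_, p) in enumerate(RULES)))


def replace_characters(code: str) -> str:
    """Map each token of `code` to the symbol of its character class."""
    def substitute(match):
        # lastgroup names the alternative that matched, e.g. "g3".
        return RULES[int(match.lastgroup[1:])][0]
    return PATTERN.sub(substitute, code)


replaced = replace_characters("if (x1 >= 10) { y = 'a'; }")
```

Two snippets that differ only in identifier names or literal digits, such as `var abc = 1;` and `var xyz = 7;`, map to the same sequence, which is the intended insensitivity to simple renaming.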
In some embodiments of the present invention, as shown in fig. 5, inputting the feature data into the pre-constructed detection model for processing to obtain a classification result includes:
S501: determining the number of clusters and the cluster centers by adopting a clustering method.
It should be noted that a multi-center clustering method that is mature in the art may be adopted to determine the number of clusters and the cluster centers. The specific implementation is a conventional technical means in the art and is not described here.
S502: calculating the distance between the feature data and the cluster center of the corresponding category;
S503: obtaining the classification result according to the distance.
That is, after the number of clusters and the cluster centers are determined, the classification result of the feature data is obtained by calculating its distance to the cluster center of the category to which it belongs. This improves the robustness of the classifier while maintaining high PDF classification accuracy: even if an attacker knows some design details of the classifier, crafting an evasive sample becomes much more difficult, and the impact of adversarial attacks on the classifier is reduced.
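Steps S502 and S503 can be sketched as a nearest-center distance test. The sketch assumes that cluster centers and a distance threshold are already available for each object type, that the relevant center is the nearest one, and that Euclidean distance is used; none of these choices is fixed by the description, and all names below are illustrative.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(x: Sequence[float],
             object_type: str,
             centers_by_type: Dict[str, List[Sequence[float]]],
             threshold_by_type: Dict[str, float]) -> int:
    """Return 1 (anomalous) if x is farther than the threshold from every
    cluster center of its object type, else 0."""
    nearest = min(euclidean(x, c) for c in centers_by_type[object_type])
    return int(nearest > threshold_by_type[object_type])


# Toy model for one object type (values are illustrative only).
centers = {"/Action": [(0.0, 0.0), (10.0, 10.0)]}
thresholds = {"/Action": 2.0}

near = classify((1.0, 1.0), "/Action", centers, thresholds)
far = classify((5.0, 5.0), "/Action", centers, thresholds)
```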
According to some embodiments of the invention, as shown in fig. 6, a method for generating a pre-constructed detection model includes:
S601: classifying the nodes based on their object types;
S602: training on each type of node based on a multi-center clustering method to obtain the detection model.
It should be noted that the objects are extracted after decoding and decryption and are first organized separately by type (e.g., /Catalog, /Action). Unlike with JavaScript code, it is difficult to determine whether malicious code is hidden in these objects: malicious objects are very few compared with the total number of objects in a file, and locating them would require a large amount of manual identification. Feature extraction is therefore applied only to benign data sets. The feature extraction process for each type of object is shown in Algorithm 3:
Algorithm 3: feature extraction algorithm for different types of objects
[Figures BDA0002136625800000101 and BDA0002136625800000111: listing of Algorithm 3]
The basic steps are essentially the same as in algorithm 1 above, with the difference that the algorithm runs only on benign sample sets. Further, following the guiding principles described below, features such as entropy are added for each type as redundant features for verification.
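The description names entropy as an example of an added feature without defining it. A common choice, assumed here, is the Shannon entropy of the byte distribution of the object content, measured in bits per byte.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte.

    Returns 0.0 for empty input.
    """
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


low = shannon_entropy(b"aaaa")               # a single repeated byte
mid = shannon_entropy(b"abab")               # two equally likely bytes
high = shannon_entropy(bytes(range(256)))    # every byte value once
```

Content that is compressed, encrypted, or heavily encoded tends to score close to the maximum of 8 bits per byte, while plain text scores noticeably lower.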
Using the extracted features, an anomaly detection model based on a multi-center clustering algorithm is trained for each type of object. Because object content varies with its function, a multi-center clustering model is trained rather than a one-class support vector machine (OSVM). The feature vectors are clustered with an algorithm such as K-means, and the distance from each vector to the center of the cluster to which it belongs is calculated. A quantile of these distances is then used as the threshold for detecting outliers.
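A minimal sketch of this training step for a single object type follows: plain K-means (Lloyd iterations) on benign feature vectors, followed by a distance quantile used as the outlier threshold. The seeding with the first k points, the fixed iteration count, the simple quantile rule, and all names are simplifying assumptions, not details taken from the description.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain Lloyd iteration; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def fit_detector(benign: List[Point], k: int,
                 quantile: float) -> Tuple[List[List[float]], float]:
    """Return (centers, threshold) learned from benign samples only."""
    centers = kmeans(benign, k)
    distances = sorted(min(dist(p, c) for c in centers) for p in benign)
    cut = min(len(distances) - 1, int(quantile * len(distances)))
    return centers, distances[cut]


# Toy benign feature vectors forming two groups.
benign = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers, threshold = fit_detector(benign, k=2, quantile=0.95)
```

A new feature vector whose distance to its nearest center exceeds `threshold` would then be treated as an outlier.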
It should be noted that classification models for traditional tasks, such as image recognition and credit evaluation, carry a default assumption that the training data and the real-world data share approximately the same probability density distribution over the selected features. This assumption is easily satisfied there, because the training data is collected from the real world and concept drift is small. However, when machine learning is applied to network security tasks such as malicious document classification, the situation changes because of the arms race between attackers and defenders. An attacker will manipulate a malicious sample so that it approximates a benign sample on the selected features without affecting its malicious functionality. As a result, the classifier is less robust and its accuracy drops when the data set changes.
The invention proposes three guiding principles of feature selection:
Causality: causality measures the relationship between class labels and selected features. In general, features with high correlation are preferred in the training process. This is not a problem for ordinary tasks, since in the absence of adversarial attacks highly correlated features help the classifier build a high-precision classification model. However, some highly correlated features have little causal connection to the class labels. For example, the number of people drowning in swimming pools is highly correlated with average ice cream consumption, but the causality between them is low because both are caused by high temperatures. In the field of network security, features such as structural paths and metadata, selected by classification systems like PDFrate and Hidost, are highly correlated with the class labels but only weakly causal. Manual analysis finds that these features are not necessarily related to the maliciousness of the PDF. These methods achieve accuracy above 99%, yet under the EvadeML attack their accuracy drops rapidly to nearly 0%. Features such as shellcode, heap spray, and JavaScript obfuscation are highly causal to the maliciousness of the sample, as they are essential to implementing the malicious function. Finding features with a high causal relationship to the class labels can be difficult, but deleting features with a low causal relationship is relatively simple.
Collision resistance: an attacker attempts to modify a malicious sample so that its features are close to those of a benign sample, in order to evade the PDF malware classifier. To increase the attacker's cost, we prefer features that are difficult to counterfeit; this property is called collision resistance. In cryptography, collision resistance means that, given a one-way hash function f(x) and a message m, it is difficult to find another message n that satisfies f(m) = f(n). This concept is carried over to feature selection here. Given a neighborhood δ, a feature extraction function f(x) and a benign sample b, it should be difficult to find a malicious sample m that satisfies d(f(m), f(b)) < δ, where d is a distance metric between vectors, such as L1, L2 or L∞. High collision resistance requires a one-way feature extraction function: feature vectors are easily obtained from PDF samples, but the corresponding content is difficult to recover from the feature vectors.
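The neighborhood condition in the collision-resistance principle can be illustrated with a minimal sketch. It only shows how the distance between the feature vector of a malicious sample and that of a benign sample would be compared against δ under the L1, L2 or L∞ metric. The function names `distance` and `collides` are hypothetical and do not come from the patent, and the feature vectors are assumed to be plain lists of numbers.

```python
import math


def distance(u, v, norm="l2"):
    """Distance between two equal-length feature vectors (L1, L2 or L-infinity)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if norm == "l1":
        return sum(diffs)
    if norm == "l2":
        return math.sqrt(sum(d * d for d in diffs))
    if norm == "linf":
        return max(diffs, default=0.0)
    raise ValueError(f"unknown norm: {norm}")


def collides(f_malicious, f_benign, delta, norm="l2"):
    """True if the malicious vector lies inside the delta-neighborhood of the benign one."""
    return distance(f_malicious, f_benign, norm) < delta
```

A feature set with high collision resistance is one for which an attacker finds it hard to produce a working malicious sample that makes `collides` return true for any benign sample.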
Redundancy: when faced with high-dimensional feature data, conventional machine learning pipelines tend to eliminate redundant features and retain relatively independent ones through PCA (principal component analysis) and other dimension reduction methods before training. High-dimensional data leads to data sparsity, and a model trained on it is more prone to overfitting, which harms its generalization ability. However, an attacker always tries to obtain information about the classifier and modify the malicious sample accordingly. Assuming that an attacker cannot obtain all feature information, the present invention adds additional features, referred to as redundant features, to detect whether there is a potential attack on the classifier. The intuition behind redundant features comes from the Cyclic Redundancy Check (CRC) in the field of data communications. A CRC is a data transmission error detection function that performs a polynomial calculation on the data and appends the result to the frame. The receiving device performs the same calculation to check whether the data has been modified, thereby ensuring integrity. In the field of feature selection, the present invention proposes the concept of feature redundancy: a polynomial calculation over the other (or some of the other) feature values is treated as an additional feature of the existing feature set. When only some of the feature values are disguised, the redundant features will show a large difference. Feature redundancy provides verification of the original features and thus increases the computational cost for an attacker who does not know the full feature set.
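The redundancy principle can be sketched as follows. The patent does not specify the polynomial, so this example assumes integer-valued features (for instance counts) and evaluates a polynomial over the selected feature values with Horner's rule, reduced modulo a prime. The names `redundant_feature` and `with_redundancy` and the constants `base` and `modulus` are hypothetical choices made for illustration only.

```python
def redundant_feature(values, indices, base=31, modulus=65521):
    """Polynomial check value over the selected feature values (Horner's rule).

    Assumption: features are integers; base and modulus are arbitrary
    illustrative constants, not values taken from the patent.
    """
    acc = 0
    for i in indices:
        acc = (acc * base + int(values[i])) % modulus
    return acc


def with_redundancy(values, indices):
    """Return the feature vector with the redundant feature appended."""
    return list(values) + [redundant_feature(values, indices)]


# A benign vector and an attacker's vector that imitates only the features
# the attacker knows about (positions 0 and 1) but differs at position 2.
benign = with_redundancy([3, 1, 4], indices=[0, 1, 2])
mimic = with_redundancy([3, 1, 9], indices=[0, 1, 2])
print(benign[-1] != mimic[-1])  # the appended feature exposes the difference
```

Because the appended value depends on every covered feature, imitating only part of a benign feature vector leaves the redundant feature far from its benign value.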
In addition, in the present application, the JavaScript code may be detected directly, including:
replacing the code content according to rules;
extracting features by using an n-gram method, and selecting and generating the features according to the guiding principles;
training the classification model with a supervised machine learning algorithm; for malicious JavaScript code that can be effectively extracted, the accuracy can approach 100%. A classifier using content-based features can effectively distinguish the types of vulnerabilities used.
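The replacement and n-gram steps listed above can be sketched as follows. The concrete replacement rules are not given in this passage, so the character classes below (lowercase, uppercase, digit, whitespace, a small set of retained punctuation, and a catch-all) are assumptions, chosen so that the number of mapping characters stays small. The names `map_char`, `normalize`, `ngrams`, `select_features` and `vectorize` are hypothetical, and the supervised training step is omitted.

```python
from collections import Counter

# Punctuation kept as-is; everything else collapses to a class symbol.
KEPT = "(){}[];=.,+%"


def map_char(ch):
    """Map one character to its class symbol (assumed rule set)."""
    if ch.islower():
        return "a"
    if ch.isupper():
        return "A"
    if ch.isdigit():
        return "0"
    if ch.isspace():
        return " "
    return ch if ch in KEPT else "?"


def normalize(code):
    """Replace every character of the code by its mapping character."""
    return "".join(map_char(ch) for ch in code)


def ngrams(text, n=3):
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_features(docs, n=3, top_k=100):
    """Keep the top_k most frequent n-grams over the normalized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(normalize(doc), n))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(doc, vocab, n=3):
    """Count vector of the document over the selected n-grams."""
    counts = Counter(ngrams(normalize(doc), n))
    return [counts[gram] for gram in vocab]
```

The vectors produced by `vectorize` would then be passed to whichever supervised learner is chosen. Collapsing characters into classes makes the n-gram vocabulary small and makes renaming identifiers or changing constants less effective as an evasion.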
Therefore, by providing three guiding principles of feature selection, making full use of structural features and content features and of the local correlation of the horizontal and vertical relationships among different types of features, and training classifiers based on a two-stage machine learning algorithm and on JavaScript code, the robustness of the classifier is enhanced while high accuracy is maintained.
An electronic device according to an embodiment of the present invention includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method for detecting a malicious PDF document as described above.
The electronic device of the embodiment of the invention adopts the above method for detecting a malicious PDF document, which comprises the following steps:
as shown in Fig. 7, in the first stage, feature extraction is performed by using an n-gram method, feature selection and feature generation are performed according to the guiding principles, and the structural features and the clustering features are used as the training input of the first stage;
training an anomaly detection model based on a multi-center clustering method for each type of object;
merging the tree structures according to the similarity of the types and generating a structure matrix;
in the second stage, the trained model above is first applied to the input data set, and its output is then combined with the structure matrix to form an extended structure matrix, which is used as input to the CNN. The CNN structure is able to capture local features and preserve the connections between different types according to the extended matrix; the specific CNN structure used is shown in Fig. 8. That is, the classification results of the training data from the first-stage model are combined with the structure matrix into an extended matrix, which is used as the input for training the convolutional neural network model.
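The data flow of the two stages can be sketched without the CNN itself. The sketch assumes that the structure matrix counts parent-to-child edges between object types, that the first-stage output for a node is its distance to the nearest cluster center of its type, and that the extended matrix is formed by appending one score column. The passage does not fix any of these details, so all three are assumptions, as are the function names and the example type names.

```python
import math


def structure_matrix(edges, node_type, types):
    """Merge a tree by object type: entry [i][j] counts edges from type i to type j."""
    index = {t: k for k, t in enumerate(types)}
    matrix = [[0] * len(types) for _ in types]
    for parent, child in edges:
        matrix[index[node_type[parent]]][index[node_type[child]]] += 1
    return matrix


def anomaly_score(vector, centers):
    """Distance from a feature vector to the nearest cluster center of its type."""
    return min(math.dist(vector, center) for center in centers)


def stage_one_scores(features, node_type, centers):
    """Worst (largest) anomaly score observed for each object type."""
    scores = {}
    for node, vector in features.items():
        t = node_type[node]
        if t in centers:
            score = anomaly_score(vector, centers[t])
            scores[t] = max(scores.get(t, 0.0), score)
    return scores


def extended_matrix(matrix, types, scores):
    """Append one column with the stage-one score of each type (0.0 if none)."""
    return [row + [scores.get(t, 0.0)] for row, t in zip(matrix, types)]


# Hypothetical four-node document: node ids map to object types.
types = ["Catalog", "Pages", "JavaScript"]
node_type = {0: "Catalog", 1: "Pages", 2: "JavaScript", 3: "JavaScript"}
edges = [(0, 1), (0, 2), (1, 3)]
centers = {"JavaScript": [[0.0, 0.0], [10.0, 10.0]]}  # assumed trained centers
features = {2: [3.0, 4.0], 3: [10.0, 10.0]}

scores = stage_one_scores(features, node_type, centers)
cnn_input = extended_matrix(structure_matrix(edges, node_type, types), types, scores)
```

The resulting `cnn_input` has a fixed shape for a fixed list of types, which is what allows documents with different tree sizes to be fed to the same convolutional network.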
With this method, the accuracy of identifying general PDF documents can be improved. To make fuller use of the content feature information, a classifier is trained on a data set labeled with vulnerability numbers to classify the vulnerabilities exploited in malicious files. Only a small fraction of the samples is used for training, and the classifier is tested on the entire data set. The test results show that the classifier's identification accuracy exceeds 97%.
According to the computer-readable storage medium of the embodiment of the invention, the computer-readable storage medium stores an implementation program of information transmission, and the program realizes the steps of the above-mentioned malicious PDF document detection method when executed by the processor.
According to the computer-readable storage medium of the embodiment of the invention, by executing the detection method of the malicious PDF document, two-stage detection is carried out based on the structure and the content of the PDF document, and the accuracy and the reliability of malicious PDF detection are effectively improved.
While the invention has been described in connection with specific embodiments thereof, it is to be understood that it is intended by the appended drawings and description that the invention may be embodied in other specific forms without departing from the spirit or scope of the invention.

Claims (10)

1. A method for detecting a malicious PDF document is characterized by comprising the following steps:
extracting a tree structure of a PDF document, and generating a structure matrix based on the tree structure;
performing feature extraction on the object content of the nodes of the tree structure to obtain feature data;
inputting the feature data into a pre-constructed detection model for processing to obtain a classification result;
combining the classification result and the structure matrix into an extended matrix and inputting the extended matrix into a convolutional neural network;
and the convolutional neural network outputs the detection result of the PDF document.
2. The method for detecting the malicious PDF document according to claim 1, wherein the extracting a tree structure of the PDF document and generating a structure matrix based on the tree structure comprises:
extracting a tree structure of the PDF document, and generating an adjacency matrix based on the tree structure;
and classifying the nodes according to the object types of the nodes, and converting the adjacency matrix into the structure matrix based on the classification result.
3. The method for detecting a malicious PDF document according to claim 1, wherein said extracting the features of the object contents of the nodes of the tree structure to obtain feature data comprises:
replacing the object content of the node according to a preset rule to obtain replacement data;
extracting the characteristics of the replacement data by adopting a language model;
selecting the features based on their frequency of occurrence to generate the feature data.
4. The method for detecting the malicious PDF document according to claim 3, wherein the step of replacing the object content of the node according to a preset rule to obtain the replacement data comprises:
classifying a character set of the contents of the node;
and establishing mapping characters corresponding to each type of character set, and replacing characters in each type of character set by the mapping characters.
5. The method for detecting the malicious PDF document according to claim 4, wherein the number of the mapping characters is less than 30.
6. The method according to claim 3, wherein the language model is an n-gram model.
7. The method for detecting the malicious PDF document according to claim 1, wherein the feature data is input into a detection model which is constructed in advance and processed to obtain a classification result, and the method comprises the following steps:
determining the clustering number and the clustering center by adopting a clustering method;
calculating the distance between the feature data and the clustering center of the corresponding category;
and obtaining the classification result according to the distance.
8. The method for detecting the malicious PDF document according to claim 1, wherein the method for generating the pre-constructed detection model comprises the following steps:
classifying the node based on an object type of the node;
and training each type of node based on a multi-center clustering method to obtain the detection model.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an information transfer implementing program, which when executed by a processor implements the steps of the method for detecting a malicious PDF document according to any one of claims 1 to 8.
10. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of detecting a malicious PDF document according to any one of claims 1 to 8.
CN201910655086.4A 2019-07-19 2019-07-19 Malicious PDF document detection method and electronic equipment Active CN112241530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910655086.4A CN112241530B (en) 2019-07-19 2019-07-19 Malicious PDF document detection method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910655086.4A CN112241530B (en) 2019-07-19 2019-07-19 Malicious PDF document detection method and electronic equipment

Publications (2)

Publication Number Publication Date
CN112241530A true CN112241530A (en) 2021-01-19
CN112241530B CN112241530B (en) 2023-05-30

Family

ID=74167470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910655086.4A Active CN112241530B (en) 2019-07-19 2019-07-19 Malicious PDF document detection method and electronic equipment

Country Status (1)

Country Link
CN (1) CN112241530B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883376A (en) * 2021-02-22 2021-06-01 深信服科技股份有限公司 File processing method, device, equipment and computer readable storage medium
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN113688386A (en) * 2021-07-26 2021-11-23 中国人民解放军陆军工程大学 Graph structure-based intelligent detection method and system for malicious PDF (Portable document Format) document
CN113704757A (en) * 2021-07-26 2021-11-26 中国人民解放军陆军工程大学 Feature aggregation-based intelligent detection method and system for malicious PDF (Portable document Format) documents
CN113886438A (en) * 2021-12-08 2022-01-04 济宁景泽信息科技有限公司 Artificial intelligence-based achievement transfer transformation data screening method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827101A (en) * 2010-04-20 2010-09-08 中国人民解放军理工大学指挥自动化学院 Information asset protection method based on credible isolated operating environment
CN108881101A (en) * 2017-05-08 2018-11-23 腾讯科技(深圳)有限公司 A kind of cross site scripting loophole defence method, device and client based on DOM Document Object Model
CN108920953A (en) * 2018-06-16 2018-11-30 温州职业技术学院 A kind of malware detection method and system
CN108985064A (en) * 2018-07-16 2018-12-11 中国人民解放军战略支援部队信息工程大学 A kind of method and device identifying malice document
CN108985060A (en) * 2018-07-04 2018-12-11 中共中央办公厅电子科技学院 A kind of extensive Android Malware automated detection system and method
CN109190371A (en) * 2018-07-09 2019-01-11 四川大学 A kind of the Android malware detection method and technology of Behavior-based control figure
US20190026466A1 (en) * 2017-07-24 2019-01-24 Crowdstrike, Inc. Malware detection using local computational models
US20190065744A1 (en) * 2017-08-29 2019-02-28 Target Brands, Inc. Computer security system with malicious script document identification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827101A (en) * 2010-04-20 2010-09-08 中国人民解放军理工大学指挥自动化学院 Information asset protection method based on credible isolated operating environment
CN108881101A (en) * 2017-05-08 2018-11-23 腾讯科技(深圳)有限公司 A kind of cross site scripting loophole defence method, device and client based on DOM Document Object Model
US20190026466A1 (en) * 2017-07-24 2019-01-24 Crowdstrike, Inc. Malware detection using local computational models
US20190065744A1 (en) * 2017-08-29 2019-02-28 Target Brands, Inc. Computer security system with malicious script document identification
CN108920953A (en) * 2018-06-16 2018-11-30 温州职业技术学院 A kind of malware detection method and system
CN108985060A (en) * 2018-07-04 2018-12-11 中共中央办公厅电子科技学院 A kind of extensive Android Malware automated detection system and method
CN109190371A (en) * 2018-07-09 2019-01-11 四川大学 A kind of the Android malware detection method and technology of Behavior-based control figure
CN108985064A (en) * 2018-07-16 2018-12-11 中国人民解放军战略支援部队信息工程大学 A kind of method and device identifying malice document

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Satia Herfert et al.: "Automatically reducing tree-structured test inputs", 《2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)》 *
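Wen Weiping et al.: "Design and implementation of a machine-learning-based malicious document identification tool", 《Netinfo Security》 *
Du Xuehui et al.: "Malicious PDF document detection based on mixed features", 《Journal on Communications》 *
Chen Liang et al.: "Malicious PDF document detection based on structural paths", 《Computer Science》 *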

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883376A (en) * 2021-02-22 2021-06-01 深信服科技股份有限公司 File processing method, device, equipment and computer readable storage medium
CN113378156A (en) * 2021-07-01 2021-09-10 上海观安信息技术股份有限公司 Malicious file detection method and system based on API
CN113378156B (en) * 2021-07-01 2023-07-11 上海观安信息技术股份有限公司 API-based malicious file detection method and system
CN113688386A (en) * 2021-07-26 2021-11-23 中国人民解放军陆军工程大学 Graph structure-based intelligent detection method and system for malicious PDF (Portable document Format) document
CN113704757A (en) * 2021-07-26 2021-11-26 中国人民解放军陆军工程大学 Feature aggregation-based intelligent detection method and system for malicious PDF (Portable document Format) documents
CN113886438A (en) * 2021-12-08 2022-01-04 济宁景泽信息科技有限公司 Artificial intelligence-based achievement transfer transformation data screening method
CN113886438B (en) * 2021-12-08 2022-03-15 济宁景泽信息科技有限公司 Artificial intelligence-based achievement transfer transformation data screening method

Also Published As

Publication number Publication date
CN112241530B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Liu et al. Automatic malware classification and new malware detection using machine learning
CN112241530B (en) Malicious PDF document detection method and electronic equipment
Venkatraman et al. A hybrid deep learning image-based analysis for effective malware detection
CN110765458B (en) Malicious software image format detection method and device based on deep learning
Serpen et al. Host-based misuse intrusion detection using PCA feature extraction and kNN classification algorithms
Singh et al. Malware classification using image representation
Wang et al. Abstracting massive data for lightweight intrusion detection in computer networks
Vinayakumar et al. Evaluating deep learning approaches to characterize and classify the DGAs at scale
Horng et al. A novel intrusion detection system based on hierarchical clustering and support vector machines
JP5183483B2 (en) Method and apparatus used for automatic comparison of data strings
Xue et al. Malware classification using probability scoring and machine learning
Zhao et al. A review of computer vision methods in network security
US20060026675A1 (en) Detection of malicious computer executables
CN112329012B (en) Detection method for malicious PDF document containing JavaScript and electronic device
Khan et al. Identifying generic features for malicious url detection system
Kakisim et al. Sequential opcode embedding-based malware detection method
Yan et al. Automatic malware classification via PRICoLBP
Hwang et al. Semi-supervised based unknown attack detection in EDR environment
Riera et al. Prevention and fighting against web attacks through anomaly detection technology. A systematic review
He et al. Detection of Malicious PDF Files Using a Two‐Stage Machine Learning Algorithm
Liu et al. Fewm-hgcl: Few-shot malware variants detection via heterogeneous graph contrastive learning
Tsai et al. PowerDP: de-obfuscating and profiling malicious PowerShell commands with multi-label classifiers
Pevny et al. Nested multiple instance learning in modelling of HTTP network traffic
Shrivastava et al. Adalward: a deep-learning framework for multi-class malicious webpage detection
Patil et al. Learning to Detect Phishing Web Pages Using Lexical and String Complexity Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant