CN113987500A - Malicious PDF document detection method and device and electronic equipment - Google Patents

Info

Publication number
CN113987500A
Authority
CN
China
Prior art keywords
keywords
feature
malicious
pdf document
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111328921.7A
Other languages
Chinese (zh)
Inventor
徐晓
薛智慧
余小军
黄娜
李建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111328921.7A priority Critical patent/CN113987500A/en
Publication of CN113987500A publication Critical patent/CN113987500A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/565Static detection by checking file integrity

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a malicious PDF document detection method, a malicious PDF document detection device and electronic equipment, which are applied to the technical field of network security and solve the problem in the prior art that the detection result for a malicious PDF document is inaccurate because the information used for detection is incomplete. The malicious PDF document detection method comprises the following steps: obtaining A characteristic keywords of a plaintext object in a Portable Document Format (PDF) document; acquiring root node information of the PDF document; determining B characteristic keywords included in the stream object indicated by the root node information; identifying malicious characteristic keywords among M characteristic keywords, which comprise the A characteristic keywords and the B characteristic keywords, to obtain a plurality of malicious characteristic keywords; and determining the plurality of malicious characteristic keywords as training samples of a malicious PDF document recognition model.

Description

Malicious PDF document detection method and device and electronic equipment
Technical Field
The present disclosure relates to the field of network security technologies, and in particular, to a method and an apparatus for detecting a malicious PDF document, and an electronic device.
Background
With the rapid development of the internet and the increasing popularization of office automation, the Portable Document Format (PDF) has become an open standard for global electronic document distribution. Because PDF documents are highly practical and universally compatible, malicious code that uses PDF documents as a carrier spreads rapidly, so detecting and defending against documents containing malicious code has become an important goal in the field of computer security. In the prior art, a scanning tool is used to extract metadata from a PDF document for detection. However, this approach detects only at the direct-object level and does not examine the indirect objects contained in the PDF document, so the information used for detection is incomplete and the detection result for a malicious PDF document is inaccurate.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, the present disclosure provides a malicious PDF document detection method, device and electronic device.
In a first aspect, the present disclosure provides a method for detecting a malicious PDF document, including:
obtaining A characteristic keywords of a plaintext object in a portable document format PDF document;
acquiring root node information of a PDF document;
determining B characteristic keywords included in the stream object indicated by the root node information;
identifying malicious characteristic keywords among M characteristic keywords to obtain a plurality of malicious characteristic keywords, wherein the M characteristic keywords comprise: the A characteristic keywords and the B characteristic keywords;
and determining the plurality of malicious feature keywords as training samples of a malicious PDF document recognition model.
Optionally, identifying the malicious feature keywords for the M feature keywords to obtain a plurality of malicious feature keywords includes:
aiming at the M characteristic keywords, calculating the first occurrence frequency of each characteristic keyword in the PDF document;
determining a plurality of first feature keywords of which the occurrence times are greater than or equal to a preset time from the M feature keywords;
and circularly identifying the malicious characteristic keywords aiming at the first characteristic keywords until a plurality of malicious characteristic keywords are obtained.
Optionally, cyclically identifying the malicious feature keywords for the plurality of first feature keywords until the plurality of malicious feature keywords are obtained includes: cyclically executing the following steps for the plurality of first feature keywords to identify the malicious feature keywords, until the remaining feature keywords are the plurality of malicious feature keywords:
determining at least one feature group according to the associated values of any plurality of feature keywords in the plurality of first feature keywords, wherein each feature group comprises one feature keyword or a plurality of associated feature keywords;
taking the detection result as a dependent variable, taking T feature keywords in a target feature group as independent variables, and establishing a T element function according to the relation between the target feature group and the detection result, wherein the target feature group is the feature group with the largest number of feature keywords in at least one feature group;
obtaining a derivative of the target independent variable, and calculating the target occurrence frequency corresponding to the target characteristic keyword in the T characteristic keywords, wherein the target characteristic keyword is any one of the T characteristic keywords;
modifying the occurrence frequency of target feature keywords in the malicious PDF document into a second occurrence frequency, wherein the second occurrence frequency is different from the target occurrence frequency;
detecting whether the target characteristic keywords are malicious characteristic keywords or not;
and if the target characteristic keyword is not a malicious characteristic keyword, deleting the target characteristic keyword from the at least one characteristic group, and taking the remaining characteristic keywords as the plurality of first feature keywords.
Optionally, the dictionary of the stream object includes a decoding mode;
determining B feature keywords included in the stream object indicated by the root node information, including:
carrying out decryption or decompression processing on the stream object according to a decoding mode;
extracting indirect reference objects included in the stream object;
determining the B feature keywords from the indirect reference objects.
Optionally, the dictionary of the stream object includes the original number of the indirect reference objects;
determining B feature keywords included in the stream object indicated by the root node information, including:
determining the current number of indirect reference objects contained in the stream object indicated by the root node information;
and if the current number is equal to the original number, acquiring B feature keywords included in the indirect reference object.
Optionally, the obtaining of the root node information of the PDF document includes:
acquiring a file tail label from a PDF document;
positioning the position of the file tail from the PDF document according to the file tail label;
acquiring file tail information based on the position of the file tail;
and determining root node information from the file tail information.
Optionally, the obtaining of the root node information of the PDF document includes:
acquiring a start mark from a PDF document;
determining a field corresponding to the start mark;
and determining the root node information from the field corresponding to the start mark.
In a second aspect, the present disclosure provides a malicious PDF document detection apparatus, including:
the acquisition module is used for acquiring A characteristic keywords of a plaintext object in a portable document format PDF document; acquiring root node information of a PDF document; determining B characteristic keywords included in the stream object indicated by the root node information;
the identification module is used for identifying malicious feature keywords among M feature keywords to obtain a plurality of malicious feature keywords, wherein the M feature keywords comprise: the A characteristic keywords and the B characteristic keywords;
and the training module is used for determining the plurality of malicious characteristic keywords as training samples of the malicious PDF document identification model.
Optionally, the identification module is specifically configured to calculate, for the M feature keywords, a first occurrence number of each feature keyword in the PDF document;
determining a plurality of first feature keywords of which the occurrence times are greater than or equal to a preset time from the M feature keywords;
and circularly identifying the malicious characteristic keywords aiming at the first characteristic keywords until a plurality of malicious characteristic keywords are obtained.
Optionally, the identifying module is specifically configured to cyclically execute the following steps for the plurality of first feature keywords, so as to identify the malicious feature keywords until the obtained remaining feature keywords are the plurality of malicious feature keywords:
determining at least one feature group according to the associated values of any plurality of feature keywords in the plurality of first feature keywords, wherein each feature group comprises one feature keyword or a plurality of associated feature keywords;
taking the detection result as a dependent variable, taking T feature keywords in a target feature group as independent variables, and establishing a T element function according to the relation between the target feature group and the detection result, wherein the target feature group is the feature group with the largest number of feature keywords in at least one feature group;
obtaining a derivative of the target independent variable, and calculating the target occurrence frequency corresponding to the target characteristic keyword in the T characteristic keywords, wherein the target characteristic keyword is any one of the T characteristic keywords;
modifying the occurrence frequency of target feature keywords in the malicious PDF document into a second occurrence frequency, wherein the second occurrence frequency is different from the target occurrence frequency;
detecting whether the target characteristic keywords are malicious characteristic keywords or not;
and if the target characteristic keyword is not a malicious characteristic keyword, deleting the target characteristic keyword from the at least one characteristic group, and taking the remaining characteristic keywords as the plurality of first feature keywords.
Optionally, the dictionary of the stream object includes a decoding mode;
the acquisition module is specifically used for carrying out decryption or decompression processing on the stream object according to a decoding mode;
extracting indirect reference objects included in the stream object;
determining the B feature keywords from the indirect reference objects.
Optionally, the dictionary of the stream object includes the original number of the indirect reference objects;
the acquisition module is specifically used for determining the current number of indirect reference objects contained in the stream object indicated by the root node information;
and if the current number is equal to the original number, acquiring B feature keywords included in the indirect reference object.
Optionally, the obtaining module is specifically configured to obtain a file end tag from a PDF document;
positioning the position of the file tail from the PDF document according to the file tail label;
acquiring file tail information based on the position of the file tail;
and determining root node information from the file tail information.
Optionally, the obtaining module is specifically configured to obtain a start mark from a PDF document;
determining a field corresponding to the start mark;
and determining the root node information from the field corresponding to the start mark.
In a third aspect, the present disclosure provides an electronic device comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing a malicious PDF document detection method according to the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium comprising: the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements a malicious PDF document detection method according to the first aspect.
Compared with the prior art, the technical scheme provided by the embodiments of the disclosure has the following advantages: by obtaining both the feature keywords contained in the stream object and the feature keywords contained in the plaintext object of the PDF document, the method extracts all feature keywords in the PDF document, including the compressed or encrypted feature keywords contained in the stream object, in order to identify the malicious feature keywords. As a result, the parameters used for training the malicious PDF document identification model are more complete, the information used for malicious PDF document detection is more complete, and the accuracy of malicious PDF document detection is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a first schematic diagram illustrating a malicious PDF document detection method according to an embodiment of the present disclosure;
Fig. 2 is a second schematic diagram illustrating a malicious PDF document detection method according to an embodiment of the present disclosure;
Fig. 3 is a structural diagram of a malicious PDF document detection device according to an embodiment of the present disclosure;
Fig. 4 is a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the following will briefly introduce terms used in the description of the embodiments or the prior art:
PDF object: PDF objects include direct objects and indirect objects, wherein the direct objects include: boolean, numeric, string, name, array and dictionary.
Indirect object: includes a content stream, a stream object (object stream) and an indirectly referenced object (object). A stream object consists of a dictionary followed immediately by the keywords stream and endstream, with a sequence of bytes between the two keywords; a stream object may store one or more PDF objects in encrypted or compressed form.
File trailer (trailer): part of the physical structure of a PDF document; it tells an application reading the PDF document how to find the cross-reference table and special objects. The dictionary in the trailer includes the Root key-value pair.
Root: the root node, an object number indicating the logical entry point of the file; e.g., << /Root 1 0 R >> indicates that parsing starts from the first stream object.
In the prior art, a scanning tool is used for extracting metadata in a PDF document for detection, but the detection of the metadata by the method is a detection on a direct object level, and indirect objects contained in the PDF document are not considered to be detected, so that the detection information is incomplete, and the detection result of a malicious PDF document is inaccurate.
In order to solve the above problems, according to the present disclosure, by obtaining feature keywords included in a stream object and feature keywords included in a plaintext object in a PDF document, all feature keywords included in the compressed or encrypted stream object in the PDF document are extracted to identify malicious feature keywords, so that parameters used for training a malicious PDF document identification model are more complete, information used for malicious PDF document detection is more complete, and accuracy of malicious PDF document detection is improved.
The malicious PDF document detection method described in the embodiment of the disclosure can be applied to a malicious PDF document detection device or an electronic device, wherein the malicious PDF document detection device can be a functional module and/or a functional entity which can realize the malicious PDF document detection method in the electronic device.
The electronic device may be, for example, a smart phone (e.g., an Android phone, an iOS phone or a Windows Phone phone), a tablet computer, a palmtop computer, a notebook computer, a Mobile Internet Device (MID) or a wearable device. These examples are illustrative rather than exhaustive; the electronic device includes, but is not limited to, the above devices.
Fig. 1 is a method for detecting a malicious PDF document in the embodiment of the present disclosure, where the method includes:
S110, obtaining A characteristic keywords of a plaintext object in the Portable Document Format (PDF) document.
In some embodiments, the plaintext object is an object in the PDF document that can be read directly because it is neither encrypted nor compressed, such as boolean, numeric, string, name, array and dictionary data in the PDF document. The feature keywords include: JavaScript, JS, AcroForm, XFA, Launch, GoTo, OpenAction, AA, EmbeddedFile and ObjStm. These are special keywords that may be exploited in an attack; the feature keywords include, but are not limited to, those listed above, and the present disclosure does not limit them.
In some embodiments, the feature keywords further include conventional keywords in the PDF document. Because the possibility that conventional keywords are exploited in an attack cannot be completely ruled out, the conventional keywords in the PDF document also need to be extracted as feature keywords to ensure that the information used for malicious PDF document detection is complete.
The characteristic keywords of the plaintext object in the PDF document are obtained and are used as part of characteristics detected by a malicious PDF document to be identified, wherein the characteristic keywords comprise special keywords which are possible to be attacked and conventional keywords.
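As a minimal sketch of S110, the special keywords can be counted by scanning the unencoded bytes of a document for PDF name tokens. The function name `count_plaintext_keywords`, the constant names and the sample bytes are hypothetical; the keyword spellings follow the usual PDF names, and a real extractor would also need to handle name escapes such as `#xx`, which this sketch ignores.

```python
import re
from collections import Counter

# Special keywords listed in the text (hypothetical constant name).
SPECIAL_KEYWORDS = ("JavaScript", "JS", "AcroForm", "XFA", "Launch",
                    "GoTo", "OpenAction", "AA", "EmbeddedFile", "ObjStm")

# A PDF name token: a slash followed by characters that are neither
# whitespace nor PDF delimiters.
NAME_TOKEN = re.compile(rb"/([^\s/<>\[\]()%{}]+)")


def count_plaintext_keywords(raw: bytes, keywords=SPECIAL_KEYWORDS) -> Counter:
    """Count occurrences of the wanted name tokens in unencoded PDF bytes."""
    wanted = set(keywords)
    counts = Counter()
    for match in NAME_TOKEN.finditer(raw):
        name = match.group(1).decode("latin-1")
        if name in wanted:
            counts[name] += 1
    return counts


# Assumed sample: a catalog with an open action that runs JavaScript.
sample = (b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n")
plaintext_counts = count_plaintext_keywords(sample)
```

Matching whole name tokens rather than substrings keeps `JS` from being counted inside `JavaScript`.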
And S120, acquiring the root node information of the PDF document.
Since the root node is a connection point of the physical structure and the logical structure of the PDF document and is also a starting point object indicating PDF document parsing, the following will discuss obtaining root node information of the PDF document from the physical structure and the logical structure, respectively:
(1) physical structure
In terms of the physical structure of the PDF document, in the process of acquiring the root node information of the PDF document, optionally, acquiring a file end tag from the PDF document; positioning the position of the file tail from the PDF document according to the file tail label; acquiring the file tail information based on the position of the file tail; and determining root node information from the file tail information.
In some embodiments, the file end tag "trailer" is first obtained from the PDF document, and the position of the file end in the PDF document is located according to that tag. The dictionary in the file end is then obtained, and the key-value pair whose key is Root is located to determine the root node information; the stream object is then determined according to the root node information.
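The physical-structure lookup can be sketched as follows, assuming the file end is stored uncompressed and the Root entry has the usual indirect-reference form `/Root <object number> <generation number> R`. The function name `find_root_reference` is hypothetical, and using the last "trailer" tag in the file (to cope with appended updates) is an assumption of this sketch rather than something the text specifies.

```python
import re

# Indirect reference stored under the Root key, e.g. "/Root 1 0 R".
ROOT_REF = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")


def find_root_reference(raw: bytes):
    """Return (object number, generation number) of the root node, or None."""
    pos = raw.rfind(b"trailer")          # locate the file end via its tag
    if pos == -1:
        return None
    match = ROOT_REF.search(raw, pos)    # read the Root pair after the tag
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Assumed sample: the tail of a small PDF file.
tail = b"xref\n0 4\ntrailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n120\n%%EOF\n"
root = find_root_reference(tail)
```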
(2) Logic structure
From the aspect of a logic structure of a PDF document, in the process of acquiring Root information of the PDF document, optionally, acquiring a start mark from the PDF document; determining a field corresponding to the starting mark; and determining root node information from a field corresponding to the start mark.
In some embodiments, a start flag XRef in the PDF document is first obtained, root node information is determined from a key value pair < Type, XRef > to which XRef belongs, and further, a stream object is determined according to the root node information. It should be noted that the above embodiment is implemented without locating the file end in the PDF document.
In some embodiments, the logical structure of the PDF document is constructed as a directed acyclic graph G(V, E), where V is the set of vertices in the graph and E is the set of directed edges connecting the vertices. In the resulting directed acyclic graph, traversal starts from the object referenced by the root node; when a node without a stream object is reached, traversal returns to the previous layer and continues with the other nodes. Traversed nodes are marked so that each node is traversed exactly once, and the process is repeated until all node information has been traversed.
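The traversal described here resembles a depth-first search with a visited set. The sketch below assumes the reference graph has already been built as a mapping from an object number to the object numbers it references; the name `traverse_objects` and the sample graph are hypothetical.

```python
def traverse_objects(references: dict, root: int) -> list:
    """Visit every object reachable from the root exactly once, depth first."""
    visited = set()      # marks nodes that have already been traversed
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push children in reverse so they are visited in their listed order.
        for child in reversed(references.get(node, [])):
            if child not in visited:
                stack.append(child)
    return order


# Assumed sample graph: object 1 is the root; objects 2 and 3 both reference 4.
graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
visit_order = traverse_objects(graph, 1)
```

Object 4 is reached through object 2 first, so the later reference from object 3 does not trigger a second visit.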
S130, determining B characteristic keywords included in the stream object indicated by the root node information.
In some embodiments, after the stream object indicated by the root node information is determined, the decoding manner of the stream object is determined according to the dictionary of the stream object, where the decoding manners include: ASCIIHex, ASCII85, the string-table compression algorithm (Lempel-Ziv-Welch encoding, LZW), run-length encoding (RunLength), CCITT Group 3 and CCITT Group 4 encoding (the fax encodings of the International Telegraph and Telephone Consultative Committee), the image compression algorithm of the Joint Photographic Experts Group (JPEG), and Flate, where Flate is the compression algorithm used by the well-known ZIP format. The decoding manners for the stream object include, but are not limited to, those above, and the present disclosure does not limit them.
Further, the stream object is decoded in the manner determined from the dictionary contained in the stream object, where decoding means decryption or decompression.
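A minimal sketch of the decoding step, assuming the decoding manner has already been read from the stream dictionary as one of the standard PDF filter names. Only three filters are covered, using the Python standard library; the function name `decode_stream` is hypothetical, and decryption, LZW, run-length, CCITT and JPEG decoding are left out.

```python
import base64
import binascii
import zlib


def decode_stream(data: bytes, filter_name: str) -> bytes:
    """Decode stream bytes according to a filter name from the dictionary."""
    if filter_name == "FlateDecode":
        return zlib.decompress(data)
    if filter_name == "ASCIIHexDecode":
        # Drop whitespace and the ">" end marker; pad an odd final digit with 0.
        digits = b"".join(data.split()).rstrip(b">")
        if len(digits) % 2:
            digits += b"0"
        return binascii.unhexlify(digits)
    if filter_name == "ASCII85Decode":
        body = data.strip()
        if body.endswith(b"~>"):          # strip the end-of-data marker
            body = body[:-2]
        return base64.a85decode(body)
    raise ValueError(f"unsupported filter: {filter_name}")


# Usage with assumed sample data.
flate_plain = decode_stream(zlib.compress(b"/JS (x)"), "FlateDecode")
hex_plain = decode_stream(b"2F 4A 53>", "ASCIIHexDecode")
```

Raising on an unknown filter, instead of returning the raw bytes, keeps undecoded data from being mistaken for plaintext.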
In some embodiments, after decoding the stream Object according to the decoding manner, an indirect reference Object (Object) included in the stream Object is extracted, wherein the Object is a direct Object or an indirect Object.
If the Object is an indirect Object, acquiring a dictionary of the Object, determining a decoding mode from the dictionary, decrypting or decompressing the Object according to the decoding mode to acquire content information of the Object, and further extracting characteristic keywords of the decrypted or decompressed content; and if the Object is a direct Object, directly reading the Object and acquiring the characteristic keywords in the Object.
After extracting all content information of the stream Object, the number of objects in the stream Object needs to be verified, and the integrity of the extracted stream Object is verified according to the key value pair information in the extracted dictionary, optionally, the original number of indirect reference objects is included in the dictionary of the stream Object; determining the current number of indirect reference objects contained in the stream object indicated by the root node information; and if the current number is equal to the original number, acquiring B feature keywords included in the indirect reference object.
In some embodiments, the dictionary of the stream object includes the original number N of indirect reference objects that the stream object compresses or encrypts. After the stream object is decrypted or decompressed, the current number N1 of indirect reference objects is determined. If N1 is equal to N, the data in the stream object is determined to be complete after decryption or decompression, and the feature keywords included in the indirect reference objects are extracted. If N1 is greater than N, the source of the indirect reference objects beyond the original number cannot be determined; if N1 is less than N, the data of the indirect reference objects is incomplete. It should be noted that when the current number of indirect reference objects after decoding is not equal to the original number, no further processing is performed, which is not described herein.
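The count check can be sketched for a decoded object stream, assuming the standard layout in which the data begins with pairs of integers (object number, byte offset) and the dictionary supplies the declared count (N) and the offset of the first object (First). The function name `extract_object_stream` and the sample data are hypothetical.

```python
def extract_object_stream(decoded: bytes, n_declared: int, first: int):
    """Split a decoded object stream into objects if the count matches N.

    Returns {object number: object bytes}, or None when the current number
    of objects differs from the declared one (no further processing).
    """
    tokens = decoded[:first].split()
    if len(tokens) % 2:
        return None
    numbers = [int(token) for token in tokens]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    if len(pairs) != n_declared:          # N1 != N: treat data as unusable
        return None
    body = decoded[first:]
    objects = {}
    for index, (number, offset) in enumerate(pairs):
        end = pairs[index + 1][1] if index + 1 < len(pairs) else len(body)
        objects[number] = body[offset:end].strip()
    return objects


# Assumed sample: two objects, numbered 11 and 12, packed into one stream.
parts = [b"<< /S /Launch >>", b"<< /JS (x) >>"]
header = b"11 0 12 %d " % len(parts[0])
decoded = header + b"".join(parts)
objects = extract_object_stream(decoded, 2, len(header))
```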
After the number of the indirect reference objects in the stream object is verified, the stream object is analyzed, the indirect reference objects are traversed, a dictionary, a list and a byte stream of the indirect reference objects are further extracted, all key value pair information, list content information and information in the byte stream in the dictionary are extracted, so that complete stream object information is obtained, and feature keywords are further extracted from the information.
After the feature keywords are acquired, the number of occurrences of each feature keyword in the PDF document is calculated and recorded using a dictionary structure. Illustratively, the special keyword JavaScript appears 3 times, and the regular keyword Names appears 2 times.
By extracting the feature keywords in the compressed or encrypted stream object in the PDF document, the extraction of all the feature keywords in the PDF document is realized, so that the information for detecting the malicious PDF document is more complete, and the accuracy is improved.
S140, recognizing the malicious feature keywords aiming at the M feature keywords to obtain a plurality of malicious feature keywords.
Wherein the M feature keywords comprise the A feature keywords and the B feature keywords. In the present disclosure, both the A feature keywords of the plaintext object in the PDF document and the B feature keywords of the compressed or encrypted stream object in the PDF document are obtained, so that the data used for identifying malicious feature keywords is more complete.
In some embodiments, in the process of identifying M feature keywords, the number of occurrences of each feature keyword in the PDF document is calculated for the M feature keywords; determining a plurality of first feature keywords of which the occurrence times are greater than or equal to a preset time from the M feature keywords; and circularly identifying the malicious characteristic keywords aiming at the first characteristic keywords until a plurality of malicious characteristic keywords are obtained.
The preset number of times is a preset threshold on the number of occurrences of a feature keyword. A feature keyword whose number of occurrences is greater than or equal to the preset number is considered valuable and can be used for detecting malicious PDF documents; a feature keyword whose number of occurrences is less than the preset number is not processed by the method.
Illustratively, if JavaScript occurs 3 times, JS occurs 3 times, Names occurs 2 times, and the preset number of times is 3, the feature keywords JavaScript and JS are determined to be valuable and can be used for detecting malicious PDF documents.
Further, a feature list is generated to store the valuable feature keywords, as shown in Table 1.
Feature keyword    Number of occurrences
JavaScript         3
JS                 3
TABLE 1
By setting the threshold value of the occurrence times, the feature keywords which can be used for detecting the malicious PDF document are screened out, and the accuracy of detecting the malicious PDF document is improved.
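The screening by occurrence count can be sketched as follows, using the example values from the text. The counts for the plaintext object and the stream object are assumed to be kept in separate counters and merged first; the split between the two counters and the name `select_first_keywords` are hypothetical.

```python
from collections import Counter


def select_first_keywords(counts: Counter, preset_times: int) -> dict:
    """Keep the feature keywords whose occurrences reach the preset number."""
    return {word: n for word, n in counts.items() if n >= preset_times}


# Assumed split of the example counts between the two sources.
plaintext_counts = Counter({"JavaScript": 1, "JS": 3, "Names": 2})
stream_counts = Counter({"JavaScript": 2})
all_counts = plaintext_counts + stream_counts      # the M feature keywords

feature_list = select_first_keywords(all_counts, preset_times=3)
```

With a preset number of 3, the resulting feature list contains JavaScript and JS but not Names, which matches Table 1.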
In some embodiments, malicious feature keywords are identified cyclically for the plurality of first feature keywords by dynamic detection: a field containing a first feature keyword is run in a virtual environment, and whether the run results in an attack is judged; if so, the first feature keyword is determined to be a malicious feature keyword.
In the embodiment of the present disclosure, in the process of cyclically performing malicious feature keyword identification on the plurality of first feature keywords until a plurality of malicious feature keywords are obtained, the operations in S202 to S212 are performed:
S202, determining at least one feature group according to the association values of any plurality of feature keywords among the plurality of first feature keywords.
Each feature group comprises one feature keyword or a plurality of associated feature keywords. For example, "JavaScript" and "JS" are included in one feature group.
In some embodiments, the association value of any two feature keywords in the feature list is calculated by a correlation analysis algorithm, including but not limited to the Kendall algorithm; the present disclosure does not limit the algorithm used.
In some embodiments, a correlation coefficient is set as the threshold for the association value. If the calculated association value of two feature keywords is greater than or equal to the correlation coefficient, the features corresponding to the two feature keywords are considered to be associated with each other and are determined to be one feature group; in practical applications, a plurality of features that are cross-associated with one another are determined to be one feature group. If the calculated association value of two feature keywords is smaller than the correlation coefficient, the two features are not associated, and each of the two keywords forms its own feature group.
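The grouping step can be sketched as below. The disclosure does not state which data series are correlated, so this sketch assumes that each keyword is represented by its occurrence counts across a set of sample documents, computes the Kendall rank correlation (tau-b variant) for every pair, and merges keywords whose value reaches the correlation coefficient. The names `kendall_tau_b`, `group_keywords`, and the sample counts are hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b) of two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def group_keywords(series, coefficient):
    """Merge keywords whose pairwise tau-b is >= coefficient (union-find)."""
    parent = {kw: kw for kw in series}

    def find(kw):
        while parent[kw] != kw:
            parent[kw] = parent[parent[kw]]  # path compression
            kw = parent[kw]
        return kw

    for a, b in combinations(series, 2):
        if kendall_tau_b(series[a], series[b]) >= coefficient:
            parent[find(a)] = find(b)
    groups = {}
    for kw in series:
        groups.setdefault(find(kw), []).append(kw)
    return list(groups.values())

# Assumed data: occurrence counts of each keyword in four sample documents.
series = {
    "JavaScript": [3, 0, 5, 1],
    "JS": [3, 0, 4, 1],
    "Names": [2, 2, 1, 5],
}
groups = group_keywords(series, coefficient=0.8)
print(groups)
```

Because merging is transitive, keywords that are cross-associated through intermediate keywords end up in the same feature group, while an unassociated keyword forms a group of its own.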
S204, taking the detection result as the dependent variable and the T feature keywords in the target feature group as independent variables, and establishing a T-variable function according to the relation between the target feature group and the detection result.
The detection result is a decimal between 0 and 1.0; a detection result greater than or equal to 0.5 is determined to be malicious, and a detection result smaller than 0.5 is determined to be benign. The target feature group is the feature group with the largest number of feature keywords among the at least one feature group. That is, after the at least one feature group is determined according to the association values of any plurality of feature keywords among the plurality of first feature keywords, the maximum feature group, which has the largest number of feature keywords among the at least one feature group, is determined. The number of feature keywords in the maximum feature group is T, where T is a positive integer less than or equal to M.
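As a small illustration of the two rules just stated, the sketch below selects the target feature group and maps a detection result to a verdict. The function names are hypothetical, and the tie-breaking behaviour (the first group of maximal size wins) is an assumption, since the disclosure does not say how ties are resolved.

```python
def pick_target_group(groups):
    """Return the feature group with the most keywords (first wins on ties)."""
    return max(groups, key=len)

def verdict(detection_result):
    """Map a detection result in [0, 1.0] to a label using the 0.5 threshold."""
    return "malicious" if detection_result >= 0.5 else "benign"

groups = [["JavaScript", "JS"], ["Names"]]
target_group = pick_target_group(groups)
T = len(target_group)  # number of independent variables of the T-variable function
print(target_group, T, verdict(0.5), verdict(0.49))
```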
S206, taking the derivative with respect to the target independent variable, and calculating the target number of occurrences corresponding to the target feature keyword among the T feature keywords.
The target feature keyword is any one of the T feature keywords.
With the detection result as the dependent variable and the T feature keywords in the target feature group as independent variables, the values of the independent variables at which the dependent variable is minimal are determined by a gradient descent method.
It should be noted that the gradient of a function is the vector formed by the partial derivatives of the function with respect to all of its independent variables. The gradient descent method repeatedly moves the independent variables in the direction opposite to the gradient until the gradient is equal to zero or arbitrarily close to zero, at which point an extreme value of the function is reached; the lowest point of the dependent variable is thereby determined, and the corresponding values of the independent variables are obtained.
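The gradient descent procedure can be sketched as follows. The disclosure does not give the form of the T-variable function, so the quadratic `toy` function here is only an assumed stand-in with a known minimum at (2, 1); the step size, tolerance, and all function names are likewise hypothetical.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Approximate the partial derivatives of f at x by central differences."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def gradient_descent(f, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Step against the gradient until its squared norm falls below tol."""
    x = list(x0)
    for _ in range(max_iter):
        g = numerical_gradient(f, x)
        if sum(gi * gi for gi in g) < tol:
            break
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def toy(x):
    """Assumed stand-in for the T-variable function (T = 2)."""
    return (x[0] - 2.0) ** 2 + 0.5 * (x[1] - 1.0) ** 2

minimum = gradient_descent(toy, [0.0, 0.0])
print(minimum)
```

In the setting of the disclosure, the independent variables would be the occurrence counts of the T feature keywords and the dependent variable would be the detection result; the values found at the minimum correspond to the target number of occurrences.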
S208, modifying the number of occurrences of the target feature keyword in the malicious PDF document to a second number of occurrences.
The malicious PDF document is a malicious sample used for screening features. The second number of occurrences is different from the target number of occurrences.
S210, detecting whether the target feature keyword is a malicious feature keyword.
S212, if the target feature keyword is not a malicious feature keyword, deleting the target feature keyword from the at least one feature group, and taking the remaining feature keywords as the plurality of first feature keywords.
In some embodiments, the second number of occurrences is a critical value at which the malicious PDF document evades detection: after the number of occurrences of the target feature keyword is modified to the second number of occurrences, the malicious PDF document is disguised as a benign PDF document. If the target feature keyword is then detected not to be a malicious feature keyword, the malicious PDF document has successfully evaded detection, so the target feature keyword needs to be deleted, and the remaining feature keywords are taken as the plurality of first feature keywords, for which operations S202 to S212 are performed again. If the target feature keyword is detected to be a malicious feature keyword, it is determined to be a malicious feature keyword, and S150 is executed.
It should be noted that the finally determined plurality of malicious feature keywords at least include: JavaScript, JS, AcroForm, XFA, Launch, GoTo, URI, OpenAction, AA, EmbeddedFile, and ObjStm.
By constructing a function and taking derivatives, the keywords in the feature list are screened, and feature keywords that allow detection to be evaded once their number of occurrences is modified are removed, so as to obtain stable feature keywords for PDF document detection.
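The screening loop of S208 to S212 can be sketched in simplified form as below. This sketch omits the grouping and derivative steps and assumes two hypothetical callables: `detect`, which returns a detection result in [0, 1.0] for a dictionary of keyword counts, and `perturbed_count`, which supplies the second number of occurrences. The toy detector is invented purely for illustration and is not the detector of the disclosure.

```python
def screen_stable_keywords(keywords, detect, perturbed_count):
    """Drop keywords whose modified count lets the malicious sample evade detection.

    keywords: dict mapping feature keyword -> count in the malicious sample.
    """
    remaining = dict(keywords)
    changed = True
    while changed:  # repeat the pass whenever a keyword has been deleted
        changed = False
        for kw in list(remaining):
            trial = dict(remaining)
            trial[kw] = perturbed_count(kw, remaining[kw])
            if detect(trial) < 0.5:  # the modified sample looks benign
                del remaining[kw]    # unstable keyword: remove it
                changed = True
                break
    return remaining

def toy_detect(counts):
    """Hypothetical detector that over-relies on the count of Names."""
    if "Names" in counts and counts["Names"] < 2:
        return 0.2
    return 0.9

sample = {"JavaScript": 3, "JS": 3, "Names": 2}
stable = screen_stable_keywords(sample, toy_detect, lambda kw, n: 1)
print(stable)
```

Here lowering the count of Names is enough to flip the toy detector to benign, so Names is treated as unstable and removed, while JavaScript and JS survive the screening.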
S150, determining the plurality of malicious feature keywords as training samples of the malicious PDF document recognition model.
In some embodiments, the plurality of malicious feature keywords are vectorized: for each malicious feature keyword of the plurality of malicious feature keywords, the number of occurrences of the malicious feature keyword is mapped to a numerical value, and the number of malicious feature keywords is used as the dimensionality of the feature vector. Training is then performed based on a logistic regression method, with an overfitting-avoidance module added, to obtain a classifier through machine learning training, where the classifier is used for detecting malicious PDF documents.
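A minimal sketch of the vectorization and logistic regression training is given below. It assumes that the overfitting-avoidance component can be approximated by an L2 penalty on the weights, which the disclosure does not confirm; the keyword subset, the four training samples, and the hyperparameters are all invented for illustration.

```python
import math

KEYWORDS = ["JavaScript", "JS", "OpenAction"]  # assumed subset of the keyword list

def vectorize(counts):
    """Map keyword counts to a vector with one dimension per keyword."""
    return [float(counts.get(kw, 0)) for kw in KEYWORDS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.1, l2=0.01, epochs=2000):
    """Batch gradient descent on the logistic loss with an L2 weight penalty."""
    w, b, n = [0.0] * len(KEYWORDS), 0.0, len(samples)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, counts):
    """Return a detection result in (0, 1); >= 0.5 is treated as malicious."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, vectorize(counts))))

# Invented toy training set: label 1 = malicious, 0 = benign.
docs = [
    {"JavaScript": 3, "JS": 3, "OpenAction": 1},
    {"JavaScript": 2, "JS": 2, "OpenAction": 1},
    {},
    {"OpenAction": 1},
]
labels = [1, 1, 0, 0]
w, b = train([vectorize(d) for d in docs], labels)
print(predict(w, b, docs[0]), predict(w, b, {}))
```

The dimensionality of each vector equals the number of malicious feature keywords, matching the mapping described above; a production classifier would of course be trained on a real corpus rather than four hand-written samples.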
In summary, the present disclosure extracts all feature keywords in a PDF document, including feature keywords that can be read directly from plaintext objects and feature keywords that are compressed or encrypted in stream objects. On the one hand, this solves the problem in the prior art that malicious PDF documents escape detection because feature keywords are extracted incompletely; on the other hand, all feature keywords of the PDF document are obtained for machine learning training, so that the detection result of the classifier obtained by the training is more accurate.
In addition, the feature keywords are first screened based on a threshold; the association between feature keywords is then calculated to determine the maximum feature group; malicious PDF documents in which the number of occurrences of feature keywords has been modified are then input, and unstable features are removed. This improves the stability of the classifier obtained after machine learning training and thereby the accuracy of malicious PDF document detection.
Fig. 3 shows a malicious PDF document detection device provided by the present disclosure, which includes:
an obtaining module 310, configured to obtain A feature keywords of a plaintext object in a Portable Document Format (PDF) document; acquire root node information of the PDF document; and determine B feature keywords included in the stream object indicated by the root node information;
an identifying module 320, configured to perform malicious feature keyword identification on M feature keywords to obtain a plurality of malicious feature keywords, where the M feature keywords include the A feature keywords and the B feature keywords;
a training module 330, configured to determine the plurality of malicious feature keywords as training samples of a malicious PDF document identification model.
Optionally, the identifying module 320 is specifically configured to calculate, for the M feature keywords, a first number of occurrences of each feature keyword in the PDF document;
determine, from the M feature keywords, a plurality of first feature keywords whose number of occurrences is greater than or equal to a preset number of times;
and cyclically perform malicious feature keyword identification on the plurality of first feature keywords until a plurality of malicious feature keywords are obtained.
Optionally, the identifying module 320 is specifically configured to cyclically execute the following steps on the plurality of first feature keywords to identify malicious feature keywords, until the remaining feature keywords are the plurality of malicious feature keywords:
determining at least one feature group according to the association values of any plurality of feature keywords among the plurality of first feature keywords, wherein each feature group comprises one feature keyword or a plurality of associated feature keywords;
taking the detection result as the dependent variable and the T feature keywords in a target feature group as independent variables, and establishing a T-variable function according to the relation between the target feature group and the detection result, wherein the target feature group is the feature group with the largest number of feature keywords among the at least one feature group;
taking the derivative with respect to the target independent variable, and calculating the target number of occurrences corresponding to a target feature keyword among the T feature keywords, wherein the target feature keyword is any one of the T feature keywords;
modifying the number of occurrences of the target feature keyword in the malicious PDF document to a second number of occurrences, wherein the second number of occurrences is different from the target number of occurrences;
detecting whether the target feature keyword is a malicious feature keyword;
and if the target feature keyword is not a malicious feature keyword, deleting the target feature keyword from the at least one feature group, and taking the remaining feature keywords as the plurality of first feature keywords.
Optionally, the dictionary of the stream object includes a decoding mode;
the obtaining module 310 is specifically configured to perform decryption or decompression processing on the stream object according to the decoding mode;
extract the indirect reference objects included in the stream object;
and determine the B feature keywords from the indirect reference objects.
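The decompression path of this step can be sketched as follows, assuming the decoding mode is the standard PDF `FlateDecode` filter (zlib/deflate). Decryption, other filters, and hex-escaped names such as `/J#53` are not handled; the keyword subset, the regular expression, and the function names are hypothetical.

```python
import re
import zlib

# Assumed subset of feature keywords, matched as PDF name objects.
NAME_RE = re.compile(rb"/(JavaScript|JS|OpenAction|AA|Launch|URI)\b")

def decode_stream(raw, filter_name):
    """Decode raw stream bytes according to the filter named in the dictionary."""
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    raise ValueError(f"unsupported filter in this sketch: {filter_name}")

def count_keywords(data):
    """Count occurrences of the feature keywords in decoded stream data."""
    counts = {}
    for match in NAME_RE.finditer(data):
        name = match.group(1).decode("ascii")
        counts[name] = counts.get(name, 0) + 1
    return counts

# Build a compressed stream body that hides an action dictionary.
payload = b"<< /S /JavaScript /JS (app.alert(1)) >>"
raw = zlib.compress(payload)

decoded = decode_stream(raw, "FlateDecode")
print(count_keywords(decoded))
```

Scanning the raw compressed bytes would find none of these names, which is the reason the keywords must be extracted after decoding.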
Optionally, the dictionary of the stream object includes the original number of indirect reference objects;
the obtaining module 310 is specifically configured to determine the current number of indirect reference objects included in the stream object indicated by the root node information;
and, if the current number is equal to the original number, acquire the B feature keywords included in the indirect reference objects.
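One way to sketch this consistency check is shown below, under the assumption that the "original number" corresponds to the `/N` entry of a PDF object stream dictionary and that the "current number" can be read from the header of the decoded stream, which holds one pair of integers (object number, offset) per contained object before the byte offset given by `/First`. The disclosure does not name these entries, and the function name is hypothetical.

```python
import re

def objstm_count_matches(stream_dict, decoded):
    """Compare the declared object count (/N) with the pairs in the header."""
    n_match = re.search(rb"/N\s+(\d+)", stream_dict)
    first_match = re.search(rb"/First\s+(\d+)", stream_dict)
    if not n_match or not first_match:
        return False
    declared = int(n_match.group(1))
    first = int(first_match.group(1))
    header_tokens = decoded[:first].split()
    # The header holds two integers per object: object number and offset.
    return len(header_tokens) == 2 * declared

stream_dict = b"<< /Type /ObjStm /N 2 /First 11 /Filter /FlateDecode >>"
decoded = b"11 0 12 21\n<< /S /JavaScript >> << /Type /Action >>"

print(objstm_count_matches(stream_dict, decoded))
```

A mismatch between the declared and observed counts suggests the stream was truncated or tampered with, in which case keyword extraction from it would be skipped.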
Optionally, the obtaining module 310 is specifically configured to obtain a file end tag from a PDF document;
positioning the position of the file tail from the PDF document according to the file tail label;
acquiring file tail information based on the position of the file tail;
and determining root node information from the file tail information.
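The steps above can be sketched as follows, assuming the file end tag is the `trailer` keyword of a classic PDF cross-reference section and that the root node information is the indirect reference stored under `/Root` in the trailer dictionary. Files that use cross-reference streams have no `trailer` keyword and are not covered; the function name and the sample bytes are hypothetical.

```python
import re

def find_root_reference(pdf_bytes):
    """Return (object number, generation) of /Root from the last trailer, or None."""
    pos = pdf_bytes.rfind(b"trailer")  # the last trailer reflects the latest update
    if pos == -1:
        return None
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf_bytes[pos:])
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))

sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"trailer\n<< /Size 6 /Root 1 0 R >>\n"
    b"startxref\n416\n%%EOF\n"
)
print(find_root_reference(sample))
```

Searching from the end of the file matters for incrementally updated documents, where earlier trailers may point at a superseded catalog.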
Optionally, the obtaining module 310 is specifically configured to obtain a start mark from a PDF document;
determining a field corresponding to the start mark;
and determining the root node information from the field corresponding to the start mark.
It should be noted that, in the embodiment of the malicious PDF document detection apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present disclosure.
As shown in Fig. 4, an embodiment of the present disclosure provides an electronic device, including a processor 401, a memory 402, and a computer program stored in the memory 402 and executable on the processor 401. When executed by the processor 401, the computer program implements each process of the malicious PDF document detection method described above and achieves the same technical effects; to avoid repetition, details are not repeated here.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the malicious PDF document detection method described above and achieves the same technical effects; to avoid repetition, details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
From the above description of the embodiments, it is obvious for a person skilled in the art that the present disclosure can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A malicious PDF document detection method, characterized by comprising the following steps:
obtaining A feature keywords of a plaintext object in a Portable Document Format (PDF) document;
acquiring root node information of the PDF document;
determining B feature keywords included in the stream object indicated by the root node information;
performing malicious feature keyword identification on M feature keywords to obtain a plurality of malicious feature keywords, wherein the M feature keywords comprise the A feature keywords and the B feature keywords;
and determining the plurality of malicious feature keywords as training samples of a malicious PDF document identification model.
2. The method according to claim 1, wherein the performing malicious feature keyword identification on the M feature keywords to obtain a plurality of malicious feature keywords comprises:
calculating, for the M feature keywords, a first number of occurrences of each feature keyword in the PDF document;
determining, from the M feature keywords, a plurality of first feature keywords whose number of occurrences is greater than or equal to a preset number of times;
and cyclically performing malicious feature keyword identification on the plurality of first feature keywords until the plurality of malicious feature keywords are obtained.
3. The method of claim 2, wherein the cyclically performing malicious feature keyword identification on the plurality of first feature keywords until the plurality of malicious feature keywords are obtained comprises: cyclically executing the following steps on the plurality of first feature keywords to identify malicious feature keywords, until the remaining feature keywords are the plurality of malicious feature keywords:
determining at least one feature group according to the association values of any plurality of feature keywords among the plurality of first feature keywords, wherein each feature group comprises one feature keyword or a plurality of associated feature keywords;
taking the detection result as the dependent variable and the T feature keywords in a target feature group as independent variables, and establishing a T-variable function according to the relation between the target feature group and the detection result, wherein the target feature group is the feature group with the largest number of feature keywords among the at least one feature group;
taking the derivative with respect to the target independent variable, and calculating the target number of occurrences corresponding to a target feature keyword among the T feature keywords, wherein the target feature keyword is any one of the T feature keywords;
modifying the number of occurrences of the target feature keyword in the malicious PDF document to a second number of occurrences, wherein the second number of occurrences is different from the target number of occurrences;
detecting whether the target feature keyword is a malicious feature keyword;
and if the target feature keyword is not a malicious feature keyword, deleting the target feature keyword from the at least one feature group, and taking the remaining feature keywords as the plurality of first feature keywords.
4. The method of claim 1, wherein the dictionary of the stream object includes a decoding mode;
the determining B feature keywords included in the stream object indicated by the root node information includes:
performing decryption or decompression processing on the stream object according to the decoding mode;
extracting the indirect reference objects included in the stream object;
and determining the B feature keywords from the indirect reference objects.
5. The method of claim 1, wherein the dictionary of stream objects includes an original number of indirect reference objects;
the determining B feature keywords included in the stream object indicated by the root node information includes:
determining the current number of indirect reference objects contained in the stream object indicated by the root node information;
and if the current number is equal to the original number, acquiring the B feature keywords included in the indirect reference object.
6. The method according to claim 1, wherein the obtaining root node information of the PDF document comprises:
acquiring a file tail label from the PDF document;
positioning the position of the file tail from the PDF document according to the file tail label;
acquiring the file tail information based on the position of the file tail;
and determining root node information from the file tail information.
7. The method according to claim 1, wherein the obtaining root node information of the PDF document comprises:
acquiring a start mark from the PDF document;
determining a field corresponding to the starting mark;
and determining root node information from a field corresponding to the start mark.
8. A malicious PDF document detection apparatus, comprising:
an acquisition module, configured to obtain A feature keywords of a plaintext object in a Portable Document Format (PDF) document; acquire root node information of the PDF document; and determine B feature keywords included in the stream object indicated by the root node information;
an identification module, configured to perform malicious feature keyword identification on M feature keywords to obtain a plurality of malicious feature keywords, where the M feature keywords include the A feature keywords and the B feature keywords;
and a training module, configured to determine the plurality of malicious feature keywords as training samples of a malicious PDF document identification model.
9. An electronic device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing a malicious PDF document detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, comprising: the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements a malicious PDF document detection method according to any one of claims 1 to 7.
CN202111328921.7A 2021-11-10 2021-11-10 Malicious PDF document detection method and device and electronic equipment Pending CN113987500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111328921.7A CN113987500A (en) 2021-11-10 2021-11-10 Malicious PDF document detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111328921.7A CN113987500A (en) 2021-11-10 2021-11-10 Malicious PDF document detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113987500A true CN113987500A (en) 2022-01-28

Family

ID=79747804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111328921.7A Pending CN113987500A (en) 2021-11-10 2021-11-10 Malicious PDF document detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113987500A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975865A (en) * 2023-08-11 2023-10-31 北京天融信网络安全技术有限公司 Malicious Office document detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination