CN114912114A - Malicious PDF document detection method, device, equipment and medium - Google Patents

Malicious PDF document detection method, device, equipment and medium

Info

Publication number
CN114912114A
Authority
CN
China
Prior art keywords
pdf document
detected
keyword
determining
type
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210511534.5A
Other languages
Chinese (zh)
Inventor
徐晓
黄娜
薛智慧
余小军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN202210511534.5A
Publication of CN114912114A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

The disclosure relates to the technical field of computers and provides a malicious PDF document detection method. The method comprises the following steps: acquiring a root object and a logical structure of a PDF document to be detected; determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure; acquiring second keywords contained in a first keyword of the at least one leaf object, wherein the second keywords comprise first type keywords and second type keywords, the first type keywords comprise one or more of /JS, /JavaScript, /URI, /Launch and /F, and the second type keywords comprise stream; determining a preset type corresponding to the PDF document to be detected according to the second keywords, wherein the preset type comprises one or more of JavaScript code, an embedded file and an embedded link; and detecting, according to the preset type, whether the PDF document to be detected is a malicious PDF document. The method can improve the accuracy and efficiency of malicious PDF document detection.

Description

Malicious PDF document detection method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting a malicious PDF document.
Background
The Portable Document Format (PDF) is a file format for exchanging documents independently of the application program, operating system, and hardware. Because it does not depend on the underlying environment, it is widely used for exchanging files in business and office settings.
In the prior art, malicious PDF documents are detected by static detection methods, such as signature-based matching and machine learning based on metadata and structural features. However, because the malicious PDF samples stored in a signature library are limited, and extracting PDF document features takes a long time, these methods suffer from low detection accuracy and low detection efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a device and a medium for detecting a malicious PDF document.
In a first aspect, an embodiment of the present disclosure provides a malicious PDF document detection method, comprising:
acquiring a root object and a logical structure of a PDF document to be detected;
determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure;
acquiring a second keyword contained in a first keyword of the at least one leaf object, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch and /F, and the second type keyword comprises stream;
determining a preset type corresponding to the PDF document to be detected according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file and an embedded link;
and detecting whether the PDF document to be detected is a malicious PDF document according to the preset type.
In one embodiment, determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure includes:
determining an object reference relationship tree of the PDF document to be detected according to the root object and the logical structure;
and determining at least one leaf object corresponding to the PDF document to be detected based on the object reference relationship tree.
In one embodiment, acquiring the second keyword contained in the first keyword of the at least one leaf object includes:
extracting the current target object content of the leaf object;
judging, based on the target object content and the initial object content, whether the target object content has been incrementally updated;
and when it is determined that the target object content has been incrementally updated, acquiring the second keyword contained in the first keyword of the at least one leaf object based on the target object content.
In an embodiment, when the second keyword is the first type keyword, determining the preset type corresponding to the PDF document to be detected according to the second keyword includes:
when the second keyword is at least one of /JS and /JavaScript, determining that the PDF document to be detected contains JavaScript code;
when the second keyword is at least one of /Launch and /F, determining that the PDF document to be detected contains an embedded file; and/or
when the second keyword is /URI, determining that the PDF document to be detected contains an embedded link.
In an embodiment, when the second keyword is a second type keyword, the determining, according to the second keyword, the preset type corresponding to the PDF document to be detected includes:
judging whether the content corresponding to the stream is a content stream corresponding to the PDF document to be detected;
and if so, analyzing the content stream to determine a preset type corresponding to the PDF document to be detected.
In an embodiment, acquiring the root object of the PDF document to be detected includes:
judging whether the PDF document to be detected has a file trailer tag;
and when it is determined that the PDF document to be detected has the file trailer tag, determining the root object of the PDF document to be detected based on the file trailer tag.
In one embodiment, the method further comprises:
when it is determined that the PDF document to be detected does not have the file trailer tag, acquiring a preset tag of the PDF document to be detected;
and determining the root object of the PDF document to be detected based on the preset tag.
In a second aspect, an embodiment of the present disclosure provides a malicious PDF document detection apparatus, including:
an acquisition module, configured to acquire a root object and a logical structure of a PDF document to be detected;
a leaf object determining module, configured to determine, according to the root object and the logical structure, at least one leaf object corresponding to the PDF document to be detected;
a keyword obtaining module, configured to obtain a second keyword contained in a first keyword of the at least one leaf object, where the second keyword includes a first type keyword and a second type keyword, the first type keyword includes one or more of /JS, /JavaScript, /URI, /Launch and /F, and the second type keyword includes stream;
a preset type determining module, configured to determine a preset type corresponding to the PDF document to be detected according to the second keyword, where the preset type includes one or more of JavaScript code, an embedded file, and an embedded link;
and a detection module, configured to detect whether the PDF document to be detected is a malicious PDF document according to the preset type.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of the first aspect.
In a fourth aspect, the embodiments of the present disclosure provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method according to any one of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
According to the malicious PDF document detection method provided by the embodiment of the present disclosure, a root object and a logical structure of a PDF document to be detected are acquired; at least one leaf object corresponding to the PDF document to be detected is determined according to the root object and the logical structure; a second keyword contained in a first keyword of the at least one leaf object is acquired, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch and /F, and the second type keyword comprises stream; a preset type corresponding to the PDF document to be detected is determined according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file and an embedded link; and whether the PDF document to be detected is a malicious PDF document is detected according to the preset type. In this way, the JavaScript code, embedded files and embedded links contained in the PDF document to be detected are determined merely by acquiring, from its leaf objects, several keywords related to malicious PDF content, and whether the document is malicious is then determined. The keywords of all objects contained in the PDF document to be detected need not be acquired. This avoids the prior-art problems that the malicious PDF samples in a signature library are limited and that extracting the features of the PDF document to be detected is time-consuming, thereby improving the accuracy and efficiency of malicious PDF document detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Those skilled in the art can obviously derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a method for detecting a malicious PDF document according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of an object reference relationship tree of a PDF document according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a malicious PDF document detection apparatus according to an embodiment of the present disclosure;
Fig. 4 is an internal structure diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The Portable Document Format (PDF) is a file format for exchanging documents independently of the application program, operating system, and hardware. Because it does not depend on the underlying environment, it is widely used for exchanging files in business and office settings. In the prior art, malicious PDF documents are detected by static detection methods, such as signature-based matching and machine learning based on metadata and structural features.
However, because the malicious PDF samples stored in a signature library are limited, and extracting PDF document features takes a long time, these methods suffer from low detection accuracy and low detection efficiency.
Based on the above, the present disclosure provides a malicious PDF document detection method, in which a root object and a logical structure of a PDF document to be detected are acquired; at least one leaf object corresponding to the PDF document to be detected is determined according to the root object and the logical structure; a second keyword contained in a first keyword of the at least one leaf object is acquired, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch and /F, and the second type keyword comprises stream; a preset type corresponding to the PDF document to be detected is determined according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file and an embedded link; and whether the PDF document to be detected is a malicious PDF document is detected according to the preset type. In this way, the JavaScript code, embedded files and embedded links contained in the PDF document to be detected are determined merely by acquiring, from its leaf objects, several keywords related to malicious PDF content, and whether the document is malicious is then determined. The keywords of all objects contained in the PDF document to be detected need not be acquired. This avoids the prior-art problems that the malicious PDF samples in a signature library are limited and that extracting the features of the PDF document to be detected is time-consuming, thereby improving the accuracy and efficiency of malicious PDF document detection.
In an embodiment, Fig. 1 is a schematic flowchart of a malicious PDF document detection method according to an embodiment of the present disclosure. As shown in Fig. 1, the method specifically includes the following steps:
S11: Acquire a root object and a logical structure of the PDF document to be detected.
The root object connects the physical structure and the logical structure of the PDF document to be detected and is the starting point for parsing the document. The logical structure is determined by the logical relationships between the objects contained in the PDF document to be detected.
Specifically, the PDF document to be detected is parsed to obtain its root object, and the logical relationships between the objects contained in the document are analyzed to determine its logical structure.
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, one implementation of acquiring the root object of the PDF document to be detected may be:
Judge whether the PDF document to be detected has a file trailer tag.
When it is determined that the PDF document to be detected has the file trailer tag, determine the root object of the PDF document to be detected based on the file trailer tag.
The file trailer tag is trailer. The end of the PDF document to be detected can be located through this tag, so that the trailer information is obtained, and the root object of the PDF document to be detected is then determined from the trailer information.
Specifically, the PDF document to be detected is parsed to judge whether it contains the file trailer tag trailer. When the tag exists, the end of the file is located according to the tag and the trailer information is obtained, and the root object of the PDF document to be detected is determined from the trailer information.
For example, if the file trailer tag trailer is found in the PDF document to be detected, the end of the file is located through it and the trailer information is obtained, such as "trailer << /Size 24 /Prev 14422 /Root 13 0 R /Info 1 0 R >>". From the keyword Root in the trailer information, the root object is determined to be object 13 with generation (version) number 0. The present disclosure is not limited thereto, and those skilled in the art can make specific settings according to the actual situation.
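The trailer lookup described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name root_from_trailer, the regular expression, and the choice to read the last trailer in the file are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Matches the /Root entry of a trailer dictionary, e.g. "/Root 13 0 R".
ROOT_REF = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")


def root_from_trailer(pdf_bytes: bytes) -> Optional[Tuple[int, int]]:
    """Return (object number, generation number) of the root object, or None."""
    # Assumption: the last "trailer" keyword is used, because later trailers
    # are appended after earlier ones in the file.
    pos = pdf_bytes.rfind(b"trailer")
    if pos == -1:
        return None  # no file trailer tag found
    match = ROOT_REF.search(pdf_bytes, pos)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = b"trailer\n<< /Size 24 /Prev 14422 /Root 13 0 R /Info 1 0 R >>\nstartxref\n0\n%%EOF"
print(root_from_trailer(sample))  # (13, 0)
```

A return value of None signals that the trailer-based lookup failed, which is the case the alternative implementation below addresses.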
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, another implementation of acquiring the root object of the PDF document to be detected may be:
When it is determined that the PDF document to be detected does not have a file trailer tag, acquire a preset tag of the PDF document to be detected.
Determine the root object of the PDF document to be detected based on the preset tag.
The preset tag is tag information whose keyword is Catalog.
Specifically, the PDF document to be detected is parsed. When the file trailer tag trailer does not exist in the document, the preset tag Catalog is acquired, and the root object of the PDF document to be detected is determined according to the preset tag Catalog.
For example, when it is determined that the PDF document to be detected has no file trailer tag trailer, the preset tag Catalog is found in the object "13 0 obj << /Type /Catalog /Pages 3 0 R /Names 18 0 R /OpenAction 22 0 R >> endobj", so this object is the root object. The present disclosure is not limited thereto, and those skilled in the art can make specific settings according to the actual situation.
It should be noted that a malicious PDF document may delete the file trailer tag trailer in order to evade parsing, which reduces the accuracy of malicious PDF document detection. Therefore, when it is determined that the PDF document to be detected does not have the file trailer tag trailer, the root object is determined from the preset tag instead, so that detection accuracy is not reduced.
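The fallback described above can be sketched as a scan for the first object whose dictionary declares /Type /Catalog. The function name root_from_catalog and both regular expressions are assumptions for illustration; a real parser would also need to handle compressed objects, which this sketch ignores.

```python
import re
from typing import Optional, Tuple

# An indirect object: "<number> <generation> obj ... endobj".
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# The preset tag: a dictionary entry "/Type /Catalog".
CATALOG = re.compile(rb"/Type\s*/Catalog\b")


def root_from_catalog(pdf_bytes: bytes) -> Optional[Tuple[int, int]]:
    """Return (object number, generation number) of the first Catalog object."""
    for match in OBJ.finditer(pdf_bytes):
        if CATALOG.search(match.group(3)):
            return int(match.group(1)), int(match.group(2))
    return None


no_trailer = (
    b"13 0 obj\n<< /Type /Catalog /Pages 3 0 R /Names 18 0 R /OpenAction 22 0 R >>\nendobj\n"
    b"3 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)
print(root_from_catalog(no_trailer))  # (13, 0)
```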
S12: Determine at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure.
A leaf object is one of the objects contained in the PDF document to be detected, specifically a terminal object that is ultimately executed.
Specifically, after the root object and the logical structure of the PDF document to be detected are obtained by parsing the document, at least one leaf object corresponding to the document can be determined according to the root object and the logical structure.
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, one implementation of S12 may be:
Determine the object reference relationship tree of the PDF document to be detected according to the root object and the logical structure.
The object reference relationship tree is determined by the logical structure among the objects contained in the PDF document to be detected, and specifically represents the reference relationships among those objects.
For example, as shown in Fig. 2, an object reference relationship tree of the PDF document to be detected is constructed from the obtained root object and logical structure. The trailer information of the PDF document to be detected is "trailer << /Size 24 /Prev 14422 /Root 1 0 R /Info 5 0 R >>", so the root object is determined to be object 1 with generation (version) number 0. Based on the content of object 1, "1 0 obj << /Type /Catalog /Pages 2 0 R /Names 3 0 R /OpenAction 4 0 R >> endobj", object 1 references objects 2 0 obj, 3 0 obj, and 4 0 obj. The object contents of objects 2 0 obj, 3 0 obj, and 4 0 obj are obtained in turn: the other objects referenced in object 2 0 obj are traversed until that traversal ends, and then the other objects referenced in objects 3 0 obj and 4 0 obj are traversed in turn. This determines the logical structure of the PDF document to be detected and yields its object reference relationship tree. The present disclosure is not limited thereto, and those skilled in the art can make specific settings according to the actual situation.
Determine at least one leaf object corresponding to the PDF document to be detected based on the object reference relationship tree.
Specifically, one or more leaf objects corresponding to the PDF document to be detected are determined according to the constructed object reference relationship tree.
For example, as shown in Fig. 2, according to the object reference relationship tree constructed from the root object 1 0 obj, the leaf objects can be determined to be objects 2 0 obj, 4 0 obj, 6 0 obj, and 9 0 obj. The present disclosure is not limited thereto, and those skilled in the art can make specific settings according to the actual situation.
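One way to picture the traversal is a depth-first walk over "N G R" references starting from the root object. The sketch below is an assumption-laden illustration: the helper names, the regular expressions, the rule that a leaf is an object with no outgoing references, and the decision to ignore /Parent back-references (which would otherwise create cycles) are not taken from the disclosure, and the sample document is not the one in Fig. 2.

```python
import re
from typing import Dict, List

OBJ = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)endobj", re.DOTALL)
REF = re.compile(rb"(\d+)\s+\d+\s+R\b")               # indirect reference "N G R"
PARENT = re.compile(rb"/Parent\s+\d+\s+\d+\s+R\b")    # back-reference to skip


def build_reference_map(pdf_bytes: bytes) -> Dict[int, List[int]]:
    """Map each object number to the object numbers it references."""
    refs: Dict[int, List[int]] = {}
    for match in OBJ.finditer(pdf_bytes):
        body = PARENT.sub(b"", match.group(2))  # assumption: drop /Parent links
        refs[int(match.group(1))] = [int(n) for n in REF.findall(body)]
    return refs


def leaf_objects(refs: Dict[int, List[int]], root: int) -> List[int]:
    """Return the objects reachable from root that reference no other object."""
    leaves, seen, stack = [], set(), [root]
    while stack:
        num = stack.pop()
        if num in seen:
            continue
        seen.add(num)
        children = refs.get(num, [])  # objects missing from the file count as leaves
        if children:
            stack.extend(children)
        else:
            leaves.append(num)
    return sorted(leaves)


doc = (
    b"1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >> endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
    b"3 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
    b"4 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n"
)
print(leaf_objects(build_reference_map(doc), 1))  # [3, 4]
```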
S13: Acquire a second keyword contained in the first keyword of the at least one leaf object.
The second keyword includes a first type keyword and a second type keyword. The first type keyword includes one or more of /JS, /JavaScript, /URI, /Launch and /F, the second type keyword includes stream, and the first keyword specifically refers to Names.
Specifically, each leaf object is parsed to obtain the second keyword contained in the first keyword Names of that leaf object.
It should be noted that the second keyword is keyword information that a malicious PDF document needs in order to carry out an attack. Therefore, in this embodiment, several keywords related to malicious PDF documents are acquired as second keywords, so that missed detections are avoided when the PDF document to be detected is checked.
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, one implementation of S13 may be:
Extract the current target object content of the leaf object.
The object content of a leaf object is read as a byte stream, so object content corresponding to multiple leaf objects can be obtained; the target object content is the most recently obtained object content of the leaf object.
Judge whether the target object content has been incrementally updated based on the target object content and the initial object content.
The initial object content is the object content obtained the first time the object content of the leaf object is acquired.
Specifically, whether the target object content has been incrementally updated is judged by comparing the target object content obtained for the current leaf object with the initial object content obtained for that leaf object the first time.
It should be noted that a change in the object content of a leaf object changes the logical structure between the objects contained in the PDF document to be detected. Therefore, to ensure the accuracy of malicious PDF document detection, it is necessary to further judge whether the currently acquired object content of the leaf object has been incrementally updated.
When it is determined that the target object content has been incrementally updated, acquire the second keyword contained in the first keyword of the at least one leaf object based on the target object content.
Specifically, when the comparison of the current target object content with the initial object content shows that the target object content has been incrementally updated, the second keyword contained in the first keyword Names of each leaf object is acquired from the target object content, such as one or more of /JS, /JavaScript, /URI, /Launch, /F and stream.
For example, for object 2 0 obj, the initial object content is "2 0 obj << /Type /Catalog /Pages 2 0 R /Names 3 0 R >> endobj" and the current target object content is "2 0 obj << /Type /Catalog /Pages 2 0 R /Names 3 0 R /OpenAction 4 0 R >> endobj". Comparing the two shows that the target object content has been incrementally updated, so the second keyword contained in the first keyword Names of the leaf objects, such as /JavaScript, /JS, /URI, /Launch, /F or stream, is acquired from the current target object content "2 0 obj << /Type /Catalog /Pages 2 0 R /Names 3 0 R /OpenAction 4 0 R >> endobj". The present disclosure is not limited thereto, and those skilled in the art can make specific settings according to the actual situation.
In this way, the malicious PDF document detection method provided by the embodiment of the present disclosure acquires the multiple second keywords corresponding to the first keyword in each of the leaf objects, so that missed detections are avoided when the PDF document to be detected is checked.
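The comparison of initial and target object content can be sketched by collecting every definition of an object number in file order, treating the first as the initial content and the last as the target content. The function names, the regular expressions, and the sample bytes are assumptions for illustration. Unlike the text above, the sketch always reads keywords from the latest version, which equals the initial version when no update exists, and it does not model the lookup through the first keyword Names.

```python
import re
from typing import Dict, List, Tuple

OBJ = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)endobj", re.DOTALL)
# First type keywords must not be followed by another name character, so that
# "/F" does not match "/Filter"; "stream" must be a whole word, so
# "endstream" does not match.
KEYWORDS = re.compile(rb"/(?:JavaScript|JS|URI|Launch|F)(?![A-Za-z0-9])|\bstream\b")


def object_versions(pdf_bytes: bytes) -> Dict[int, List[bytes]]:
    """Collect every definition of each object number, in file order."""
    versions: Dict[int, List[bytes]] = {}
    for match in OBJ.finditer(pdf_bytes):
        versions.setdefault(int(match.group(1)), []).append(match.group(2).strip())
    return versions


def inspect_leaf(pdf_bytes: bytes, obj_num: int) -> Tuple[bool, List[str]]:
    """Return (incrementally updated?, second keywords in the latest version)."""
    bodies = object_versions(pdf_bytes).get(obj_num, [])
    if not bodies:
        return False, []
    initial, target = bodies[0], bodies[-1]
    keywords = sorted({k.decode() for k in KEYWORDS.findall(target)})
    return initial != target, keywords


updated = (
    b"4 0 obj\n<< /S /URI /URI (http://example.com) >>\nendobj\n"
    b"4 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
print(inspect_leaf(updated, 4))  # (True, ['/JS', '/JavaScript'])
```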
S14: Determine the preset type corresponding to the PDF document to be detected according to the second keyword.
The preset type includes one or more of JavaScript code, an embedded file and an embedded link.
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, when the second keyword is a first type keyword, one implementation of S14 may be:
When the second keyword is at least one of /JS and /JavaScript, determine that the PDF document to be detected contains JavaScript code.
Specifically, if the acquired second keyword includes /JS, or /JavaScript, or both /JS and /JavaScript, it is determined that the PDF document to be detected contains JavaScript code.
Optionally, when the second keyword is at least one of /Launch and /F, determine that the PDF document to be detected contains an embedded file.
Specifically, if the acquired second keyword includes /Launch, or /F, or both /Launch and /F, it is determined that the PDF document to be detected contains an embedded file.
Optionally, when the second keyword is /URI, determine that the PDF document to be detected contains an embedded link.
Specifically, when the acquired second keyword includes /URI, it is determined that the PDF document to be detected contains an embedded link.
It should be noted that the PDF document to be detected may contain any one of JavaScript code, an embedded file and an embedded link, any two of them, or all three at the same time.
Therefore, the JavaScript code, embedded files and embedded links contained in the PDF document to be detected are identified by acquiring multiple second keywords, so that missed detections are avoided.
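The keyword-to-type rules above amount to a small lookup table. In the sketch below, the type labels javascript_code, embedded_file and embedded_link and the function name preset_types are hypothetical names chosen for this example.

```python
from typing import Iterable, Set

# First type keyword -> preset type, following the rules described above.
PRESET_TYPES = {
    "/JS": "javascript_code",
    "/JavaScript": "javascript_code",
    "/Launch": "embedded_file",
    "/F": "embedded_file",
    "/URI": "embedded_link",
}


def preset_types(second_keywords: Iterable[str]) -> Set[str]:
    """Return the preset types implied by the first type keywords found."""
    # The second type keyword "stream" is not in the table; it needs the
    # separate stream analysis and is skipped here.
    return {PRESET_TYPES[k] for k in second_keywords if k in PRESET_TYPES}


print(preset_types(["/JS", "/JavaScript"]))  # {'javascript_code'}
```

Returning a set reflects that a single document may contain one, two, or all three preset types at once.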
Optionally, on the basis of the foregoing embodiments, in some embodiments of the present disclosure, when the second keyword is a second type keyword, another implementation of S14 may be:
Judge whether the content corresponding to the stream is a content stream corresponding to the PDF document to be detected.
The content corresponding to the stream may be a content stream or an object stream. Because the malicious content of a malicious PDF document is related only to the content stream of the PDF document to be detected, in order to improve detection efficiency it is necessary to further judge whether the stream is a content stream. When a content stream exists, the preset type corresponding to the PDF document to be detected is obtained from the content stream only.
For example, whether the content corresponding to the stream is a content stream of the PDF document to be detected may be judged by checking whether the /Type of the object is ObjStm. When the /Type of the object is not ObjStm, the stream is determined to be a content stream corresponding to the PDF document to be detected.
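The /Type check can be sketched as a test on the dictionary that precedes the stream keyword. The function name is_content_stream and the regular expression are assumptions; following the text above, any stream whose /Type is not ObjStm is treated as a content stream.

```python
import re

OBJSTM = re.compile(rb"/Type\s*/ObjStm\b")


def is_content_stream(obj_body: bytes) -> bool:
    """Return True if the object holds a stream whose /Type is not ObjStm."""
    # Split at the first "stream" keyword: everything before it is taken to be
    # the stream dictionary (an assumption that holds for simple objects).
    head, sep, _ = obj_body.partition(b"stream")
    if not sep:
        return False  # the object has no stream at all
    return OBJSTM.search(head) is None


print(is_content_stream(b"<< /Length 5 >>\nstream\nhello\nendstream"))  # True
```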
When the stream is determined to be a content stream corresponding to the PDF document to be detected, analyze the content stream to determine the preset type corresponding to the PDF document to be detected.
Specifically, when it is determined that the content corresponding to the stream is a content stream corresponding to the PDF document to be detected, the content stream is parsed, and the JavaScript code, embedded files and embedded links contained in the PDF document to be detected are determined from the JavaScript syntax, file markers and URL formats found by the parsing.
In this way, according to the malicious PDF document detection method provided by the embodiment of the present disclosure, the Javascript code, the embedded file, and the embedded link related to the malicious PDF document in the PDF document to be detected are determined according to the first type keyword, and the content stream is further analyzed, so that the Javascript code, the embedded file, and the embedded link related to the malicious PDF document in the PDF document to be detected are determined, and thus, the missing detection phenomenon is avoided, and the accuracy of the malicious PDF document is improved.
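The stream handling described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the disclosure: the function names `is_content_stream` and `classify_stream`, the regular expressions used as "JS syntax" and "URL format" heuristics, and the `MZ` and ELF magic numbers used as "file markers" are all assumptions introduced for the example, and the stream is assumed to have been decoded already.

```python
import re

# Assumed heuristics; the disclosure does not specify the exact patterns.
OBJSTM_RE = re.compile(rb"/Type\s*/ObjStm\b")
JS_RE = re.compile(rb"\b(?:eval|unescape|app\.\w+|this\.\w+)\s*\(")
URL_RE = re.compile(rb"https?://[^\s<>()\"']+")
FILE_MAGIC = (b"MZ", b"\x7fELF")  # PE and ELF headers, as example file markers


def is_content_stream(object_bytes: bytes) -> bool:
    """True when the object has a stream whose /Type is not ObjStm."""
    stream_pos = object_bytes.find(b"stream")
    if stream_pos == -1:
        return False  # no stream keyword in this object
    dictionary = object_bytes[:stream_pos]
    return OBJSTM_RE.search(dictionary) is None


def classify_stream(decoded: bytes) -> set:
    """Map decoded stream data to the preset types it appears to contain."""
    found = set()
    if JS_RE.search(decoded):
        found.add("javascript")
    if decoded.startswith(FILE_MAGIC):
        found.add("embedded_file")
    if URL_RE.search(decoded):
        found.add("embedded_link")
    return found
```

Only streams for which `is_content_stream` returns true would be passed on to `classify_stream`, which mirrors the efficiency argument above: object streams are skipped without further analysis.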
S15: Detecting whether the PDF document to be detected is a malicious PDF document according to the preset type.
Specifically, when the PDF document to be detected contains JavaScript code, a malicious code analysis method is used to detect whether the JavaScript code is malicious; when the JavaScript code is determined to be malicious, the PDF document to be detected is a malicious PDF document. Alternatively, when the PDF document to be detected contains an embedded file, it is determined whether the embedded file is an executable file; when it is, a malicious executable file analysis method is used to detect whether the embedded file is malicious, and when the embedded file is determined to be malicious, the PDF document to be detected is a malicious PDF document. Alternatively, when the PDF document to be detected contains an embedded link, the content corresponding to the embedded link is acquired and used to detect whether the embedded link is malicious; when the acquired content is malicious, the embedded link is determined to be malicious, and the PDF document to be detected is a malicious PDF document.
It should be noted that, for the malicious code analysis method and the malicious executable file analysis method, reference may be made to the prior art, and details are not repeated in this disclosure.
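The three branches of S15 can be sketched as a simple dispatch. In this hypothetical sketch the analysis methods are passed in as callables, because the disclosure leaves them to the prior art; the names `is_malicious_pdf`, `is_executable`, `fetch`, and the magic-number test for executables are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

Finding = Tuple[str, bytes]  # (preset type, extracted payload)


def is_executable(payload: bytes) -> bool:
    # Assumed check: PE ("MZ") or ELF magic number at the start of the file.
    return payload.startswith((b"MZ", b"\x7fELF"))


def is_malicious_pdf(
    findings: Iterable[Finding],
    js_is_malicious: Callable[[bytes], bool],
    exe_is_malicious: Callable[[bytes], bool],
    fetch: Callable[[bytes], bytes],
    content_is_malicious: Callable[[bytes], bool],
) -> bool:
    """Return True as soon as any extracted item is judged malicious."""
    for kind, payload in findings:
        if kind == "javascript" and js_is_malicious(payload):
            return True
        if kind == "embedded_file":
            # Only executable embedded files are analyzed further.
            if is_executable(payload) and exe_is_malicious(payload):
                return True
        if kind == "embedded_link":
            # The link is judged by the content it points to.
            if content_is_malicious(fetch(payload)):
                return True
    return False
```

Passing the analyzers as parameters keeps the dispatch independent of any particular malicious code or executable analysis technique.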
In this way, the malicious PDF document detection method provided by the embodiment of the present disclosure acquires a root object and a logical structure of a PDF document to be detected; determines at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure; acquires a second keyword contained in a first keyword of the at least one leaf object, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword comprises stream; determines a preset type corresponding to the PDF document to be detected according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file, and an embedded link; and detects whether the PDF document to be detected is a malicious PDF document according to the preset type. Therefore, the JavaScript code, embedded files, and embedded links contained in the PDF document to be detected are determined merely by acquiring, from the leaf objects of the PDF document to be detected, the keywords related to malicious PDF document content, and it is then determined whether the PDF document to be detected is a malicious PDF document. The keywords of all objects contained in the PDF document to be detected need not be acquired. This solves the prior-art problems that the malicious PDF document samples in a signature library are limited and that extracting features from the PDF document to be detected is time-consuming, thereby improving both the accuracy and the efficiency of malicious PDF document detection.
Fig. 3 shows a malicious PDF document detection device provided by an embodiment of the present disclosure, which includes: an acquisition module 11, a leaf object determination module 12, a keyword acquisition module 13, a preset type determination module 14, and a detection module 15.
The acquisition module 11 is configured to acquire a root object and a logical structure of a PDF document to be detected;
a leaf object determining module 12, configured to determine, according to the root object and the logical structure, at least one leaf object corresponding to the PDF document to be detected;
a keyword obtaining module 13, configured to obtain a second keyword included in a first keyword of at least one leaf object, where the second keyword includes a first type keyword and a second type keyword, the first type keyword includes one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword includes stream;
the preset type determining module 14 is configured to determine a preset type corresponding to the PDF document to be detected according to the second keyword, where the preset type includes one or more of JavaScript code, an embedded file, and an embedded link;
and the detection module 15 is configured to detect whether the PDF document to be detected is a malicious PDF document according to a preset type.
In the above embodiment, the leaf object determining module 12 is specifically configured to determine, according to the root object and the logical structure, an object reference relationship tree of the PDF document to be detected; and determining at least one leaf object corresponding to the PDF document to be detected based on the object reference relation tree.
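As a rough illustration of how leaf objects might be derived from an object reference relationship tree, the following Python sketch scans `N G obj ... endobj` bodies for `N G R` references and walks them breadth-first from the root. It is a simplification under stated assumptions: the helper names are hypothetical, generation numbers are ignored, references inside stream data are not filtered out, and a "leaf" is taken to mean an object that leads to no object not already visited (so back-references such as /Parent do not prevent an object from being a leaf).

```python
import re
from collections import deque
from typing import Dict, List

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def build_reference_graph(pdf_bytes: bytes) -> Dict[int, List[int]]:
    """Map each object number to the object numbers it references."""
    graph: Dict[int, List[int]] = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        graph[int(match.group(1))] = [int(r.group(1)) for r in REF_RE.finditer(body)]
    return graph


def leaf_objects(graph: Dict[int, List[int]], root: int) -> List[int]:
    """Breadth-first walk from the root; leaves discover no new objects."""
    seen = {root}
    queue = deque([root])
    leaves = []
    while queue:
        node = queue.popleft()
        new_children = []
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                new_children.append(child)
        if not new_children:
            leaves.append(node)
        queue.extend(new_children)
    return sorted(leaves)
```

Because only objects reachable from the root are visited, unreferenced objects are never examined, which is consistent with the goal of not inspecting every object in the document.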
In the above embodiment, the keyword obtaining module 13 is specifically configured to extract the current target object content of the leaf object; judging whether the target object content has incremental update or not based on the target object content and the initial object content; and when the target object content is determined to have incremental updating, acquiring a second keyword contained in the first keyword of at least one leaf object based on the target object content.
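One possible reading of the incremental update check is sketched below in Python. It assumes that an incremental update appends a new definition of an existing object number to the end of the file, so that the first definition found is the initial object content and the last one is the current target object content. That interpretation and the helper names `object_versions` and `latest_content` are assumptions for illustration, not details taken from the disclosure.

```python
import re
from typing import List, Optional, Tuple

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def object_versions(pdf_bytes: bytes, obj_num: int) -> List[bytes]:
    """All bodies defined for one object number, in file order."""
    return [
        match.group(3).strip()
        for match in OBJ_RE.finditer(pdf_bytes)
        if int(match.group(1)) == obj_num
    ]


def latest_content(pdf_bytes: bytes, obj_num: int) -> Tuple[Optional[bytes], bool]:
    """Return (target content, whether it differs from the initial content)."""
    versions = object_versions(pdf_bytes, obj_num)
    if not versions:
        return None, False
    initial, target = versions[0], versions[-1]
    return target, target != initial
```

Keywords would then be extracted from the returned target content, so that content introduced by a later update is not missed.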
In the above embodiment, the preset type determining module 14 is specifically configured to determine the JavaScript code in the PDF document to be detected when the second keyword is at least one of /JS and /JavaScript; when the second keyword is at least one of /Launch and /F, determining the embedded file in the PDF document to be detected; and/or when the second keyword is /URI, determining the embedded link in the PDF document to be detected.
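The mapping from first type keywords to preset types can be sketched in Python as a lookup table. The table follows the correspondence stated above; the function name `preset_types`, the type labels, and the tokenizing regular expression are hypothetical. Names are matched as whole tokens so that /F is not confused with longer names such as /Filter, and names written with #-escapes are not handled in this sketch.

```python
import re

# Keyword-to-type correspondence as stated in the description.
KEYWORD_TYPES = {
    b"JS": "javascript",
    b"JavaScript": "javascript",
    b"Launch": "embedded_file",
    b"F": "embedded_file",
    b"URI": "embedded_link",
}

# A PDF name runs from "/" up to the next whitespace or delimiter character.
NAME_RE = re.compile(rb"/([^\s()<>\[\]{}/%]+)")


def preset_types(leaf_bytes: bytes) -> set:
    """Return the preset types indicated by the names in a leaf object."""
    return {
        KEYWORD_TYPES[name]
        for name in NAME_RE.findall(leaf_bytes)
        if name in KEYWORD_TYPES
    }
```

A leaf object may yield more than one preset type, which corresponds to the "and/or" in the description.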
In the above embodiment, the preset type determining module 14 is specifically configured to determine whether content corresponding to a stream is a content stream corresponding to a PDF document to be detected; and if so, analyzing the content stream to determine the preset type corresponding to the PDF document to be detected.
In the above embodiment, the acquisition module 11 is specifically configured to determine whether the PDF document to be detected has a file end tag; and when determining that the PDF document to be detected has the file end tag, determining the root object of the PDF document to be detected based on the file end tag.
In the above embodiment, the acquisition module 11 is specifically configured to obtain a preset tag of the PDF document to be detected when it is determined that the PDF document to be detected does not have a file end tag; and determining a root object of the PDF document to be detected based on the preset tag.
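The two ways of locating the root object can be sketched in Python as follows. This passage does not say which byte sequences serve as the file end tag and the preset tag, so the sketch makes two loudly labeled assumptions: the file end tag is taken to be the `%%EOF` marker, with the root read from the `trailer` section preceding it, and the preset tag is taken to be the `/Root` key searched for anywhere in the file. The function name `find_root` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the root object is referenced as "/Root N G R".
ROOT_RE = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")


def find_root(pdf_bytes: bytes) -> Optional[int]:
    """Return the object number of the root object, or None if not found."""
    eof = pdf_bytes.rfind(b"%%EOF")  # assumed file end tag
    if eof != -1:
        trailer = pdf_bytes.rfind(b"trailer", 0, eof)
        if trailer != -1:
            match = ROOT_RE.search(pdf_bytes, trailer, eof)
            if match:
                return int(match.group(1))
    # Fallback when the end tag (or its trailer) is missing:
    # use the last /Root reference found anywhere (assumed preset tag).
    matches = ROOT_RE.findall(pdf_bytes)
    return int(matches[-1][0]) if matches else None
```

The fallback path matters for truncated or deliberately malformed files, which may omit the end marker yet still be opened by lenient readers.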
Thus, in this embodiment, the acquisition module 11 acquires the root object and the logical structure of the PDF document to be detected; the leaf object determining module 12 determines, according to the root object and the logical structure, at least one leaf object corresponding to the PDF document to be detected; the keyword obtaining module 13 obtains a second keyword included in a first keyword of the at least one leaf object, where the second keyword includes a first type keyword and a second type keyword, the first type keyword includes one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword includes stream; the preset type determining module 14 determines a preset type corresponding to the PDF document to be detected according to the second keyword, where the preset type includes one or more of JavaScript code, an embedded file, and an embedded link; and the detection module 15 detects whether the PDF document to be detected is a malicious PDF document according to the preset type. Therefore, the JavaScript code, embedded files, and embedded links contained in the PDF document to be detected are determined merely by acquiring, from the leaf objects of the PDF document to be detected, the keywords related to malicious PDF document content, and it is then determined whether the PDF document to be detected is a malicious PDF document. The keywords of all objects contained in the PDF document to be detected need not be acquired. This solves the prior-art problems that the malicious PDF document samples in a signature library are limited and that extracting features from the PDF document to be detected is time-consuming, thereby improving both the accuracy and the efficiency of malicious PDF document detection.
The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to executing that method.
It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device includes a processor 710, a memory 720, an input device 730, and an output device 740. The number of processors 710 in the electronic device may be one or more, and one processor 710 is taken as an example in fig. 4. The processor 710, the memory 720, the input device 730, and the output device 740 in the electronic device may be connected by a bus or by other means, and connection by a bus is taken as an example in fig. 4.
Memory 720, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the methods of embodiments of the present invention. The processor 710 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 720, namely, implements the method provided by the embodiment of the present invention.
The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 720 may further include memory located remotely from the processor 710, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device 740 may include a display device such as a display screen.
The disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are used to implement a method provided by an embodiment of the present invention, the method including:
acquiring a root object and a logic structure of a PDF document to be detected;
determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logic structure;
acquiring a second keyword contained in a first keyword of at least one leaf object, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword comprises stream;
determining a preset type corresponding to the PDF document to be detected according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file, and an embedded link;
and detecting whether the PDF document to be detected is a malicious PDF document or not according to the preset type.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It is noted that, in this document, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A malicious PDF document detection method is characterized by comprising the following steps:
acquiring a root object and a logic structure of a PDF document to be detected;
determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logic structure;
acquiring a second keyword contained in a first keyword of at least one leaf object, wherein the second keyword comprises a first type keyword and a second type keyword, the first type keyword comprises one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword comprises stream;
determining a preset type corresponding to the PDF document to be detected according to the second keyword, wherein the preset type comprises one or more of JavaScript code, an embedded file, and an embedded link;
and detecting whether the PDF document to be detected is a malicious PDF document or not according to the preset type.
2. The method according to claim 1, wherein determining at least one leaf object corresponding to the PDF document to be detected according to the root object and the logical structure comprises:
determining an object reference relationship tree of the PDF document to be detected according to the root object and the logic structure;
and determining at least one leaf object corresponding to the PDF document to be detected based on the object reference relation tree.
3. The method of claim 1, wherein the acquiring a second keyword contained in the first keyword of at least one of the leaf objects comprises:
extracting the current target object content of the leaf object;
judging whether incremental updating exists in the target object content or not based on the target object content and the initial object content;
and when determining that the target object content has incremental updating, acquiring a second keyword contained in the first keyword of at least one leaf object based on the target object content.
4. The method according to claim 1, wherein when the second keyword is the first type keyword, the determining the preset type corresponding to the PDF document to be detected according to the second keyword comprises:
when the second keyword is at least one of /JS and /JavaScript, determining the JavaScript code in the PDF document to be detected;
when the second keyword is at least one of /Launch and /F, determining the embedded file in the PDF document to be detected; and/or
when the second keyword is /URI, determining the embedded link in the PDF document to be detected.
5. The method according to claim 1, wherein when the second keyword is a second type keyword, the determining, according to the second keyword, the preset type corresponding to the PDF document to be detected comprises:
judging whether the content corresponding to the stream is a content stream corresponding to the PDF document to be detected;
and if so, analyzing the content stream to determine the preset type corresponding to the PDF document to be detected.
6. The method according to claim 1, wherein the obtaining a root object of the PDF document to be detected comprises:
judging whether the PDF document to be detected has a file end tag;
and when it is determined that the PDF document to be detected has the file end tag, determining a root object of the PDF document to be detected based on the file end tag.
7. The method of claim 6, further comprising:
when it is determined that the PDF document to be detected does not have the file end tag, acquiring a preset tag of the PDF document to be detected;
and determining a root object of the PDF document to be detected based on the preset tag.
8. A malicious PDF document detection device is characterized by comprising:
the acquisition module is used for acquiring a root object and a logic structure of the PDF document to be detected;
a leaf object determining module, configured to determine, according to the root object and the logical structure, at least one leaf object corresponding to the PDF document to be detected;
a keyword obtaining module, configured to obtain a second keyword included in a first keyword of at least one leaf object, where the second keyword includes a first type keyword and a second type keyword, the first type keyword includes one or more of /JS, /JavaScript, /URI, /Launch, and /F, and the second type keyword includes stream;
a preset type determining module, configured to determine a preset type corresponding to the PDF document to be detected according to the second keyword, where the preset type includes one or more of JavaScript code, an embedded file, and an embedded link;
and the detection module is used for detecting whether the PDF document to be detected is a malicious PDF document according to the preset type.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the malicious PDF document detection method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the malicious PDF document detection method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210511534.5A CN114912114A (en) 2022-05-11 2022-05-11 Malicious PDF document detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210511534.5A CN114912114A (en) 2022-05-11 2022-05-11 Malicious PDF document detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114912114A true CN114912114A (en) 2022-08-16

Family

ID=82767506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210511534.5A Pending CN114912114A (en) 2022-05-11 2022-05-11 Malicious PDF document detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114912114A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8205261B1 (en) * 2006-03-31 2012-06-19 Emc Corporation Incremental virus scan
CN103221960A (en) * 2012-12-10 2013-07-24 华为技术有限公司 Detection method and apparatus of malicious code
CN103279710A (en) * 2013-04-12 2013-09-04 深圳市易聆科信息技术有限公司 Method and system for detecting malicious codes of Internet information system
CN113111350A (en) * 2021-04-28 2021-07-13 北京天融信网络安全技术有限公司 Malicious PDF file detection method and device and electronic equipment
CN113536300A (en) * 2021-07-12 2021-10-22 杭州安恒信息技术股份有限公司 PDF file trust filtering and analyzing method, device, equipment and medium
CN114022889A (en) * 2021-11-15 2022-02-08 北京天融信网络安全技术有限公司 Malicious document detection method and device
CN114154150A (en) * 2021-11-05 2022-03-08 苏州浪潮智能科技有限公司 Virus detection method and device for pre-loading configuration of hijack system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曹祺: "《情报学视域下的数据研究:理论、原理与方法》", 31 October 2018, 武汉大学出版社 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220816