CN112231701A - PDF file processing method and device - Google Patents

PDF file processing method and device

Info

Publication number
CN112231701A
Authority
CN
China
Prior art keywords
file
pdf file
pdf
information
received
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011068435.1A
Other languages
Chinese (zh)
Inventor
李奏换
陈婉君
李香月
汪龙节
李振韬
梁维新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wilson Information Technology Co ltd
Original Assignee
Guangzhou Wilson Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wilson Information Technology Co ltd
Priority to CN202011068435.1A
Publication of CN112231701A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a PDF file processing method and device. The PDF file processing method comprises a method for identifying malicious codes in a PDF file and a PDF file saving method. The processing method scans a received first PDF file by using the pdfid.py tool of Python and judges whether the resulting file information contains a JavaScript field. Because malicious PDF files usually have JavaScript code nested in them, an abnormal condition of the received first PDF file can be found in time, avoiding the risk of the computer being infected when a user opens the file without first checking it for computer viruses.

Description

PDF file processing method and device
Technical Field
The invention relates to the technical field of PDF file processing, in particular to a PDF file processing method and device.
Background
With the development of information technology, many government or enterprise policy documents are issued as PDF documents instead of paper documents. For example, in a car sales company, a car sales merchant may often receive a PDF file describing a new car sales rebate policy from a higher-level sales outlet, so that the merchant can adjust its sales policy according to the new policy file.
In the prior art, when a merchant receives a new PDF file, the content of the file is generally confirmed by manually opening the PDF file on a computer, which exposes the computer to infection by computer viruses.
Disclosure of Invention
The invention aims to provide a method and a device for identifying malicious codes in a PDF (Portable Document Format) file, a method and a device for saving a PDF file, a computer-readable storage medium, and electronic equipment, which can check a received PDF file, discover computer viruses carried in the PDF file in time, and reduce the probability of the computer being infected when the PDF file is opened on it.
In a first aspect, an embodiment of the present invention provides a method for identifying malicious codes in a PDF file, where the method includes the following steps:
scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
if yes, judging that the received first PDF file contains malicious codes.
Further, the identification method further comprises the following steps:
and sending a prompt signal to an alarm device under the condition that the received first PDF file contains malicious codes.
In a second aspect, an embodiment of the present invention provides a PDF file saving method, where the saving method includes the following steps:
scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
if not, judging that the received first PDF file does not contain malicious codes and saving the first PDF file to a local storage.
Further, after the step of determining that the received first PDF file does not contain malicious code, and before the step of saving the first PDF file to a local storage, the method further includes the following steps:
performing a file data comparison between the received first PDF file and a plurality of second PDF files in a local storage;
judging whether the received first PDF file is a second PDF file already existing in a local storage or not according to the comparison result of the file data;
and if not, executing the step of saving the first PDF file to a local storage.
Further, the storage method further comprises the following steps:
and under the condition that the received first PDF file is judged to be a second PDF file already existing in a local storage, marking the first PDF file as a received PDF file.
Further, the file data includes text content, and the file data comparison includes the following steps:
extracting the text content of the first PDF file;
comparing the text content of the first PDF file with the text content of a plurality of second PDF files in sequence;
and generating a result of the file data comparison according to the result of the text content comparison.
Further, the file data includes file attachment information, the file attachment information includes file creation date information, file size information, and Header information of the PDF file, and the file data comparison includes the following steps:
comparing the file attachment information of the first PDF file with the file attachment information of a plurality of second PDF files in sequence;
and generating a result of the file data comparison according to the result of the file attachment information comparison.
In a third aspect, an embodiment of the present invention provides an apparatus for identifying malicious codes in a PDF file, where the apparatus includes:
the first scanning module is used for scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
the first judging module is used for judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
the first determining module is used for determining that the received first PDF file contains malicious codes under the condition that the file information contains a JavaScript field.
In a fourth aspect, an embodiment of the present invention provides a PDF file saving apparatus, where the saving apparatus includes:
the second scanning module is used for scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
the second judgment module is used for judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
and the storage module is used for judging that the received first PDF file does not contain malicious codes and storing the first PDF file to a local storage under the condition that the file information does not contain a JavaScript field.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method for identifying malicious codes in the PDF file or the method for saving the PDF file.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute the method for identifying malicious codes in a PDF file or the method for saving a PDF file according to any one of the above methods.
Compared with the prior art, the method and the device for identifying malicious codes in a PDF file, the method and the device for saving a PDF file, the computer-readable storage medium, and the electronic device provided by the invention scan the received first PDF file by using the pdfid.py tool of Python and judge whether the resulting file information contains a JavaScript field. Because malicious PDF files usually have JavaScript code nested in them, an abnormal condition of the received first PDF file can be found in time, avoiding the risk of the computer being infected when a user opens the file without first checking it for computer viruses.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is an application environment diagram of a PDF file processing method in one embodiment.
Fig. 2 is a flowchart illustrating a method for identifying malicious codes in a PDF file in one embodiment.
Fig. 3 is a flowchart illustrating a method for identifying malicious codes in a PDF file according to another embodiment.
Fig. 4 is a flowchart illustrating a PDF file saving method in one embodiment.
Fig. 5 is a flowchart illustrating a PDF file saving method in one embodiment.
Fig. 6 is a flowchart illustrating a PDF file saving method in one embodiment.
Fig. 7 is a schematic flowchart of file data comparison in one embodiment.
Fig. 8 is a schematic flowchart of file data comparison in another embodiment.
Fig. 9 is a block diagram illustrating a configuration of a malicious code identification apparatus in a PDF file according to an embodiment.
Fig. 10 is a block diagram showing the configuration of a PDF file saving apparatus according to an embodiment.
Fig. 11 is a block diagram of a computer device in one embodiment.
Fig. 12 is an illustration of an operation interface for scanning a received first PDF file using the pdfid.py tool.
Fig. 13 is an illustration of an operation interface for scanning a received first PDF file using the pdfid.py tool.
Reference numerals:
110. a terminal; 120. a server; 801. a first scanning module; 802. a first judgment module; 803. a first determination module; 811. a second scanning module; 812. a second judgment module; 813. a storage module;
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
To facilitate understanding of the embodiments of the present invention by those skilled in the art, the matters referred to in the embodiments of the present invention are explained.
A basic PDF file contains the following four elements:
Header, which identifies the version number of the PDF file.
Body, which contains the objects that make up the document.
Cross-reference table, which contains the information used to locate the objects in the file.
Trailer, which identifies certain special objects contained in the file and gives the location of the cross-reference table.
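As an illustration outside the patent text, the four elements can be located directly in the raw bytes of a simple file. The sketch below is a minimal example under stated assumptions: the helper name `describe_pdf_layout` and the hand-written sample bytes are hypothetical, and real PDF files (for example those using cross-reference streams or incremental updates) need a full parser.

```python
import re

def describe_pdf_layout(data: bytes) -> dict:
    """Locate the header version, xref keyword, trailer keyword and startxref offset."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "has_xref_table": b"\nxref" in data,
        "has_trailer": b"trailer" in data,
        "xref_offset": int(startxref.group(1)) if startxref else None,
    }

# Hypothetical hand-written sample: header, one body object,
# a cross-reference table and a trailer.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

layout = describe_pdf_layout(SAMPLE)
print(layout)
```

The offset after `startxref` is the byte position at which the cross-reference table begins, which is how a reader finds the table from the end of the file.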
Python is a cross-platform computer programming language. It is a high-level scripting language that combines interpreted, compiled, interactive, and object-oriented capabilities.
pdfid.py is a command-line tool written in Python that scans PDF files for characteristic PDF keywords; in the embodiment of the present invention it is used to identify PDF files containing JavaScript. For a detailed description of the pdfid.py tool, see:
https://www.decalage.info/python/pdfid
Fig. 1 is an application environment diagram of a PDF file processing method in one embodiment. The PDF file processing method comprises a malicious code identification method and a PDF file storage method in the PDF file. The PDF file processing method runs on a PDF file processing system. The PDF file processing system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal 110 or a mobile terminal 110, and the mobile terminal 110 may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server 120 or as a server cluster of multiple servers 120.
Hereinafter, the PDF file processing method provided by the embodiment of the present invention will be described and explained in detail by several specific embodiments.
As shown in fig. 2, in one embodiment, a method of identifying malicious code in a PDF file is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may specifically be the terminal 110 in fig. 1 described above.
Referring to fig. 2, the method for identifying malicious codes in a PDF file specifically includes the following steps:
step S102: the terminal 110 scans the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
step S104: the terminal 110 determines whether the file information contains a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
for example, referring to fig. 12, the first PDF file is scanned by the pdfid.py tool, the command-line output is exported as text, and the content of the text is read by means of text recognition. In this example the first PDF file is recognized as having 7 obj objects and 1 JavaScript object, so it is determined that the first PDF file may contain malicious code.
Step S106: if yes, the terminal 110 determines that the received first PDF file contains malicious codes.
It should be noted that generally, the malicious PDF files are nested with JavaScript codes, so that abnormal situations of the received first PDF file can be found in time, and a risk of computer poisoning caused by opening the file by a user without detecting a computer virus is avoided.
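Steps S102 to S106 can be sketched in Python as follows. This is an illustrative sketch rather than the patent's implementation: the function names, the default script path `pdfid.py`, and the sample report text are assumptions, and the report format (one keyword and one count per line) is modelled on the kind of output described for fig. 12.

```python
import re
import subprocess
import sys

def run_pdfid(pdf_path: str, pdfid_script: str = "pdfid.py") -> str:
    """Step S102: run pdfid.py on a file and return its text report.

    The script location is an assumption; adjust it to the local install.
    """
    result = subprocess.run(
        [sys.executable, pdfid_script, pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def parse_pdfid_report(report: str) -> dict:
    """Turn 'keyword  count' report lines into a {keyword: count} mapping."""
    counts = {}
    for line in report.splitlines():
        match = re.match(r"^\s*(\S.*?)\s+(\d+)(?:\(\d+\))?\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

def contains_javascript(counts: dict) -> bool:
    """Steps S104/S106: flag the file if a JavaScript field is reported."""
    return counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0

# Hand-written sample report, not real tool output.
SAMPLE_REPORT = """\
PDFiD sample.pdf
 PDF Header: %PDF-1.4
 obj                    7
 endobj                 7
 stream                 1
 endstream              1
 xref                   1
 trailer                1
 startxref              1
 /Page                  1
 /JS                    1
 /JavaScript            1
 /OpenAction            1
"""

counts = parse_pdfid_report(SAMPLE_REPORT)
print(counts["obj"], contains_javascript(counts))
```

Parsing the report into a mapping keeps the judgment step independent of how the report was obtained, so the same check works on exported text files.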
It is understood that each PDF file contains the first 7 fields of the scan result, except that stream and endstream may be absent.
xref: the cross-reference table describes the number, generation number (version), and absolute file location of each indirect object. The first entry in the PDF document must start with object number 0 with generation number 65535; the first number following the xref identifier is the number of the first indirect object (i.e., object number 0), and the second number is the size of the cross-reference table.
/Page indicates the number of pages of the PDF file; most malicious PDF files have only one page.
/Encrypt indicates that the PDF file has a digital watermark or is encrypted.
/ObjStm indicates the number of object streams. An object stream is a stream object that may contain other objects.
/JS and /JavaScript indicate that the PDF file contains embedded JavaScript code. Malicious PDF files usually have JavaScript code nested in them, generally either to exploit a vulnerability in JavaScript parsing or to implement a heap spray. Many normal PDF files also contain JavaScript code.
/AA, /OpenAction, and /AcroForm indicate that an action is executed when the PDF file or a page of the PDF file is viewed. Almost all malicious PDF files embedded with JavaScript code have an action that automatically executes the JavaScript code. If a PDF file contains both a key field for automatically executing actions (/AA or /OpenAction) and JavaScript code, the PDF file is very likely to be malicious.
/URI: this key field is needed if an action in the PDF file is to open a web page.
/Filter is generally FlateDecode, which uses the zlib compression and decompression algorithm.
/JBIG2Decode indicates that the PDF file uses JBIG2 compression. Although JBIG2 compression itself may be vulnerable (CVE-2010-1297), the /JBIG2Decode key field alone cannot show whether the PDF file is suspicious.
/RichMedia indicates embedded Flash files.
/Launch indicates the number of launch actions.
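The field notes above suggest that JavaScript combined with an automatic action is a stronger signal than JavaScript alone. The following sketch shows one way to express that; it is an assumption for illustration only, since the claimed method treats any JavaScript field as indicating malicious code, and the function name `rate_pdf` and the three rating strings are hypothetical.

```python
def rate_pdf(counts: dict) -> str:
    """Rate a {keyword: count} mapping using the field notes above."""
    has_js = counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0
    auto_action = counts.get("/AA", 0) > 0 or counts.get("/OpenAction", 0) > 0
    if has_js and auto_action:
        # JavaScript plus an automatic trigger: described as very likely malicious.
        return "very likely malicious"
    if has_js:
        # JavaScript alone: normal files can also contain it.
        return "suspicious"
    return "no JavaScript found"

print(rate_pdf({"/JavaScript": 1, "/OpenAction": 1}))
print(rate_pdf({"/JavaScript": 1}))
print(rate_pdf({"/Page": 3}))
```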
Referring to fig. 3, in an embodiment, the method for identifying malicious codes in a PDF file specifically includes the following steps:
step S202: the terminal 110 scans the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
step S204: the terminal 110 determines whether the file information contains a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
for example, referring to fig. 12, the first PDF file is scanned by the pdfid.py tool, the command-line output is exported as text, and the content of the text is read by means of text recognition. In this example the first PDF file is recognized as having 7 obj objects and 1 JavaScript object, so it is determined that the first PDF file may contain malicious code.
Step S206: if yes, the terminal 110 determines that the received first PDF file contains malicious codes.
It should be noted that generally, the malicious PDF files are nested with JavaScript codes, so that abnormal situations of the received first PDF file can be found in time, and a risk of computer poisoning caused by opening the file by a user without detecting a computer virus is avoided.
Step S208: and when judging that the received first PDF file contains malicious codes, the terminal 110 sends a prompt signal to an alarm device.
It can be understood that the alarm device can be a display, and the corresponding prompt signal is a text reminding message displayed on the display; the alarm device may be the mobile terminal 110 of the operator, and the corresponding prompt signal is a short message pushed to the mobile terminal 110. The terminal 110 can prompt an operator to process in time by sending a prompt signal when the malicious code is found, so that the computer risk is reduced.
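Step S208 can be illustrated with a small dispatcher that pushes the same text prompt to every registered alarm device. This is a sketch under assumptions: the function name `send_prompt` is hypothetical, and two in-memory lists stand in for the display and the operator's mobile terminal.

```python
from typing import Callable, List

def send_prompt(alarms: List[Callable[[str], None]], pdf_name: str) -> str:
    """Step S208: send one text prompt to every registered alarm device."""
    message = (
        f"Warning: {pdf_name} contains a JavaScript field "
        "and may carry malicious code."
    )
    for alarm in alarms:
        alarm(message)
    return message

# Stand-ins for a display and an operator's phone; both are hypothetical.
display_log: List[str] = []
sms_outbox: List[str] = []
send_prompt([display_log.append, sms_outbox.append], "policy.pdf")
print(display_log[0])
```

Treating each alarm device as a callable keeps the identification logic separate from the delivery channel.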
In another embodiment, as shown in fig. 4, a PDF file saving method is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may specifically be the terminal 110 in fig. 1 described above.
Referring to fig. 4, the PDF file saving method specifically includes the following steps:
step S302: the terminal 110 scans the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
step S304: the terminal 110 determines whether the file information contains a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
step S306: if not, the terminal 110 determines that the received first PDF file does not contain malicious codes and stores the first PDF file to a local storage.
It can be understood that generally malicious PDF files are nested with JavaScript codes, and abnormal situations of the received first PDF file can be found in time, so that the risk of computer poisoning caused by opening the file by a user without detecting computer viruses is avoided.
Specifically, referring to fig. 13, a safe PDF file does not include a JavaScript object. The first PDF file is therefore stored only when it is determined to be free of malicious codes, so that the PDF files the user subsequently reads are all safe.
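Steps S304 and S306 can be sketched as a guard in front of the copy into local storage. The function name `save_if_clean`, the directory names, and the keyword-count mapping passed in are assumptions for illustration; the mapping is taken to come from a prior scan of the file.

```python
import shutil
import tempfile
from pathlib import Path

def save_if_clean(pdf_path: Path, counts: dict, storage_dir: Path) -> bool:
    """Steps S304/S306: copy the file into local storage only if no
    JavaScript field was reported for it."""
    if counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0:
        return False
    storage_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(pdf_path, storage_dir / pdf_path.name)
    return True

# Demo in a throwaway directory; all file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    received = Path(tmp) / "policy.pdf"
    received.write_bytes(b"%PDF-1.4\n%%EOF\n")
    storage = Path(tmp) / "local_storage"
    saved_clean = save_if_clean(received, {"obj": 7, "/JavaScript": 0}, storage)
    clean_exists = (storage / "policy.pdf").exists()
    other = Path(tmp) / "other_storage"
    saved_flagged = save_if_clean(received, {"obj": 7, "/JavaScript": 1}, other)
    flagged_dir_created = other.exists()

print(saved_clean, saved_flagged)
```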
Referring to fig. 5, in an embodiment, the PDF file saving method specifically includes the following steps:
step S402: the terminal 110 scans the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
step S404: the terminal 110 determines whether the file information contains a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
step S406: if not, the terminal 110 determines that the received first PDF file does not contain malicious codes;
step S408: the terminal 110 performs a file data comparison between the received first PDF file and a plurality of second PDF files in a local storage;
step S410: the terminal 110 determines whether the received first PDF file is a second PDF file already existing in the local storage according to the result of the file data comparison;
step S412: if not, the terminal 110 saves the first PDF file to a local storage.
It can be understood that, since the user may not be able to read a PDF file immediately upon receiving it, the received first PDF file needs to be saved in the local storage. In this embodiment, the newly received PDF file is compared with the files already stored in the local storage and only new files are saved, which saves local storage resources.
Specifically, the present embodiment provides two examples of comparing file data.
Referring to fig. 7, in one example, the file data includes text content, and the file data comparison includes the following steps:
step S602: the terminal 110 extracts the text content of the first PDF file;
step S604: the terminal 110 compares the text content of the first PDF file with the text content of a plurality of second PDF files in sequence;
step S606: the terminal 110 generates a result of the file data comparison according to the result of the text content comparison.
It can be understood that the first PDF file and the second PDF file are compared in text content, so that whether the two PDF files are consistent in content can be determined, and the determination accuracy is improved.
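The text content comparison can be sketched as below. The patent does not say how the texts are compared, so the whitespace normalization and SHA-256 fingerprint used here are assumptions, as are the function names. Extracting text from a PDF requires a PDF library and is left out; the functions take already-extracted strings.

```python
import hashlib
import re
from typing import Iterable

def text_fingerprint(text: str) -> str:
    """Hash text after collapsing whitespace, so layout-only differences
    (line breaks, repeated spaces) do not count as different content."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def matches_stored_text(first_text: str, stored_texts: Iterable[str]) -> bool:
    """Compare the first file's text with each stored file's text in turn."""
    target = text_fingerprint(first_text)
    return any(text_fingerprint(stored) == target for stored in stored_texts)

stored = ["Service notice 2019", "Rebate policy 2020"]
print(matches_stored_text("Rebate  policy\n2020", stored))
print(matches_stored_text("Rebate policy 2021", stored))
```

Storing the fingerprints of the second PDF files once, instead of recomputing them per comparison, would be a natural refinement for a large local storage.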
Referring to fig. 8, in another example, the file data includes file attachment information, the file attachment information includes file creation date information, file size information, and Header information of the PDF file, and the file data comparison includes the following steps:
step S702: the terminal 110 compares the file attachment information of the first PDF file with the file attachment information of a plurality of second PDF files in sequence;
step S704: the terminal 110 generates a result of the file data comparison according to the result of the file attachment information comparison.
It can be understood that comparing only the file attachment information, without comparing the text content, speeds up the comparison.
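A minimal sketch of the file attachment information comparison follows. It is an assumption-laden illustration: the names `AttachmentInfo`, `read_attachment_info`, and `is_known_file` are hypothetical, and reading the creation date with a regular expression over raw bytes only works when the `/CreationDate` entry is stored uncompressed and unencrypted.

```python
import re
from typing import List, NamedTuple, Optional

class AttachmentInfo(NamedTuple):
    size: int                  # file size in bytes
    header: Optional[bytes]    # e.g. b"%PDF-1.4"
    created: Optional[bytes]   # raw /CreationDate value, if found

def read_attachment_info(data: bytes) -> AttachmentInfo:
    """Collect size, header and creation date from raw PDF bytes."""
    header = re.match(rb"%PDF-\d+\.\d+", data)
    created = re.search(rb"/CreationDate\s*\((.*?)\)", data)
    return AttachmentInfo(
        size=len(data),
        header=header.group(0) if header else None,
        created=created.group(1) if created else None,
    )

def is_known_file(first: bytes, stored: List[bytes]) -> bool:
    """Compare the first file's attachment info with each stored file's in turn."""
    info = read_attachment_info(first)
    return any(read_attachment_info(other) == info for other in stored)

# Hand-written samples: same size and header, different creation dates.
FILE_A = b"%PDF-1.4\n<< /CreationDate (D:20200930120000) >>\n%%EOF\n"
FILE_B = b"%PDF-1.4\n<< /CreationDate (D:20200101120000) >>\n%%EOF\n"

print(is_known_file(FILE_A, [FILE_B]))
print(is_known_file(FILE_A, [FILE_B, FILE_A]))
```

Matching attachment information is faster to check than text content but is a weaker test: two different files can share size, header, and creation date.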
Referring to fig. 6, the PDF file saving method specifically includes the following steps:
step S502: the terminal 110 scans the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
step S504: the terminal 110 determines whether the file information contains a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
step S506: if not, the terminal 110 determines that the received first PDF file does not contain malicious codes;
step S508: the terminal 110 performs a file data comparison between the received first PDF file and a plurality of second PDF files in a local storage;
step S510: the terminal 110 determines whether the received first PDF file is a second PDF file already existing in the local storage according to the result of the file data comparison;
step S512: if yes, the terminal 110 marks the first PDF file as a received PDF file.
It can be understood that, in this embodiment, when it is determined that the newly received first PDF file already exists in the local storage, the first PDF file is marked as a received PDF file, for example by modifying the document name of the first PDF file, so that a user can conveniently find it later.
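Marking by renaming, as mentioned above, can be sketched as follows. The function name `mark_as_received` and the `_received` suffix are assumptions; the patent only says the document name may be modified.

```python
import tempfile
from pathlib import Path

def mark_as_received(pdf_path: Path, tag: str = "_received") -> Path:
    """Step S512: rename the file so its name shows it was already received."""
    marked = pdf_path.with_name(f"{pdf_path.stem}{tag}{pdf_path.suffix}")
    pdf_path.rename(marked)
    return marked

# Demo in a throwaway directory; the file name is made up.
with tempfile.TemporaryDirectory() as tmp:
    duplicate = Path(tmp) / "policy.pdf"
    duplicate.write_bytes(b"%PDF-1.4\n%%EOF\n")
    marked_path = mark_as_received(duplicate)
    marked_name = marked_path.name
    marked_exists = marked_path.exists()
    original_gone = not duplicate.exists()

print(marked_name)
```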
As shown in fig. 9, in one embodiment, there is provided a malicious code identification apparatus in a PDF file, the identification apparatus comprising:
a first scanning module 801, configured to scan the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
a first judging module 802, configured to judge whether the file information includes a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
a first determining module 803, configured to determine that the received first PDF file contains malicious code when the file information contains a JavaScript field.
The embodiment of the malicious code identification device in the PDF file and the embodiment of the malicious code identification method in the PDF file are based on the same inventive concept, and are not repeated herein.
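The division into modules 801 to 803 can be pictured as three small classes wired together. This is only an illustrative sketch: the class and method names are hypothetical, and the scanner is injected as a callable (here a fake one returning fixed counts) so that the example does not depend on a pdfid.py installation.

```python
from typing import Callable, Dict

FileInfo = Dict[str, int]

class ScanningModule:
    """Counterpart of module 801: obtains file information for a PDF."""
    def __init__(self, scanner: Callable[[str], FileInfo]) -> None:
        self._scanner = scanner  # e.g. a wrapper around pdfid.py

    def scan(self, pdf_path: str) -> FileInfo:
        return self._scanner(pdf_path)

class JudgingModule:
    """Counterpart of module 802: checks for a JavaScript field."""
    def has_javascript_field(self, file_info: FileInfo) -> bool:
        return file_info.get("/JS", 0) > 0 or file_info.get("/JavaScript", 0) > 0

class DeterminingModule:
    """Counterpart of module 803: turns the judgment into a verdict."""
    def contains_malicious_code(self, has_javascript_field: bool) -> bool:
        return has_javascript_field

def identify(pdf_path: str, scanning: ScanningModule,
             judging: JudgingModule, determining: DeterminingModule) -> bool:
    file_info = scanning.scan(pdf_path)
    return determining.contains_malicious_code(
        judging.has_javascript_field(file_info)
    )

# Fake scanner returning fixed counts, used only for this demo.
def fake_scanner(path: str) -> FileInfo:
    return {"obj": 7, "/JavaScript": 1}

flagged = identify("received.pdf", ScanningModule(fake_scanner),
                   JudgingModule(), DeterminingModule())
print(flagged)
```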
As shown in fig. 10, in one embodiment, there is provided a PDF file saving device including:
a second scanning module 811, configured to scan the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
a second determining module 812, configured to determine whether the file information includes a JavaScript field; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
the saving module 813 is configured to determine that the received first PDF file does not contain a malicious code and save the first PDF file to a local storage when the file information does not include a JavaScript field.
The embodiment of the PDF file saving device and the embodiment of the PDF file saving method are based on the same inventive concept, and are not described herein again.
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 11, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by a processor, causes the processor to implement the PDF file processing method. The internal memory may also store a computer program, and the computer program, when executed by the processor, may cause the processor to perform the PDF file processing method. Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the inventive arrangements and is not intended to limit the computing devices to which the inventive arrangements may be applied, as a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the malicious code identification apparatus in a PDF file provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 11. The memory of the computer device may store various program modules constituting the malicious code identification apparatus in the PDF file, such as the first scanning module 801, the first judging module 802, and the first determining module 803 shown in fig. 9. The computer program constituted by the program modules causes the processor to execute the steps of the method for identifying malicious codes in the PDF file according to the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 11 may perform, through the first scanning module 801, the step of scanning the received first PDF file by using the pdfid.py tool of Python to obtain file information of the first PDF file. The step of judging whether the file information contains a JavaScript field is executed by the first judging module 802; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file. When the file information includes a JavaScript field, the first determining module 803 determines that the received first PDF file includes a malicious code.
In one embodiment, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the program, performs the steps of the PDF file processing method. The steps of the PDF file processing method herein may be steps in the PDF file processing methods of the above-described respective embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores computer-executable instructions for causing a computer to perform the steps of the above-described PDF file processing method. The steps of the PDF file processing method herein may be steps in the PDF file processing methods of the above-described respective embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.

Claims (10)

1. A method for identifying malicious codes in a PDF file is characterized by comprising the following steps:
scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
if yes, judging that the received first PDF file contains malicious codes.
2. The method for identifying malicious codes in a PDF file according to claim 1, wherein said method further comprises the following steps:
and sending a prompt signal to an alarm device under the condition that the received first PDF file contains malicious codes.
3. A PDF file saving method is characterized by comprising the following steps:
scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
if not, judging that the received first PDF file does not contain malicious codes and saving the first PDF file to a local storage.
4. The PDF file saving method according to claim 3, further comprising the following steps after the step of judging that the received first PDF file does not contain malicious codes and before the step of saving the first PDF file to a local storage:
performing a file data comparison between the received first PDF file and a plurality of second PDF files in the local storage;
judging, according to the result of the file data comparison, whether the received first PDF file is a second PDF file already existing in the local storage;
and if not, executing the step of saving the first PDF file to the local storage.
5. The PDF file saving method according to claim 4, further comprising the steps of:
and under the condition that the received first PDF file is judged to be a second PDF file already existing in a local storage, marking the first PDF file as a received PDF file.
6. The PDF file saving method as claimed in claim 4, wherein the file data includes text content, and the file data comparison comprises the steps of:
extracting the text content of the first PDF file;
comparing the text content of the first PDF file with the text content of a plurality of second PDF files in sequence;
and generating a result of the file data comparison according to the result of the text content comparison.
7. The PDF file saving method according to claim 4, wherein the file data includes file attachment information including file creation date information, file size information and Header information of the PDF file, and the file data comparison comprises the steps of:
comparing the file attachment information of the first PDF file with the file attachment information of a plurality of second PDF files in sequence;
and generating a result of the file data comparison according to the result of the file attachment information comparison.
8. An apparatus for identifying malicious code in a PDF file, the apparatus comprising:
the first scanning module is used for scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
the first judging module is used for judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
the first determination module is used for determining that the received first PDF file contains malicious codes under the condition that the file information contains a JavaScript field.
9. A PDF file saving apparatus, comprising:
the second scanning module is used for scanning the received first PDF file by using a pdfid.py tool of python to obtain file information of the first PDF file; wherein the file information comprises a plurality of fields, and each field is used for indicating the related information of the first PDF file;
the second judgment module is used for judging whether the file information contains a JavaScript field or not; the JavaScript field is used for indicating that JavaScript codes are embedded in the first PDF file;
and the storage module is used for judging that the received first PDF file does not contain malicious codes and storing the first PDF file to a local storage under the condition that the file information does not contain a JavaScript field.
10. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of identifying malicious code in a PDF file according to any one of claims 1 to 2 or a method of saving a PDF file according to any one of claims 3 to 7 when executing the program.
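The file attachment information comparison recited in claim 7 could be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `AttachmentInfo`, `read_attachment_info`, and `is_already_stored` are hypothetical, the Header information is assumed to be the first line of the file (for example `%PDF-1.4`), and the creation date is assumed to be the raw `/CreationDate` string found by a simple byte search, which will not find dates stored in compressed or encrypted objects.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: the creation date appears as an uncompressed literal string.
_CREATION_DATE = re.compile(rb"/CreationDate\s*\(([^)]*)\)")


@dataclass(frozen=True)
class AttachmentInfo:
    creation_date: Optional[bytes]  # raw /CreationDate value, None if not found
    size: int                       # file size in bytes
    header: bytes                   # first line of the file, e.g. b"%PDF-1.4"


def read_attachment_info(path: str) -> AttachmentInfo:
    """Collect creation date, size and header information for one PDF file."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The first line ends at the first CR or LF; an empty file gives b"".
    header = re.split(rb"[\r\n]", data, maxsplit=1)[0]
    match = _CREATION_DATE.search(data)
    creation_date = match.group(1) if match else None
    return AttachmentInfo(creation_date, len(data), header)


def is_already_stored(first: AttachmentInfo,
                      stored: Iterable[AttachmentInfo]) -> bool:
    """Compare the first PDF file with each second PDF file in sequence."""
    return any(first == second for second in stored)
```

In this sketch a received file would be saved only when `is_already_stored` returns false, and would otherwise be marked as a received PDF file. Matching on all three values at once is a design choice of the sketch; a text content comparison as in claim 6 could be added as a stricter check.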
CN202011068435.1A 2020-09-29 2020-09-29 PDF file processing method and device Pending CN112231701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011068435.1A CN112231701A (en) 2020-09-29 2020-09-29 PDF file processing method and device

Publications (1)

Publication Number Publication Date
CN112231701A true CN112231701A (en) 2021-01-15

Family

ID=74119846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011068435.1A Pending CN112231701A (en) 2020-09-29 2020-09-29 PDF file processing method and device

Country Status (1)

Country Link
CN (1) CN112231701A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102651060A (en) * 2012-03-31 2012-08-29 北京奇虎科技有限公司 Method and system for detecting vulnerability
US20130097705A1 (en) * 2011-10-14 2013-04-18 Trustwave Corporation Identification of electronic documents that are likely to contain embedded malware
US20130160127A1 (en) * 2011-12-14 2013-06-20 Korea Internet & Security Agency System and method for detecting malicious code of pdf document type
CN103221960A (en) * 2012-12-10 2013-07-24 华为技术有限公司 Detection method and apparatus of malicious code
CN103310150A (en) * 2012-03-13 2013-09-18 百度在线网络技术(北京)有限公司 Method and device for detecting portable document format (PDF) vulnerability
CN105095756A (en) * 2015-07-06 2015-11-25 北京金山安全软件有限公司 Method and device for detecting portable document format document
US9305170B1 (en) * 2013-03-13 2016-04-05 Symantec Corporation Systems and methods for securely providing information external to documents
CN105868630A (en) * 2016-03-24 2016-08-17 中国科学院信息工程研究所 Malicious PDF document detection method
CN107944273A (en) * 2017-12-14 2018-04-20 贵州航天计量测试技术研究所 A kind of malice PDF document detection method based on TF IDF algorithms and SVDD algorithms
CN108959930A (en) * 2018-07-26 2018-12-07 中国民航大学 Malice PDF detection method, system, data storage device and detection program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
17BDW: ""恶意PDF文档分析记录"", 《HTTPS://WWW.CNBLOGS.COM/17BDW/P/7215527.HTML》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210115