CN110765826A - Method and device for identifying messy codes in Portable Document Format (PDF)

Info

Publication number: CN110765826A
Application number: CN201810852497.8A
Authority: CN
Inventors: 邓斌
Original assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd; Guangzhou Jinshan Mobile Technology Co Ltd
Current assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd; Guangzhou Jinshan Mobile Technology Co Ltd
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2020-02-07

Abstract

The embodiment of the invention discloses a method and a device for identifying messy codes in a Portable Document Format (PDF) document. The method comprises: inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document through the neural network model, and outputting an identification result in which the messy codes are marked. The scheme of the embodiment detects whether messy codes are present in a user's PDF document, prepares for subsequent messy code repair, and improves the user experience.

Description

Method and device for identifying messy codes in Portable Document Format (PDF)

Technical Field

The embodiments of the invention relate to document processing technology, and in particular to a method and a device for identifying messy codes in a Portable Document Format (PDF) document.

Background

When a Portable Document Format (PDF) document is opened, messy codes (garbled characters) often appear, usually because of missing fonts, incorrect character encodings and the like, which causes considerable trouble for users.

Disclosure of Invention

In order to solve the above technical problem, embodiments of the present invention provide a method and an apparatus for identifying messy codes in a PDF, which can detect whether messy codes are present in a user's PDF document, prepare for subsequent messy code repair, and improve the user experience.

To achieve the object of the embodiments of the present invention, an embodiment of the present invention provides a method for identifying messy codes in a PDF, where the method may include:

inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes.

Optionally, the method may further include: before the PDF document to be identified is input into the pre-trained neural network model, opening the PDF document to be identified and converting it into a picture format.

Optionally, the method may further include: before the PDF document to be identified is input into the pre-trained neural network model, acquiring PDF documents marked with messy codes; and inputting the PDF documents marked with messy codes into an untrained neural network model to train it, so that the neural network model acquires the ability to identify messy codes.

Optionally, the method may further include: before the PDF document marked with messy codes is input into the untrained neural network model, converting the content of a preset number of pages of the PDF document, or a preset proportion of the content of one document, into pictures, so that the pictures are input into the neural network model to train it.

Optionally, the neural network model may include: TensorFlow.

In order to achieve the object of the embodiments of the present invention, an embodiment of the present invention further provides a device for identifying messy codes in a PDF, where the device may include: an identification module;

and the identification module is used for inputting the PDF document to be identified into a pre-trained neural network model so as to identify the messy codes in the PDF document to be identified through the neural network model and output an identification result marked with the messy codes.

Optionally, the identification module is further configured to:

before the PDF document to be identified is input into the pre-trained neural network model, open the PDF document to be identified and convert it into a picture format.

Optionally, the apparatus may further include: a training module;

the training module is used for acquiring PDF documents marked with messy codes before the identification module inputs the PDF document to be identified into the pre-trained neural network model; and inputting the PDF documents marked with messy codes into an untrained neural network model to train it, so that the neural network model acquires the ability to identify messy codes.

Optionally, the training module is further configured to: before the PDF document marked with messy codes is input into the untrained neural network model, convert the content of a preset number of pages of the PDF document, or a preset proportion of the content of one document, into pictures, so that the pictures are input into the neural network model to train it.

Optionally, the neural network model may include: TensorFlow.

The embodiment of the invention may comprise the following steps: inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes. The scheme of the embodiment detects whether messy codes are present in a user's PDF document, prepares for subsequent messy code repair, and improves the user experience.

Additional features and advantages of embodiments of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification. Together with the embodiments of the application, they serve to explain the invention and do not constitute a limitation of the invention.

FIG. 1 is a flow chart of a method for identifying messy codes in a PDF according to an embodiment of the present invention;

FIG. 2 is a block diagram of an apparatus for identifying messy codes in a PDF according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.

To achieve the object of the embodiments of the present invention, an embodiment of the present invention provides a method for identifying messy codes in a PDF, as shown in FIG. 1, where the method may include S101:

S101, inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes.

At present, messy codes often appear for various reasons when a PDF document is opened, which causes considerable inconvenience in people's work. To restore the document, professional staff must locate the areas that contain messy codes or find their cause, which increases their workload.

In the embodiment of the invention, messy codes in a PDF are identified mainly by a trained neural network model. Specifically, the PDF document to be identified can be input into the trained neural network model, which outputs an identification result marked with the messy codes.

In the embodiment of the invention, depending on the neural network model used, the PDF document to be identified can be input into the neural network model directly, or it can first be converted into a picture format and then input into the neural network model.

In the embodiment of the present invention, one picture may be generated from each page of the PDF document, from multiple pages of the PDF document, or from part of a page of the PDF document. Specifically, which documents, or which parts of a document, are used to generate pictures may be determined according to a preset ratio.
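As a non-limiting illustration, choosing pages according to a preset ratio could be sketched as follows. The function name `select_pages`, the even spacing of the chosen pages, and the rule of always keeping at least one page are assumptions made for this sketch; the embodiment itself only states that a preset ratio is used.

```python
def select_pages(page_count: int, ratio: float) -> list[int]:
    """Return evenly spaced 0-based page indices covering about `ratio` of a document.

    Hypothetical helper: the even spacing is an assumption of this sketch.
    """
    if page_count <= 0:
        return []
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")
    # Keep at least one page so that short documents are still checked.
    wanted = max(1, round(page_count * ratio))
    step = page_count / wanted
    # step >= 1, so the truncated positions are distinct and in range.
    return [int(i * step) for i in range(wanted)]


pages_to_render = select_pages(page_count=10, ratio=0.3)
```

Each selected index would then be handed to whatever renderer turns a page into a picture; that renderer is outside the scope of this sketch.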

In the embodiment of the invention, when a user opens a PDF document, the system can automatically convert the currently opened page into a picture format and upload the picture to the trained neural network model on a server, and the server returns whether the PDF document contains messy codes. If messy codes are detected, the user can be prompted, for example through a prompt box, to try to repair the PDF document.
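The open, convert, upload and prompt flow described above might be sketched as below. All names here (`check_open_page`, `CheckResult`, `render_page`, `classify_remote`) are hypothetical, and the renderer and the server call are passed in as plain callables so that the sketch does not assume any particular PDF library or network API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    has_messy_codes: bool
    message: str  # text for a prompt box; empty when nothing was detected


def check_open_page(
    render_page: Callable[[str, int], bytes],
    classify_remote: Callable[[bytes], bool],
    pdf_path: str,
    page_index: int,
) -> CheckResult:
    # Step 1: convert the currently opened page into a picture.
    picture = render_page(pdf_path, page_index)
    # Step 2: send the picture to the trained model on the server.
    has_messy_codes = classify_remote(picture)
    # Step 3: prepare a prompt suggesting repair if messy codes were found.
    if has_messy_codes:
        return CheckResult(True, "Messy codes detected. Try repairing this document?")
    return CheckResult(False, "")


# Stand-ins for a real renderer and a real server, for demonstration only.
def fake_render(path: str, page: int) -> bytes:
    return f"{path}#{page}".encode()


def fake_classify(picture: bytes) -> bool:
    return picture.endswith(b"#2")


result = check_open_page(fake_render, fake_classify, "report.pdf", 2)
```

Passing the two steps in as callables keeps the control flow testable without a server; a real client would substitute its own rendering and upload code.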

In the embodiment of the present invention, in order to identify messy codes in a PDF document through a trained neural network model, the neural network model needs to be trained in advance so that it acquires the ability to identify messy codes, thereby obtaining the trained neural network model.

In the embodiment of the invention, before training, a large number of PDF documents can be prepared, the messy codes in them can be labeled in advance, and the selected neural network model can be trained with the labeled PDF documents.

In the embodiment of the invention, the messy codes in the PDF documents used for training can be marked manually.

In the embodiment of the invention, depending on the neural network model used, the marked PDF documents can be input into the neural network model directly to train it, or they can first be converted into a picture format and then used to train the neural network model.

In the embodiment of the present invention, likewise, one picture may be generated from each page of a PDF document, from multiple pages of a PDF document, or from part of a page of a PDF document. Specifically, which documents, or which parts of a document, are used to generate pictures may be determined according to a preset ratio. For example, the content of a certain number of pages of a document (e.g., 20-30% of the pages) is converted into pictures, each picture is marked as containing messy codes or not, and a large amount of such data is fed into the neural network model to be trained.
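One possible way to assemble such labeled training data is sketched below. The record type `LabeledPage`, the function `build_training_samples`, the input layout (a mapping from document name to manually assigned per-page labels), and the use of seeded random sampling are all assumptions of this sketch; the default ratio of 0.25 is simply a value inside the 20-30% range given as an example above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPage:
    document: str
    page_index: int
    has_messy_codes: bool  # manual label for this page


def build_training_samples(
    page_labels: dict[str, list[bool]],
    ratio: float = 0.25,
    seed: int = 0,
) -> list[LabeledPage]:
    """Sample about `ratio` of the pages of every document, keeping their labels."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    samples: list[LabeledPage] = []
    for document, labels in sorted(page_labels.items()):
        if not labels:
            continue
        count = max(1, round(len(labels) * ratio))
        for page_index in sorted(rng.sample(range(len(labels)), count)):
            samples.append(LabeledPage(document, page_index, labels[page_index]))
    return samples


manual_labels = {
    "clean.pdf": [False] * 8,
    "broken.pdf": [True] * 4,
}
training_samples = build_training_samples(manual_labels)
```

In a full pipeline each `LabeledPage` would be rendered to a picture before being fed to the model; rendering is left out here.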

Optionally, the neural network model may include, but is not limited to: TensorFlow.

TensorFlow can be used in many machine learning and deep learning fields such as speech recognition and image recognition. It makes various improvements on DistBelief, a deep learning infrastructure developed in 2011, and can run on a wide range of devices, from a single smartphone to thousands of data center servers.

In the embodiment of the invention, a TensorFlow model can be trained with a large number of PDF documents marked with messy codes in advance, and after training the model can identify the messy codes in a PDF document to be identified.
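The train-then-identify cycle can be illustrated with the smallest possible network, a single sigmoid unit trained by gradient descent in plain Python. This is a deliberately simplified stand-in and not a TensorFlow model: a practical system would more likely train an image classifier on page pictures. The two numeric features per page and all toy values below are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: list[float], bias: float, features: list[float]) -> float:
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, labels, epochs=200, learning_rate=0.5):
    """Fit one sigmoid unit with full-batch gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy, hand-made feature vectors standing in for page pictures.
# Label 1 means "contains messy codes", label 0 means "clean".
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = [1, 1, 0, 0]
weights, bias = train(train_x, train_y)

# After training, the unit is applied to pages it has not seen.
unseen_garbled = predict_proba(weights, bias, [0.85, 0.15]) > 0.5
unseen_clean = predict_proba(weights, bias, [0.15, 0.85]) > 0.5
```

The same two phases, fitting on marked examples and then scoring new input, carry over unchanged when the single unit is replaced by a deeper model.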

The scheme of the embodiment of the invention detects whether messy codes are present in a user's PDF document, prepares for subsequent messy code repair, and improves the user experience.

In order to achieve the object of the embodiments of the present invention, an embodiment of the present invention further provides an apparatus 1 for identifying messy codes in a PDF. It should be noted that any scheme in the foregoing method embodiments is applicable to the apparatus embodiment and is not described again here. As shown in FIG. 2, the apparatus may include: an identification module 11;

the identification module 11 is configured to input the PDF document to be identified into a pre-trained neural network model, so as to identify a messy code in the PDF document to be identified through the neural network model, and output an identification result marked with the messy code.

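Optionally, the identification module 11 may be further configured to: before the PDF document to be identified is input into the pre-trained neural network model, open the PDF document to be identified and convert it into a picture format.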

Optionally, the apparatus may further include: a training module 12;

the training module 12 is configured to acquire PDF documents marked with messy codes before the identification module 11 inputs the PDF document to be identified into the pre-trained neural network model; and to input the PDF documents marked with messy codes into an untrained neural network model to train it, so that the neural network model acquires the ability to identify messy codes.

Optionally, the training module 12 may also be configured to: before the PDF document marked with messy codes is input into the untrained neural network model, convert the content of a preset number of pages of the PDF document, or a preset proportion of the content of one document, into pictures, so that the pictures are input into the neural network model to train it.

Optionally, the neural network model may include: TensorFlow.
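As a non-limiting illustration of how the two modules relate, the apparatus could be laid out as two small classes. The class and method names, the `Picture` and `Model` type aliases, and the memorizing stand-in for a learning procedure are hypothetical and serve only to show the data flow from the training module 12 to the identification module 11.

```python
from typing import Callable, Sequence

Picture = bytes
Model = Callable[[Picture], bool]  # returns True when a picture shows messy codes


class TrainingModule:
    """Corresponds to training module 12: turns marked pictures into a model."""

    def __init__(self, fit: Callable[[Sequence[Picture], Sequence[bool]], Model]):
        self._fit = fit

    def train(self, pictures: Sequence[Picture], labels: Sequence[bool]) -> Model:
        if len(pictures) != len(labels):
            raise ValueError("every picture needs exactly one label")
        return self._fit(pictures, labels)


class IdentificationModule:
    """Corresponds to identification module 11: converts a PDF and applies the model."""

    def __init__(self, model: Model, to_pictures: Callable[[str], Sequence[Picture]]):
        self._model = model
        self._to_pictures = to_pictures

    def identify(self, pdf_path: str) -> list[int]:
        # The result marks which pictures (for example, pages) contain messy codes.
        pictures = self._to_pictures(pdf_path)
        return [i for i, picture in enumerate(pictures) if self._model(picture)]


# Stand-in learner that merely memorizes the marked pictures (not a neural network).
def memorizing_fit(pictures, labels):
    flagged = {p for p, y in zip(pictures, labels) if y}
    return lambda picture: picture in flagged


model = TrainingModule(memorizing_fit).train([b"ok", b"??"], [False, True])
module = IdentificationModule(model, lambda path: [b"ok", b"??", b"ok"])
flagged_pages = module.identify("sample.pdf")
```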

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, and the functional modules/units of the systems and devices, disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media, as is known to those skilled in the art.

Claims

1. A method for identifying messy codes in Portable Document Format (PDF), characterized by comprising the following steps:

inputting a PDF document to be identified into a pre-trained neural network model, identifying a messy code in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy code.

2. The method according to claim 1, wherein the method further comprises: before the PDF document to be recognized is input into a pre-trained neural network model, opening the PDF document to be recognized, and converting the PDF document to be recognized into a picture format.

3. The method according to claim 1, wherein the method further comprises: before the PDF document to be identified is input into a pre-trained neural network model, acquiring a PDF document marked with messy codes; inputting the PDF document marked with the messy codes into an untrained neural network model to train the untrained neural network model, so that the neural network model has the function of recognizing the messy codes.

4. The method according to claim 3, wherein the method further comprises: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model and train the neural network model.

5. The method according to claim 2 or 4, wherein the neural network model comprises: TensorFlow.

6. An apparatus for identifying messy codes in Portable Document Format (PDF), the apparatus comprising: an identification module;

the identification module is used for inputting the PDF document to be identified into a pre-trained neural network model, so that the messy codes in the PDF document to be identified are identified through the neural network model, and the identification result marked with the messy codes is output.

7. The apparatus according to claim 6, wherein the identifying module is further configured to:

before the PDF document to be recognized is input into a pre-trained neural network model, opening the PDF document to be recognized, and converting the PDF document to be recognized into a picture format.

8. The apparatus according to claim 6, wherein the apparatus further comprises: a training module;

the training module is used for acquiring the PDF document marked with the messy codes before the identification module inputs the PDF document to be identified into a pre-trained neural network model; inputting the PDF document marked with the messy codes into an untrained neural network model to train the neural network model, so that the neural network model has the function of recognizing the messy codes.

9. The apparatus according to claim 8, wherein the training module is further configured to: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model and train the neural network model.

10. The apparatus according to claim 7 or 9, wherein the neural network model comprises: TensorFlow.