CN110765826A - Method and device for identifying messy codes in Portable Document Format (PDF) - Google Patents

Method and device for identifying messy codes in Portable Document Format (PDF)

Info

Publication number
CN110765826A
Authority
CN
China
Prior art keywords
neural network
network model
pdf document
pdf
messy codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810852497.8A
Other languages
Chinese (zh)
Inventor
邓斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Jinshan Mobile Technology Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Jinshan Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Guangzhou Jinshan Mobile Technology Co Ltd filed Critical Beijing Kingsoft Office Software Inc
Priority to CN201810852497.8A priority Critical patent/CN110765826A/en
Publication of CN110765826A publication Critical patent/CN110765826A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a device for identifying messy codes in a Portable Document Format (PDF) document, wherein the method comprises the following steps: inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes. With the scheme of the embodiment, whether a user's PDF document contains messy codes is detected, which prepares for subsequent messy code repair and improves the user experience.

Description

Method and device for identifying messy codes in Portable Document Format (PDF)
Technical Field
The embodiment of the invention relates to a document processing technology, in particular to a method and a device for identifying messy codes in a Portable Document Format (PDF).
Background
When a Portable Document Format (PDF) document is opened, messy codes (garbled characters) often appear, usually because of missing fonts, incorrect character encodings and the like, which causes users considerable trouble.
Disclosure of Invention
In order to solve this technical problem, embodiments of the present invention provide a method and a device for identifying messy codes in PDF, which can detect whether a user's PDF document contains messy codes, prepare for subsequent messy code repair, and improve the user experience.
To achieve the object of the embodiment of the present invention, an embodiment of the present invention provides a method for identifying messy codes in PDF, where the method may include:
inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes.
Optionally, the method may further include: before the PDF document to be identified is input into a pre-trained neural network model, the PDF document to be identified is opened, and the PDF document to be identified is converted into a picture format.
Optionally, the method may further include: before a PDF document to be identified is input into a pre-trained neural network model, acquiring a PDF document marked with a messy code; and inputting the PDF document marked with the messy codes into an untrained neural network model to train the untrained neural network model, so that the neural network model has the function of recognizing the messy codes.
Optionally, the method may further include: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model, and training the neural network model.
Optionally, the neural network model may include: TensorFlow.
In order to achieve the object of the embodiment of the present invention, an embodiment of the present invention further provides a device for identifying messy codes in PDF, where the device may include: an identification module;
and the identification module is used for inputting the PDF document to be identified into a pre-trained neural network model so as to identify the messy codes in the PDF document to be identified through the neural network model and output an identification result marked with the messy codes.
Optionally, the identification module is further configured to:
before the PDF document to be identified is input into a pre-trained neural network model, the PDF document to be identified is opened, and the PDF document to be identified is converted into a picture format.
Optionally, the apparatus may further include: a training module;
the training module is used for acquiring the PDF document marked with the messy codes before the identification module inputs the PDF document to be identified into the pre-trained neural network model; and inputting the PDF document marked with the messy codes into an untrained neural network model to train the neural network model, so that the neural network model has the function of recognizing the messy codes.
Optionally, the training module is further configured to: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model, and training the neural network model.
Optionally, the neural network model may include: TensorFlow.
The embodiment of the invention can comprise the following steps: inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes. With the scheme of the embodiment, whether a user's PDF document contains messy codes is detected, which prepares for subsequent messy code repair and improves the user experience.
Additional features and advantages of embodiments of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the examples of the application, do not constitute a limitation of the invention.
FIG. 1 is a flow chart of a method for identifying messy codes in PDF according to an embodiment of the present invention;
FIG. 2 is a block diagram of a device for identifying messy codes in PDF according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flow charts of the figures may be performed in a computer system as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one shown here.
To achieve the object of the embodiment of the present invention, an embodiment of the present invention provides a method for identifying messy codes in PDF, as shown in FIG. 1, where the method may include S101:
S101, inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes.
In the embodiment of the invention, messy codes often appear, for various reasons, when a PDF document is opened, which causes considerable inconvenience in people's work. To repair the document, professional staff need to locate the areas containing messy codes or find their cause, which increases their workload.
In the embodiment of the invention, the identification of the PDF messy codes is realized mainly through a trained neural network model, and specifically, a PDF document to be identified can be input into the trained neural network model and an identification result marked with the messy codes is output.
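The source does not specify what an "identification result marked with the messy codes" looks like. One way to picture it is a per-page record, as in the minimal sketch below. The names `PageResult`, `identify_messy_codes` and the `model` callable are hypothetical, and the 0.5 threshold is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type: a rendered page as a flat sequence of pixel intensities.
PageImage = Sequence[float]


@dataclass
class PageResult:
    page_index: int        # zero-based page number in the document
    score: float           # model's estimated probability of messy codes
    has_messy_codes: bool  # True when the score reaches the threshold


def identify_messy_codes(
    pages: Sequence[PageImage],
    model: Callable[[PageImage], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> List[PageResult]:
    """Run an already trained model on each page and mark the flagged pages."""
    results = []
    for index, page in enumerate(pages):
        score = model(page)
        results.append(PageResult(index, score, score >= threshold))
    return results


# Usage with a stand-in model that treats dark pages as suspicious.
def fake_model(page: PageImage) -> float:
    return 1.0 - sum(page) / len(page)


report = identify_messy_codes([[0.9, 0.9], [0.1, 0.2]], fake_model)
flagged = [r.page_index for r in report if r.has_messy_codes]
```

Keeping the score next to the boolean flag would let a later repair step decide how aggressively to act on borderline pages.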
Optionally, the method may further include: before the PDF document to be identified is input into a pre-trained neural network model, the PDF document to be identified is opened, and the PDF document to be identified is converted into a picture format.
In the embodiment of the invention, depending on the neural network model used, the PDF document can be input into the neural network model directly, or the PDF document to be identified can first be converted into a picture format and then input into the neural network model.
In the embodiment of the present invention, a picture may be generated from each page of PDF document, or a picture may be generated from multiple pages of PDF documents, or a picture may be generated from a part of a page of PDF document. Specifically, it may be determined which documents or which part of documents are to be used to generate the picture according to a preset ratio.
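Choosing pages according to a preset ratio could be sketched as follows. The function name `select_pages` is hypothetical, and spacing the chosen pages evenly through the document is an assumption; the source only says that a preset ratio determines which part is used.

```python
import math
from typing import List


def select_pages(num_pages: int, ratio: float) -> List[int]:
    """Pick evenly spaced zero-based page indices covering `ratio` of the pages."""
    if num_pages <= 0:
        return []
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Always pick at least one page, and never more than the document has.
    count = min(num_pages, max(1, math.ceil(num_pages * ratio)))
    # Integer arithmetic keeps the indices exact and strictly increasing.
    return [(i * num_pages) // count for i in range(count)]


pages_to_render = select_pages(num_pages=8, ratio=0.25)
```

Each selected index would then be handed to whatever renderer turns a PDF page into a picture.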
In the embodiment of the invention, when a user opens a PDF document, the system can automatically convert the currently opened page into a picture format and upload the picture format to a trained neural network model in the server, and the server returns whether the PDF document has messy codes or not. If the messy codes are detected, the user can be prompted in a prompt box or the like to try to repair the PDF document.
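The open-document flow described above can be sketched with the renderer and the server call passed in as plain functions. Everything named here (`check_on_open`, `render_page`, `upload_for_check`, the prompt wording) is hypothetical; the source does not name a rendering library or a server interface.

```python
from typing import Callable, Optional

# Hypothetical callables standing in for components the source only describes:
#   render_page(path, page_index) -> image bytes of that page
#   upload_for_check(image_bytes) -> True if the server-side model reports messy codes
RenderPage = Callable[[str, int], bytes]
UploadForCheck = Callable[[bytes], bool]


def check_on_open(
    pdf_path: str,
    current_page: int,
    render_page: RenderPage,
    upload_for_check: UploadForCheck,
) -> Optional[str]:
    """Return a prompt for the user if messy codes are detected, else None."""
    image = render_page(pdf_path, current_page)  # page converted to picture format
    if upload_for_check(image):                  # server returns the verdict
        return "Messy codes were detected in this document. Try to repair it?"
    return None


# Usage with stubs in place of a real renderer and server.
prompt = check_on_open(
    "example.pdf",
    0,
    render_page=lambda path, page: b"fake-image-bytes",
    upload_for_check=lambda image: True,
)
```

Injecting the two callables keeps the flow testable without a PDF library or a network connection.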
Optionally, the method may further include: before a PDF document to be identified is input into a pre-trained neural network model, acquiring a PDF document marked with a messy code; and inputting the PDF document marked with the messy codes into an untrained neural network model to train the untrained neural network model, so that the neural network model has the function of recognizing the messy codes.
In the embodiment of the present invention, in order to identify messy codes in a PDF document through a trained neural network model, the neural network model needs to be trained in advance so that it has the function of identifying messy codes, thereby obtaining the trained neural network model.
In the embodiment of the invention, before training, a large number of PDF documents can be prepared in advance, messy codes in the PDF documents are labeled in advance, and the selected neural network model is trained through the labeled PDF documents.
In the embodiment of the invention, the messy codes in the PDF document for training can be marked out in a manual marking mode.
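To make the idea of training on manually marked data concrete, the sketch below fits a single sigmoid unit by gradient descent. This is a deliberately tiny stand-in, not the network the source has in mind: a real system would use a much larger model, and the two-number "features", the learning rate and the epoch count are all made-up assumptions.

```python
import math
from typing import List, Sequence, Tuple

# One training example: (features of a rendered page, label).
# Label 1 means the page was manually marked as containing messy codes.
Example = Tuple[Sequence[float], int]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(
    examples: Sequence[Example], epochs: int = 500, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Fit a single sigmoid unit by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(weights: List[float], bias: float, features: Sequence[float]) -> bool:
    """True when the trained unit judges the page to contain messy codes."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z) >= 0.5


# Toy data: two made-up features per page, labels assigned by hand.
labelled = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
weights, bias = train(labelled)
```

After training, `predict(weights, bias, features)` plays the role of the trained model for a new page.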
Optionally, the method may further include: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model, and training the neural network model.
In the embodiment of the invention, depending on the neural network model used, the marked PDF document can be input into the neural network model directly to train it, or the marked PDF document can first be converted into a picture format and then used to train the neural network model.
In the embodiment of the present invention, likewise, a picture may be generated from each page of a PDF document, from multiple pages of a PDF document, or from a part of one page. Specifically, which documents or which parts of a document are used to generate the picture may be determined according to a preset ratio. For example, the content of a certain number of pages of a document (e.g., 20-30% of the pages) is converted into pictures, the document is marked as to whether it contains messy codes, and a large amount of such data is fed into the neural network model to train it.
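Assembling such a training set might look like the sketch below, which samples a fixed proportion of pages from each document and attaches the document's manual label. The names `TrainingSample` and `build_samples` are hypothetical, random sampling is an assumption, and the default of 0.25 is simply a value inside the 20-30% range given as an example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingSample:
    document: str          # identifier of the source PDF (hypothetical)
    page_index: int        # zero-based page that would be rendered to a picture
    has_messy_codes: bool  # manual mark for the document


def build_samples(
    page_counts: Dict[str, int],
    labels: Dict[str, bool],
    proportion: float = 0.25,  # assumed value within the 20-30% example range
    seed: int = 0,
) -> List[TrainingSample]:
    """Sample a fixed proportion of pages per document and attach its label."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    samples = []
    for name, n_pages in page_counts.items():
        if n_pages <= 0:
            continue  # nothing to sample from an empty document
        k = min(n_pages, max(1, round(n_pages * proportion)))
        for page in sorted(rng.sample(range(n_pages), k)):
            samples.append(TrainingSample(name, page, labels[name]))
    return samples


samples = build_samples({"a.pdf": 8, "b.pdf": 20}, {"a.pdf": True, "b.pdf": False})
```

Each `TrainingSample` identifies one page to render; the rendered picture together with `has_messy_codes` would form one training example.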
Optionally, the neural network model may include, but is not limited to: TensorFlow.
TensorFlow can be used in many machine deep learning fields, such as speech recognition or image recognition. It makes various improvements on DistBelief, the deep learning infrastructure developed in 2011, and it can run on a wide range of devices, from a single smartphone to thousands of data center servers.
In the embodiment of the invention, the TensorFlow model can be trained with a large number of PDF documents in which messy codes have been marked in advance; once trained, it can identify messy codes in the PDF documents to be identified.
With the scheme of the embodiment of the invention, whether a user's PDF document contains messy codes is detected, which prepares for subsequent messy code repair and improves the user experience.
In order to achieve the purpose of the embodiment of the present invention, a device 1 for identifying messy codes in PDF is further provided in the embodiment of the present invention. It should be noted that any embodiment scheme in the foregoing method embodiment is applicable to the device embodiment and is not described in detail here. As shown in FIG. 2, the device may include: an identification module 11;
the identification module 11 is configured to input the PDF document to be identified into a pre-trained neural network model, so as to identify a messy code in the PDF document to be identified through the neural network model, and output an identification result marked with the messy code.
Optionally, the identification module 11 may be further configured to:
before the PDF document to be identified is input into a pre-trained neural network model, the PDF document to be identified is opened, and the PDF document to be identified is converted into a picture format.
Optionally, the apparatus may further include: a training module 12;
the training module 12 is configured to obtain a PDF document marked with messy codes before the identification module 11 inputs the PDF document to be identified into a pre-trained neural network model; and to input the PDF document marked with the messy codes into an untrained neural network model to train the neural network model, so that the neural network model has the function of identifying the messy codes.
Optionally, training module 12 may also be configured to: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model, and training the neural network model.
Optionally, the neural network model may include: TensorFlow.
The embodiment of the invention can comprise the following steps: inputting the PDF document to be identified into a pre-trained neural network model, identifying the messy codes in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy codes. With the scheme of the embodiment, whether a user's PDF document contains messy codes is detected, which prepares for subsequent messy code repair and improves the user experience.
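The division into an identification module 11 and a training module 12 can be pictured as two small classes. This is only a structural sketch: the class and type names are hypothetical, pictures are reduced to lists of numbers, and `mean_threshold_trainer` is a stub that stands in for real neural network training.

```python
from typing import Callable, List, Sequence, Tuple

Picture = Sequence[float]                  # hypothetical picture representation
Model = Callable[[Picture], bool]          # True means "contains messy codes"
Labelled = Sequence[Tuple[Picture, bool]]  # pictures with manual marks
Trainer = Callable[[Labelled], Model]


class TrainingModule:
    """Counterpart of training module 12: turns marked pictures into a model."""

    def __init__(self, trainer: Trainer) -> None:
        self._trainer = trainer

    def train(self, labelled: Labelled) -> Model:
        return self._trainer(labelled)


class IdentificationModule:
    """Counterpart of identification module 11: applies the trained model."""

    def __init__(self, model: Model) -> None:
        self._model = model

    def identify(self, pictures: Sequence[Picture]) -> List[bool]:
        return [self._model(picture) for picture in pictures]


def mean_threshold_trainer(labelled: Labelled) -> Model:
    """Stub trainer: cut halfway between the mean pixel value of each class."""
    def mean(picture: Picture) -> float:
        return sum(picture) / len(picture)

    marked = [mean(p) for p, flag in labelled if flag]
    clean = [mean(p) for p, flag in labelled if not flag]
    cut = (sum(marked) / len(marked) + sum(clean) / len(clean)) / 2
    return lambda picture: mean(picture) >= cut


model = TrainingModule(mean_threshold_trainer).train(
    [([0.9, 0.8], True), ([0.1, 0.2], False)]
)
flags = IdentificationModule(model).identify([[0.7, 0.9], [0.2, 0.1]])
```

Because the identification module only needs a callable model, the training module stays optional, which matches the way the device is described.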
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method for identifying messy codes in Portable Document Format (PDF) is characterized by comprising the following steps:
inputting a PDF document to be identified into a pre-trained neural network model, identifying a messy code in the PDF document to be identified through the neural network model, and outputting an identification result marked with the messy code.
2. The method according to claim 1, wherein the method further comprises: before the PDF document to be recognized is input into a pre-trained neural network model, opening the PDF document to be recognized, and converting the PDF document to be recognized into a picture format.
3. The method according to claim 1, wherein the method further comprises: before the PDF document to be identified is input into a pre-trained neural network model, acquiring a PDF document marked with messy codes; inputting the PDF document marked with the messy codes into an untrained neural network model to train the untrained neural network model, so that the neural network model has the function of recognizing the messy codes.
4. The method according to claim 3, wherein the method further comprises: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model and train the neural network model.
5. The method according to claim 2 or 4, wherein the neural network model comprises: TensorFlow.
6. An apparatus for identifying messy codes in Portable Document Format (PDF), the apparatus comprising: an identification module;
the identification module is used for inputting the PDF document to be identified into a pre-trained neural network model, so that the messy codes in the PDF document to be identified are identified through the neural network model, and the identification result marked with the messy codes is output.
7. The apparatus according to claim 6, wherein the identifying module is further configured to:
before the PDF document to be recognized is input into a pre-trained neural network model, opening the PDF document to be recognized, and converting the PDF document to be recognized into a picture format.
8. The apparatus according to claim 6, wherein the apparatus further comprises: a training module;
the training module is used for acquiring the PDF document marked with the messy codes before the identification module inputs the PDF document to be identified into a pre-trained neural network model; inputting the PDF document marked with the messy codes into an untrained neural network model to train the neural network model, so that the neural network model has the function of recognizing the messy codes.
9. The apparatus according to claim 8, wherein the training module is further configured to: before the PDF document marked with the messy codes is input into an untrained neural network model, converting the content of the PDF document with a preset number of pages or the content of a preset proportion in one document into a picture so as to input the picture into the neural network model and train the neural network model.
10. The apparatus according to claim 7 or 9, wherein the neural network model comprises: TensorFlow.
CN201810852497.8A 2018-07-27 2018-07-27 Method and device for identifying messy codes in Portable Document Format (PDF) Withdrawn CN110765826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810852497.8A CN110765826A (en) 2018-07-27 2018-07-27 Method and device for identifying messy codes in Portable Document Format (PDF)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810852497.8A CN110765826A (en) 2018-07-27 2018-07-27 Method and device for identifying messy codes in Portable Document Format (PDF)

Publications (1)

Publication Number Publication Date
CN110765826A true CN110765826A (en) 2020-02-07

Family

ID=69328465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810852497.8A Withdrawn CN110765826A (en) 2018-07-27 2018-07-27 Method and device for identifying messy codes in Portable Document Format (PDF)

Country Status (1)

Country Link
CN (1) CN110765826A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104424165A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Messy code detection method and system for text documents
CN104732228A (en) * 2015-04-16 2015-06-24 同方知网数字出版技术股份有限公司 Detection and correction method for messy codes of PDF (portable document format) document
CN108154191A (en) * 2018-01-12 2018-06-12 北京经舆典网络科技有限公司 The recognition methods of file and picture and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158745A (en) * 2021-02-02 2021-07-23 北京惠朗时代科技有限公司 Disorder code document picture identification method and system based on multi-feature operator
CN113158745B (en) * 2021-02-02 2024-04-02 北京惠朗时代科技有限公司 Multi-feature operator-based messy code document picture identification method and system

Similar Documents

Publication Publication Date Title
CN109195007B (en) Video generation method, device, server and computer readable storage medium
US11270099B2 (en) Method and apparatus for generating facial feature
US10796685B2 (en) Method and device for image recognition
CN112669515A (en) Bill image recognition method and device, electronic equipment and storage medium
CN110765740B (en) Full-type text replacement method, system, device and storage medium based on DOM tree
CN110705235B (en) Information input method and device for business handling, storage medium and electronic equipment
CN113010638A (en) Entity recognition model generation method and device and entity extraction method and device
CN112686243A (en) Method and device for intelligently identifying picture characters, computer equipment and storage medium
CN113177435A (en) Test paper analysis method and device, storage medium and electronic equipment
CN110866457A (en) Electronic insurance policy obtaining method and device, computer equipment and storage medium
CN112861864A (en) Topic entry method, topic entry device, electronic device and computer-readable storage medium
CN114359533B (en) Page number identification method based on page text and computer equipment
CN110825874A (en) Chinese text classification method and device and computer readable storage medium
CN114386013A (en) Automatic student status authentication method and device, computer equipment and storage medium
CN110400560B (en) Data processing method and device, storage medium and electronic device
CN110765826A (en) Method and device for identifying messy codes in Portable Document Format (PDF)
CN111881900B (en) Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN117274969A (en) Seal identification method, device, equipment and medium
CN115273057A (en) Text recognition method and device, dictation correction method and device and electronic equipment
CN115311664A (en) Method, device, medium and equipment for identifying text type in image
CN111260757A (en) Image processing method and device and terminal equipment
CN114359931A (en) Express bill identification method and device, computer equipment and storage medium
CN117390681A (en) Image information desensitizing method, device, equipment and storage medium
CN115712887B (en) Picture verification code identification method and device, electronic equipment and storage medium
CN117197830A (en) Signing and receiving order information identification method, device, medium and equipment based on terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200207