CN112685994B - Double-layer PDF file style formatting output method, device, equipment and medium

Info

Publication number: CN112685994B
Application number: CN202011421689.7A
Authority: CN
Inventors: 黄敬林; 庄莉; 梁懿; 林振天; 池少宁
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Siji Location Service Co ltd; State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2023-02-21
Anticipated expiration: 2040-12-08
Also published as: CN112685994A

Abstract

Embodiments of the invention provide a method, a device, equipment and a medium for style-formatted output of a double-layer PDF file. A double-layer PDF file is parsed with the Java-based pdfbox toolkit; the parsed text labels are analyzed page by page and their format information is extracted; the PDF content is then output in a standard paragraph format. This achieves effective format processing of the text content, provides a useful tool for data preprocessing in the intelligent analysis of official documents, greatly reduces content-processing time, and reduces manual intervention.

Description

Double-layer PDF file style formatting output method, device, equipment and medium

Technical Field

The invention relates to the technical field of PDF file processing, and in particular to a method, a device, equipment and a medium for style-formatted output of a double-layer PDF file.

Background

In a collaborative office system, most incoming paper documents are converted by scanning and OCR recognition into double-layer PDF files for circulation. The text content of such a file is only roughly restored text and carries no specific information such as the paragraph format of the document, which causes many problems for the later analysis of official documents. The traditional method extracts the text from the PDF and reconstructs the basic structure of the document from punctuation marks, the semantic information of the text and similar cues. It is difficult to restore the format information of the original document in this way, which in turn makes style-based extraction of document metadata difficult.

Existing style formatting of double-layer PDF files mainly stays at the level of semantic analysis of the extracted content. It is difficult to restore the specific style content that the PDF file originally displayed, for example format information such as the document number, the title and the paragraph content, and this is a major obstacle to style-based content extraction from PDF documents.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method, a device, equipment and a medium for style-formatted output of a double-layer PDF file, in which a double-layer PDF document is parsed with a Java-based PDF toolkit so that the PDF document content is output in a standard paragraph format.

In a first aspect, the present invention provides a double-layer PDF file style formatting output method, including:

step a, loading a double-layer PDF file and triggering parsing of its content to obtain text labels;

step b, analyzing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;

step c, caching the horizontal and vertical coordinates of all text labels of the current page;

step d, acquiring and caching bold text to obtain a bold cache queue;

step e, at the end of each page, sorting the text labels by the horizontal and vertical coordinates of each text label;

step f, marking the bold text content in the text labels as bold according to the bold cache queue;

step g, caching all text labels of the current page in order, and calling a label output method to output the content, obtaining the content of the current page;

step h, removing special information from the content of the current page to obtain the text of the current page;

step i, extracting from the text of the current page the characters whose font size lies in a first font size interval as issuing unit information, and then outputting the issuing unit information;

step j, extracting document number information from the text of the current page by a regular expression, and then outputting the document number information;

step k, extracting from the text of the current page the characters whose font size lies in a second font size interval as title information, and then outputting the title information;

step l, when a portrait page changes to a landscape page, when a new line is indented, or when the line-end coordinate falls short of the actual coordinate, marking the end of the paragraph and starting a new paragraph;

step m, extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;

step n, judging whether the document has ended; if it has not, setting the page end flag to "no" and returning to step b; if it has, executing step o;

step o, outputting the total number of characters and the total number of pages of the double-layer PDF file.
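The coordinate caching and sorting of steps c and e can be illustrated with a minimal sketch. The patent works with the Java pdfbox toolkit; the Python below is only an illustration of the ordering logic, and the `TextLabel` class, the `line_tolerance` parameter and the assumption that the vertical coordinate grows downward from the top of the page are hypothetical choices, not details given in the source.

```python
from dataclasses import dataclass


@dataclass
class TextLabel:
    text: str
    x: float  # horizontal coordinate of the label's left edge
    y: float  # vertical coordinate, assumed here to grow downward


def sort_labels(labels, line_tolerance=2.0):
    """Group labels into lines by y, then order each line by x."""
    lines = []
    for label in sorted(labels, key=lambda item: item.y):
        # Labels whose y is within line_tolerance of the line's first
        # label are treated as belonging to the same line (assumption).
        if lines and abs(label.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(label)
        else:
            lines.append([label])
    return [sorted(line, key=lambda item: item.x) for line in lines]


page = [
    TextLabel("world", 120.0, 50.5),
    TextLabel("Second line", 72.0, 70.0),
    TextLabel("Hello", 72.0, 50.0),
]
for line in sort_labels(page):
    print(" ".join(label.text for label in line))
```

A tolerance is used because OCR text blocks on one visual line rarely share exactly the same vertical coordinate; a real implementation would likely tune it to the font size.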

Further, in step h, the special information includes page numbers, special identifiers, pure numbers, content after a blank page, and landscape page content.
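As a rough illustration of the removal of special information, the sketch below filters two of the listed categories, page numbers and pure numbers, from a list of text lines. The source does not define how these are recognized, so both regular expressions and the function name `remove_special_info` are assumptions; special identifiers, content after a blank page and landscape page content are not handled here.

```python
import re

# Hypothetical patterns; the patent does not specify them.
PAGE_NUMBER = re.compile(r"^\s*-\s*\d+\s*-\s*$")  # lines such as "- 3 -"
PURE_NUMBER = re.compile(r"^\s*\d+\s*$")          # lines that are only digits


def remove_special_info(lines):
    """Return the lines that are neither page numbers nor pure numbers."""
    return [
        line
        for line in lines
        if not PAGE_NUMBER.match(line) and not PURE_NUMBER.match(line)
    ]


page_content = ["Notice on safety", "- 3 -", "2020", "Item 12 applies"]
print(remove_special_info(page_content))
```

Lines that merely contain digits, such as "Item 12 applies", are kept because both patterns are anchored to the whole line.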

Further, the first font size interval comprises the intervals 35-55 and 20-30.

Further, the second font size interval comprises the intervals 19-25 and 12-14.5.
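The font-size test used in steps i and k can be sketched as follows. The interval values are those stated above; everything else is an assumption made for illustration: the bounds are treated as inclusive, the unit of the font size is left unspecified as in the source, and because the ranges 20-30 and 19-25 overlap, this sketch checks the first interval before the second, mirroring the order of steps i and k.

```python
# Intervals as stated in the text; inclusive bounds are an assumption.
FIRST_INTERVAL = [(35.0, 55.0), (20.0, 30.0)]   # issuing unit (step i)
SECOND_INTERVAL = [(19.0, 25.0), (12.0, 14.5)]  # title (step k)


def in_any(font_size, intervals):
    """True if font_size lies in at least one (low, high) range."""
    return any(low <= font_size <= high for low, high in intervals)


def classify(font_size):
    """Classify text by font size; the first interval takes priority."""
    if in_any(font_size, FIRST_INTERVAL):
        return "issuing_unit"
    if in_any(font_size, SECOND_INTERVAL):
        return "title"
    return "body"


for size in (40.0, 22.0, 19.5, 13.0, 16.0):
    print(size, classify(size))
```

How the overlap between 20 and 25 is actually resolved is not stated in the source; a real implementation might, for example, exclude text already claimed as the issuing unit before looking for the title.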

In a second aspect, the present invention provides a double-layer PDF file style formatting output device, including: a loading and parsing module, a page start module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text marking module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;

the loading and parsing module is used for loading the double-layer PDF file and triggering parsing of its content to obtain text labels;

the page start module is used for analyzing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;

the coordinate caching module is used for caching the horizontal and vertical coordinates of all text labels of the current page;

the bold content caching module is used for acquiring and caching bold text to obtain a bold cache queue;

the rearrangement module is used for sorting the text labels by the horizontal and vertical coordinates of each text label at the end of each page;

the bold text marking module is used for marking the bold text content in the text labels as bold according to the bold cache queue;

the label output module is used for caching all text labels of the current page in order and calling a label output method to output the content, obtaining the content of the current page;

the special information removal module is used for removing the special information from the content of the current page to obtain the text of the current page;

the issuing unit extraction module is used for extracting from the text of the current page the characters whose font size lies in a first font size interval as issuing unit information, and then outputting the issuing unit information;

the document number extraction module is used for extracting document number information from the text of the current page by a regular expression, and then outputting the document number information;

the title extraction module is used for extracting from the text of the current page the characters whose font size lies in a second font size interval as title information, and then outputting the title information;

the paragraph judgment module is used for marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line is indented, or when the line-end coordinate falls short of the actual coordinate;

the paragraph extraction module is used for extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;

the document end judgment module is used for judging whether the document has ended; if it has not, the page end flag is set to "no" and control returns to the page start module; if it has, the result output module is executed;

and the result output module is used for outputting the total number of characters and the total number of pages of the double-layer PDF file.

Further, the special information includes page numbers, special identifiers, pure numbers, content after a blank page, and landscape page content.

Further, the second font size interval comprises the intervals 19-25 and 12-14.5.

In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.

In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.

One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:

a double-layer PDF document is parsed with the Java-based pdfbox toolkit, the parsed text labels are analyzed page by page and their format information is extracted, and the PDF document content is then output in a standard paragraph format. This achieves effective format processing of the text content, provides a useful tool for data preprocessing in the intelligent analysis of official documents, greatly reduces content-processing time, and reduces manual intervention.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

The invention is further described below through embodiments, with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of the system of the present invention;

FIG. 2 is a flow chart of the method according to the first embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the device according to the second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the electronic device according to the third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of the medium according to the fourth embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present description without inventive step, shall fall within the scope of protection of the present application.

Example one

This embodiment provides a double-layer PDF file style formatting output method, as shown in FIG. 1, including:

step a, loading a double-layer PDF file: loading a double-layer PDF file and triggering parsing of its content to obtain text labels;

the PDF file passed in through a hyperlink can be loaded with the pdfbox toolkit, which triggers parsing of the document content; other existing PDF tools can also be used to parse the document content;

step b, starting the page: analyzing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;

step c, caching output content: caching the horizontal and vertical coordinates of all text labels of the current page;

the parsed PDF file usually consists of text blocks, and each text block is one text label;

step d, caching bold content: acquiring and caching bold text to obtain a bold cache queue;

to preserve the bold text information of the double-layer PDF file, each text is checked for bold; if it is bold, its content is held in a cache until the bold content has been read completely and is then placed in the bold cache queue, after which the end of the page is triggered;

step e, rearranging labels: at the end of each page, sorting the text labels by the horizontal and vertical coordinates of each text label; the vertical coordinate determines the line order of the text, and the horizontal coordinate determines the order of the text content within a line;

step f, marking bold content: marking the bold text content in the text labels as bold according to the bold cache queue;

step g, outputting the page content: caching all text labels of the current page in order, and calling a label output method to output the content, obtaining the content of the current page;

the label output method outputs the text labels as XML data;

step h, removing special information: removing special information from the content of the current page to obtain the text of the current page;

in a possible implementation, the special information includes page numbers, special identifiers, pure numbers, content after a blank page, landscape page content and the like;

step i, extracting the issuing unit: extracting from the text of the current page the characters whose font size lies in a first font size interval as issuing unit information, and then outputting the issuing unit information;

in a possible implementation, the first font size interval comprises the intervals 35-55 and 20-30, and text whose font size lies in either of the two intervals is extracted as the issuing unit information;

step j, extracting the document number: extracting document number information from the text of the current page by a regular expression, and then outputting the document number information;

step k, extracting the title: extracting from the text of the current page the characters whose font size lies in a second font size interval as title information, and then outputting the title information;

in a possible implementation, the second font size interval comprises the intervals 19-25 and 12-14.5, and text whose font size lies in either of the two intervals is extracted as the title information;

step l, outputting paragraph labels: when a portrait page changes to a landscape page, when a new line is indented, or when the line-end coordinate falls short of the actual coordinate, marking the end of the paragraph and starting a new paragraph;

step m, extracting paragraph content: extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;

step n, judging whether the document has ended: if the document has not ended, setting the page end flag to "no", returning to step b, emptying the cache, and parsing the next page of the document; if the document has ended, executing step o;

step o, outputting the character count and page count of the document: outputting the total number of characters and the total number of pages of the double-layer PDF file.
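The paragraph handling of steps l and m can be sketched roughly as below. This is an illustration under stated assumptions rather than the patented implementation: the `Line` class, the `left_margin` and `right_margin` parameters and the `indent` and `slack` thresholds are hypothetical, "the line-end coordinate falls short of the actual coordinate" is interpreted as a line ending before the right margin, and the portrait-to-landscape trigger is omitted.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x_start: float  # horizontal coordinate where the line begins
    x_end: float    # horizontal coordinate where the line ends


def split_paragraphs(lines, left_margin, right_margin, indent=10.0, slack=10.0):
    """Split sorted lines into paragraphs by indentation and short line ends."""
    paragraphs, current = [], []
    for line in lines:
        # An indented line starts a new paragraph.
        if current and line.x_start - left_margin >= indent:
            paragraphs.append("".join(current))
            current = []
        current.append(line.text)
        # A line that stops short of the right margin ends the paragraph.
        if right_margin - line.x_end >= slack:
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


lines = [
    Line("AAAA", 92.0, 500.0),  # indented first line, full width
    Line("BB", 72.0, 300.0),    # short last line of paragraph 1
    Line("CCCC", 92.0, 500.0),  # indented first line of paragraph 2
    Line("DDDD", 72.0, 500.0),
]
paragraphs = split_paragraphs(lines, left_margin=72.0, right_margin=500.0)
total_characters = sum(len(p) for p in paragraphs)
print(paragraphs, total_characters)
```

Lines are joined without a separator because the source documents are Chinese official documents; text in a space-delimited language would need a different join.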

The output content is expressed in XML and is divided into an issuing unit <ansurance unit>, a document number <wh>, a title <subject>, paragraphs <paragraph>, a total character count <word number>, a total page count <pageCount>, and so on. The output of the double-layer PDF style formatting of this embodiment is shown in FIG. 2 and includes: the issuing unit, the title, paragraph information 1, paragraph information 2, paragraph information 3, ..., paragraph information n, the total number of characters, and the total number of pages.
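A minimal sketch of such an XML output, using Python's standard `xml.etree.ElementTree`, is given below. The element names `wh`, `subject`, `paragraph` and `pageCount` follow the text above; the root element `document` and the names `issuingUnit` and `wordCount` are assumptions, since the corresponding names are not clearly legible in the source. The function `build_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_xml(issuing_unit, doc_number, title, paragraphs, page_count):
    """Serialize extracted metadata and paragraphs as an XML string."""
    root = ET.Element("document")  # hypothetical root element
    ET.SubElement(root, "issuingUnit").text = issuing_unit  # assumed name
    ET.SubElement(root, "wh").text = doc_number
    ET.SubElement(root, "subject").text = title
    for paragraph in paragraphs:
        ET.SubElement(root, "paragraph").text = paragraph
    # Total character count over all paragraphs (assumed element name).
    total = sum(len(p) for p in paragraphs)
    ET.SubElement(root, "wordCount").text = str(total)
    ET.SubElement(root, "pageCount").text = str(page_count)
    return ET.tostring(root, encoding="unicode")


print(build_xml("Unit A", "No. 12", "Notice", ["abc", "de"], 2))
```

Building the tree with `SubElement` rather than string concatenation leaves escaping of characters such as `<` and `&` in the extracted text to the library.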

In this embodiment, the text content of existing official documents is extracted from the perspective of the visual style of the document, using official-document fields, positions and semantic information. The document content is converted into formatted XML, so that PDF official-document metadata, paragraph information, page count and similar information are presented in XML format, which provides good material for intelligent applications of the documents.
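As one illustration of field-based extraction, the document number of step j might be matched as sketched below. The patent does not disclose its regular expression; the pattern here is an assumption based on the common form of Chinese official document numbers (an agency code, a four-digit year in brackets, a serial number and the character 号), and the function name `extract_document_number` is hypothetical.

```python
import re

# Assumed pattern: agency code, bracketed 4-digit year, serial number, "号".
DOC_NUMBER = re.compile(
    r"[\u4e00-\u9fff]{1,10}[〔\[（(]\d{4}[〕\]）)]\d+号"
)


def extract_document_number(text):
    """Return the first document number found in text, or None."""
    match = DOC_NUMBER.search(text)
    return match.group(0) if match else None


sample = "国务院办公厅文件 国办发〔2020〕12号 关于有关事项的通知"
print(extract_document_number(sample))
```

OCR output often substitutes square or round brackets for the hexagonal brackets, which is why the sketch accepts several bracket styles.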

Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, and the detailed description is given in the second embodiment.

Example two

This embodiment provides a double-layer PDF file style formatting output device, as shown in FIG. 3, including: a loading and parsing module, a page start module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text marking module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;

the loading and parsing module is used for loading a double-layer PDF file and triggering parsing of its content to obtain text labels;

the coordinate caching module is used for caching the horizontal and vertical coordinates of all text labels of the current page;

the label output module is used for caching all text labels of the current page in order and calling a label output method to output the content, obtaining the content of the current page;

the document number extraction module is used for extracting document number information from the text of the current page by a regular expression, and then outputting the document number information;

the title extraction module is used for extracting from the text of the current page the characters whose font size lies in a second font size interval as title information, and then outputting the title information;

In one possible implementation, the special information includes page numbers, special identifiers, pure numbers, content after a blank page, and landscape page content.

In one possible implementation, the first font size interval comprises the intervals 35-55 and 20-30.

In one possible implementation, the second font size interval comprises the intervals 19-25 and 12-14.5.

Since the device described in the second embodiment of the present invention is the device used to implement the method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, so the details are not repeated here. All devices used to implement the method of the first embodiment fall within the protection scope of the present invention.

Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.

Example three

This embodiment provides an electronic device, as shown in FIG. 4, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, any implementation of the first embodiment can be realized.

Since the electronic device described in this embodiment is a device used for implementing the method in the first embodiment of the present application, based on the method described in the first embodiment of the present application, a specific implementation of the electronic device in this embodiment and various variations thereof can be understood by those skilled in the art, and therefore, how to implement the method in the first embodiment of the present application by the electronic device is not described in detail herein. The equipment used by those skilled in the art to implement the methods in the embodiments of the present application is within the scope of the present application.

Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, which is described in detail in the fourth embodiment.

Example four

This embodiment provides a computer-readable storage medium, as shown in FIG. 5, on which a computer program is stored; when the computer program is executed by a processor, any implementation of the first embodiment can be realized.

The embodiments of this specification parse a double-layer PDF document with the Java-based pdfbox toolkit, analyze the parsed text labels page by page and extract their format information, and then output the PDF document content in a standard paragraph format. This achieves effective format processing of the text content, provides a useful tool for data preprocessing in the intelligent analysis of official documents, greatly reduces content-processing time, and reduces manual intervention.

Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims

1. A double-layer PDF file style formatting output method is characterized by comprising the following steps:

step c, caching the horizontal coordinates and the vertical coordinates corresponding to all the text labels of the current page;

step h, removing special information from the content of the current page to obtain the text of the current page; the special information includes page numbers, special identifiers, pure numbers, content after a blank page, and landscape page content;

2. The method of claim 1, wherein: the first font size interval comprises the intervals 35-55 and 20-30.

3. The method of claim 1, wherein: the second font size interval comprises the intervals 19-25 and 12-14.5.

4. A double-layer PDF file style formatting output device, characterized by comprising: a loading and parsing module, a page start module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text marking module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;

the special information removal module is used for removing the special information from the content of the current page to obtain the text of the current page; the special information includes page numbers, special identifiers, pure numbers, content after a blank page, and landscape page content;

the paragraph extraction module is used for extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;

5. The device of claim 4, wherein: the first font size interval comprises the intervals 35-55 and 20-30.

6. The device of claim 4, wherein: the second font size interval comprises the intervals 19-25 and 12-14.5.

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 3 when executing the program.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 3.