CN112685994B - Double-layer PDF file style formatting output method, device, equipment and medium

Double-layer PDF file style formatting output method, device, equipment and medium

Info

Publication number
CN112685994B
Authority
CN
China
Prior art keywords
text
page
module
content
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011421689.7A
Other languages
Chinese (zh)
Other versions
CN112685994A (en
Inventor
黄敬林
庄莉
梁懿
林振天
池少宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Siji Location Service Co ltd
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd filed Critical State Grid Information and Telecommunication Co Ltd
Priority to CN202011421689.7A priority Critical patent/CN112685994B/en
Publication of CN112685994A publication Critical patent/CN112685994A/en
Application granted granted Critical
Publication of CN112685994B publication Critical patent/CN112685994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The embodiment of the invention provides a method, a device, equipment and a medium for double-layer PDF file style formatting output. A double-layer PDF file is parsed with the Java-based pdfbox toolkit, content analysis and format extraction are performed page by page on the parsed labels of the file, and the PDF file content is then output in a standard paragraph format. This achieves effective formatting of the text content, provides a useful tool for data preprocessing in the intelligent analysis of official documents, greatly reduces the time spent on content processing, and reduces manual intervention.

Description

Double-layer PDF file style formatting output method, device, equipment and medium
Technical Field
The invention relates to the technical field of PDF file processing, and in particular to a method, a device, equipment and a medium for double-layer PDF file style formatting output.
Background
In a collaborative office system, large numbers of incoming paper documents are converted by scanning and OCR recognition into double-layer PDF files for circulation. The text content in such a file is only roughly restored text, without specific information such as the paragraph format of the document, which causes many problems for the later analysis of official documents. The traditional method extracts the text content from the PDF and then analyzes and reconstructs the basic structure of the document using punctuation marks, semantic information of the text and similar cues. With this method it is difficult to restore the format information of the original document, which in turn makes it difficult to extract document metadata based on style.
Existing double-layer PDF style formatting mainly stays at the level of semantic analysis of the extracted file content, and it is difficult to restore the specific style content originally displayed by the PDF file, including format information such as the heading, the document number, the title and the paragraph content. This is a major obstacle to extracting content by style from PDF documents.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, a device, equipment and a medium for double-layer PDF file style formatting output, which parse a double-layer PDF document with a Java-based PDF toolkit so as to output the PDF document content in a standard paragraph format.
In a first aspect, the present invention provides a double-layer PDF file style formatting output method, including:
step a, loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
step b, parsing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;
step c, caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
step d, acquiring and caching bold text to obtain a bold cache queue;
step e, at the end of each page, sorting the text labels according to the horizontal and vertical coordinates corresponding to each text label;
step f, marking the bold text content in the text labels as bold information according to the bold cache queue;
step g, caching all text labels of the current page in order, and calling a label output method to output the content, obtaining the content of the current page;
step h, removing special information from the content of the current page to obtain the text of the current page;
step i, extracting from the current page text the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
step j, extracting document number information from the current page text through a regular expression, and then outputting the document number information;
step k, extracting from the current page text the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
step l, marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line of text is indented, or when the line-end coordinate is shorter than the actual coordinate;
step m, extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;
step n, judging whether the document has ended; if not, setting the page-end flag to "no" and returning to step b; if so, executing step o;
and step o, outputting the total number of characters and the total number of pages of the double-layer PDF file.
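The ordering in step e can be illustrated with a minimal Java sketch. It is not taken from the patent: the `TextLabel` and `LabelSorter` classes are hypothetical, the vertical coordinate is assumed to grow down the page, and labels on the same line are assumed to share exactly the same vertical coordinate (real OCR output would probably need a tolerance).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one parsed text label and its page coordinates. */
class TextLabel {
    final String text;
    final double x; // horizontal coordinate
    final double y; // vertical coordinate, assumed to grow down the page

    TextLabel(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class LabelSorter {
    /** Returns a new list ordered by line (y) first, then by position within the line (x). */
    static List<TextLabel> sortByPosition(List<TextLabel> labels) {
        List<TextLabel> sorted = new ArrayList<>(labels);
        sorted.sort(Comparator
                .comparingDouble((TextLabel l) -> l.y)
                .thenComparingDouble(l -> l.x));
        return sorted;
    }
}
```

Sorting a copy keeps the cached coordinates of step c untouched, so later steps can still consult the original order if needed.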
Further, in step h, the special information includes page numbers, special identifiers, purely numeric content, content following a blank page, and landscape page content.
Further, the first font size interval comprises the interval 35-55 and the interval 20-30.
Further, the second font size interval comprises the interval 19-25 and the interval 12-14.5.
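The interval tests used in steps i and k can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the class and enum names are hypothetical, the bounds are treated as inclusive, and because the intervals 20-30 and 19-25 overlap, the sketch assumes the first interval (step i) is checked before the second (step k), following the order of the steps.

```java
class FontSizeClassifier {
    /** ISSUING_UNIT corresponds to step i, TITLE to step k, BODY to everything else. */
    enum Role { ISSUING_UNIT, TITLE, BODY }

    // Intervals taken from the text; inclusive bounds are an assumption.
    private static final double[][] FIRST_INTERVALS = {{35, 55}, {20, 30}};
    private static final double[][] SECOND_INTERVALS = {{19, 25}, {12, 14.5}};

    private static boolean inAny(double size, double[][] intervals) {
        for (double[] iv : intervals) {
            if (size >= iv[0] && size <= iv[1]) {
                return true;
            }
        }
        return false;
    }

    /** Checks the first interval before the second, so overlapping sizes resolve to step i. */
    static Role classify(double fontSize) {
        if (inAny(fontSize, FIRST_INTERVALS)) {
            return Role.ISSUING_UNIT;
        }
        if (inAny(fontSize, SECOND_INTERVALS)) {
            return Role.TITLE;
        }
        return Role.BODY;
    }
}
```

With this ordering a size of 22 is treated as step i content, while 19.5 and 13 fall only in the second interval. A real implementation might instead use page position to break the tie.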
In a second aspect, the present invention provides a double-layer PDF file style formatting output device, including: a loading and parsing module, a page start module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text marking module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;
the loading and parsing module is used for loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
the page start module is used for parsing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;
the coordinate caching module is used for caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
the bold content caching module is used for acquiring and caching bold text to obtain a bold cache queue;
the rearrangement module is used for sorting the text labels, at the end of each page, according to the horizontal and vertical coordinates corresponding to each text label;
the bold text marking module is used for marking the bold text content in the text labels as bold information according to the bold cache queue;
the label output module is used for caching all text labels of the current page in order and calling a label output method to output the content, obtaining the content of the current page;
the special information removal module is used for removing special information from the content of the current page to obtain the text of the current page;
the issuing unit extraction module is used for extracting from the current page text the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
the document number extraction module is used for extracting document number information from the current page text through a regular expression, and then outputting the document number information;
the title extraction module is used for extracting from the current page text the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
the paragraph judgment module is used for marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line of text is indented, or when the line-end coordinate is shorter than the actual coordinate;
the paragraph extraction module is used for extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;
the document end judgment module is used for judging whether the document has ended; if not, the page-end flag is set to "no" and control returns to the page start module; if so, the result output module is executed;
and the result output module is used for outputting the total number of characters and the total number of pages of the double-layer PDF file.
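The regular-expression extraction performed by the module that handles step j could look roughly like the following Java sketch. The patent does not disclose the actual expression, so the pattern here is an assumption modeled on a common layout of Chinese official document numbers (an issuer abbreviation, a bracketed four-digit year, a serial number and the character 号, as in the made-up example 国办发〔2020〕12号). `DocNumberExtractor` is a hypothetical name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocNumberExtractor {
    // Assumed shape: CJK characters, a bracketed four-digit year, digits, then the "hao" character.
    // The pattern is an illustration only; the real expression is not given in the text.
    private static final Pattern DOC_NUMBER = Pattern.compile(
            "[\\u4e00-\\u9fa5]+[\\u3014\\[(\\uff08]\\d{4}[\\u3015\\])\\uff09]\\d+\\u53f7");

    /** Returns the first substring that looks like a document number, if any. */
    static Optional<String> extract(String pageText) {
        Matcher m = DOC_NUMBER.matcher(pageText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Returning an `Optional` makes the "no document number on this page" case explicit, which matters because only some pages of an official document carry one.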
Further, the special information includes page numbers, special identifiers, purely numeric content, content following a blank page, and landscape page content.
Further, the first font size interval comprises the interval 35-55 and the interval 20-30.
Further, the second font size interval comprises the interval 19-25 and the interval 12-14.5.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
the double-layer PDF document is parsed with the Java-based pdfbox toolkit, content analysis and format extraction are performed page by page on the parsed labels of the document, and the PDF document content is then output in a standard paragraph format, so that the text content is effectively formatted, a useful tool is provided for data preprocessing in the intelligent analysis of official documents, the time spent on content processing is greatly reduced, and manual intervention is reduced.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The invention will be further described by way of the following embodiments with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of the system of the present invention;
FIG. 2 is a flow chart of a method according to the first embodiment of the present invention;
FIG. 3 is a schematic view of the second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to the third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a medium according to the fourth embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present description without inventive step, shall fall within the scope of protection of the present application.
Example one
This embodiment provides a double-layer PDF file style formatting output method, as shown in FIG. 1, including:
step a, loading a double-layer PDF file: loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
the PDF file passed in through a hyperlink can be loaded with the pdfbox toolkit, which triggers parsing of the document content; other existing PDF tools can also be used to parse the document content;
step b, starting the page: parsing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;
step c, caching output content: caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
the parsed PDF file usually consists of text blocks, and each text block is one text label;
step d, caching bold content: acquiring and caching bold text to obtain a bold cache queue;
in order to retain the bold text information of the double-layer PDF file, it is judged whether a text is bold; if so, the bold text content is held in a cache until it has been read completely, it is then placed in the bold cache queue, and the end of the page is then triggered;
step e, rearranging labels: at the end of each page, sorting the text labels according to the horizontal and vertical coordinates corresponding to each text label; the lines of text are ordered by the vertical coordinate, and the text content within a line is ordered by the horizontal coordinate;
step f, marking bold content: marking the bold text content in the text labels as bold information according to the bold cache queue;
step g, outputting the page content: caching all text labels of the current page in order, and calling a label output method to output the content, obtaining the content of the current page;
calling the label output method outputs the text labels as XML data;
step h, removing special information: removing special information from the content of the current page to obtain the text of the current page;
in a possible implementation, the special information includes page numbers, special identifiers, purely numeric content, content following a blank page, landscape page content and the like;
step i, extracting the issuing unit: extracting from the current page text the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
in a possible implementation, the first font size interval comprises the interval 35-55 and the interval 20-30, and when the text font size falls within either of the two intervals, the text is extracted as issuing unit information;
step j, extracting the document number: extracting document number information from the current page text through a regular expression, and then outputting the document number information;
step k, extracting the title: extracting from the current page text the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
in a possible implementation, the second font size interval comprises the interval 19-25 and the interval 12-14.5, and when the text font size falls within either of the two intervals, the text is extracted as title information;
step l, outputting paragraph marks: marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line of text is indented, or when the line-end coordinate is shorter than the actual coordinate;
step m, extracting paragraph content: extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;
step n, judging whether the document has ended: if not, setting the page-end flag to "no", then returning to step b, clearing the cache, and parsing the next page of the document; if so, executing step o;
step o, outputting the character count and page count of the document: outputting the total number of characters and the total number of pages of the double-layer PDF file.
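The paragraph rules of steps l and m can be sketched in Java as follows. This is a simplified illustration, not the patented logic: the names `PageLine` and `ParagraphSplitter` are hypothetical, the "actual coordinate" is interpreted as the right text margin, a small tolerance is assumed for coordinate comparisons, and the portrait-to-landscape condition is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one sorted line of page text with its start and end x coordinates. */
class PageLine {
    final String text;
    final double startX;
    final double endX;

    PageLine(String text, double startX, double endX) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
    }
}

class ParagraphSplitter {
    /** Groups lines into paragraphs using the indentation and short-line rules. */
    static List<String> split(List<PageLine> lines, double leftMargin,
                              double rightMargin, double tolerance) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (PageLine line : lines) {
            // An indented line opens a new paragraph, so close the one in progress.
            boolean indented = line.startX > leftMargin + tolerance;
            if (indented && current.length() > 0) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            current.append(line.text);
            // A line that stops before the right margin closes its paragraph.
            boolean endsShort = line.endX < rightMargin - tolerance;
            if (endsShort) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    /** Counts characters as Unicode code points across all paragraphs. */
    static int totalCharacters(List<String> paragraphs) {
        int total = 0;
        for (String p : paragraphs) {
            total += p.codePointCount(0, p.length());
        }
        return total;
    }
}
```

Counting code points rather than `String.length()` avoids double-counting characters outside the Basic Multilingual Plane, which can occur in Chinese text.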
The output content is expressed in XML and is divided into the issuing unit < ansurance unit >, the document number < wh >, the title < subject >, paragraphs < paragraph >, the total number of characters < word number >, the total number of pages < pageCount >, and so on. The output produced by the double-layer PDF formatting of this embodiment is shown in FIG. 2 and includes: the issuing unit, the title, paragraph information 1, paragraph information 2, paragraph information 3, …, paragraph information n, the total number of characters, and the total number of pages.
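A minimal Java sketch of such an XML rendering is given below. The element names `wh`, `subject`, `paragraph` and `pageCount` follow the text above; the root element `document` and the names `issuingUnit` and `wordCount` are placeholders chosen for this sketch, because the corresponding names are not legible in the source. The `XmlOutput` class is hypothetical.

```java
import java.util.List;

class XmlOutput {
    /** Escapes the characters that would otherwise break element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String element(String name, String value) {
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    /** Renders the extracted fields; the character total is derived from the paragraphs. */
    static String render(String issuingUnit, String docNumber, String title,
                         List<String> paragraphs, int pageCount) {
        StringBuilder xml = new StringBuilder("<document>");
        xml.append(element("issuingUnit", issuingUnit)); // placeholder element name
        xml.append(element("wh", docNumber));
        xml.append(element("subject", title));
        int chars = 0;
        for (String p : paragraphs) {
            xml.append(element("paragraph", p));
            chars += p.codePointCount(0, p.length());
        }
        xml.append(element("wordCount", String.valueOf(chars))); // placeholder element name
        xml.append(element("pageCount", String.valueOf(pageCount)));
        return xml.append("</document>").toString();
    }
}
```

Escaping `&`, `<` and `>` matters here because OCR text can contain these characters, and unescaped they would make the output malformed.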
In this embodiment of the invention, the text content of existing official documents is extracted from the perspective of their visual style, using official document fields, positions and semantic information. The official document content is converted into formatted XML, so that PDF official document metadata, paragraph information, page count and other information are presented in XML format, providing good material for intelligent document applications.
Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, and the detailed description is given in the second embodiment.
Example two
This embodiment provides a double-layer PDF file style formatting output device, as shown in FIG. 3, including: a loading and parsing module, a page start module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text marking module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;
the loading and parsing module is used for loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
the page start module is used for parsing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;
the coordinate caching module is used for caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
the bold content caching module is used for acquiring and caching bold text to obtain a bold cache queue;
the rearrangement module is used for sorting the text labels, at the end of each page, according to the horizontal and vertical coordinates corresponding to each text label;
the bold text marking module is used for marking the bold text content in the text labels as bold information according to the bold cache queue;
the label output module is used for caching all text labels of the current page in order and calling a label output method to output the content, obtaining the content of the current page;
the special information removal module is used for removing special information from the content of the current page to obtain the text of the current page;
the issuing unit extraction module is used for extracting from the current page text the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
the document number extraction module is used for extracting document number information from the current page text through a regular expression, and then outputting the document number information;
the title extraction module is used for extracting from the current page text the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
the paragraph judgment module is used for marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line of text is indented, or when the line-end coordinate is shorter than the actual coordinate;
the paragraph extraction module is used for extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;
the document end judgment module is used for judging whether the document has ended; if not, the page-end flag is set to "no" and control returns to the page start module; if so, the result output module is executed;
and the result output module is used for outputting the total number of characters and the total number of pages of the double-layer PDF file.
In one possible implementation, the special information includes page numbers, special identifiers, purely numeric content, content following a blank page, and landscape page content.
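Part of this filtering can be sketched in Java as below. The sketch covers only two of the listed categories, purely numeric lines and page-number marks, and the patterns it uses (bare digits, or digits wrapped in dashes such as "- 3 -") are assumptions; the source does not say how special identifiers, content following a blank page, or landscape page content are recognized. `SpecialInfoFilter` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SpecialInfoFilter {
    // Assumed patterns: a line of bare digits, or digits wrapped in dashes such as "- 3 -".
    private static final Pattern PURE_NUMBER = Pattern.compile("\\d+");
    private static final Pattern PAGE_MARK =
            Pattern.compile("[-\\u2014]\\s*\\d+\\s*[-\\u2014]");

    /** True when the whole trimmed line is a bare number or a dash-wrapped page number. */
    static boolean isSpecial(String line) {
        String t = line.trim();
        return PURE_NUMBER.matcher(t).matches() || PAGE_MARK.matcher(t).matches();
    }

    /** Returns the lines that remain after the special ones are dropped. */
    static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isSpecial(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Using `matches()` rather than `find()` means a line is dropped only when it consists entirely of the pattern, so body text that merely contains a number is kept.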
In one possible implementation, the first font size interval comprises the interval 35-55 and the interval 20-30.
In one possible implementation, the second font size interval comprises the interval 19-25 and the interval 12-14.5.
Since the device described in the second embodiment of the present invention is the device used to implement the method of the first embodiment, a person skilled in the art can understand its specific structure and variations based on the method described in the first embodiment, and the details are therefore not repeated here. All devices used to carry out the method of the first embodiment of the present invention fall within the protection scope of the present invention.
Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.
EXAMPLE III
This embodiment provides an electronic device, as shown in FIG. 4, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, any implementation of the first embodiment can be realized.
Since the electronic device described in this embodiment is a device used for implementing the method in the first embodiment of the present application, based on the method described in the first embodiment of the present application, a specific implementation of the electronic device in this embodiment and various variations thereof can be understood by those skilled in the art, and therefore, how to implement the method in the first embodiment of the present application by the electronic device is not described in detail herein. The equipment used by those skilled in the art to implement the methods in the embodiments of the present application is within the scope of the present application.
Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, which is described in detail in the fourth embodiment.
Example four
This embodiment provides a computer-readable storage medium, as shown in FIG. 5, on which a computer program is stored; when the computer program is executed by a processor, any implementation of the first embodiment can be realized.
The embodiments of this specification parse a double-layer PDF document with the Java-based pdfbox toolkit, perform content analysis and format extraction page by page on the parsed labels of the document, and then output the PDF document content in a standard paragraph format, so that the text content is effectively formatted, a useful tool is provided for data preprocessing in the intelligent analysis of official documents, the time spent on content processing is greatly reduced, and manual intervention is reduced.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (8)

1. A double-layer PDF file style formatting output method, characterized by comprising the following steps:
step a, loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
step b, parsing the content of the parsed double-layer PDF file page by page, and recording the page number, page width and page height at the start of each page;
step c, caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
step d, acquiring and caching bold text to obtain a bold cache queue;
step e, at the end of each page, sorting the text labels according to the horizontal and vertical coordinates corresponding to each text label;
step f, marking the bold text content in the text labels as bold information according to the bold cache queue;
step g, caching all text labels of the current page in order, and calling a label output method to output the content, obtaining the content of the current page;
step h, removing special information from the content of the current page to obtain the text of the current page; the special information includes page numbers, special identifiers, purely numeric content, content following a blank page, and landscape page content;
step i, extracting from the current page text the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
step j, extracting document number information from the current page text through a regular expression, and then outputting the document number information;
step k, extracting from the current page text the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
step l, marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when a new line of text is indented, or when the line-end coordinate is shorter than the actual coordinate;
step m, extracting the text of the current page paragraph by paragraph according to the paragraph marks, and counting the characters of the text;
step n, judging whether the document has ended; if not, setting the page-end flag to "no" and returning to step b; if so, executing step o;
and step o, outputting the total number of characters and the total number of pages of the double-layer PDF file.
2. The method of claim 1, wherein: the first font size interval comprises the interval 35-55 and the interval 20-30.
3. The method of claim 1, wherein: the second font size interval comprises the interval 19-25 and the interval 12-14.5.
4. A double-layer PDF file style formatting output device, characterized by comprising: a loading analysis module, a page starting module, a coordinate caching module, a bold content caching module, a rearrangement module, a bold text identification module, a label output module, a special information removal module, an issuing unit extraction module, a document number extraction module, a title extraction module, a paragraph judgment module, a paragraph extraction module, a document end judgment module and a result output module;
the loading analysis module is used for loading a double-layer PDF file and triggering parsing of the file content to obtain text labels;
the page starting module is used for parsing the content of the double-layer PDF file page by page, and recording the page number, page width and page height when each page starts;
the coordinate caching module is used for caching the horizontal and vertical coordinates corresponding to all text labels of the current page;
the bold content caching module is used for acquiring and caching bold text to obtain a bold cache queue;
the rearrangement module is used for sorting the text labels according to the horizontal and vertical coordinates corresponding to each text label when each page ends;
the bold text identification module is used for marking the bold text content in the text labels as bold information according to the bold cache queue;
the label output module is used for caching all text labels of the current page in order and calling a label output method to output their content, obtaining the content of the current page;
the special information removal module is used for removing special information from the content of the current page to obtain the text of the current page; the special information comprises page numbers, special identifiers, purely numeric content, content following a blank page and landscape-page content;
the issuing unit extraction module is used for extracting, from the current page text, the characters whose font size falls within a first font size interval as issuing unit information, and then outputting the issuing unit information;
the document number extraction module is used for extracting document number information from the current page text through a regular expression and then outputting the document number information;
the title extraction module is used for extracting, from the current page text, the characters whose font size falls within a second font size interval as title information, and then outputting the title information;
the paragraph judgment module is used for marking the end of a paragraph and starting a new paragraph when a portrait page changes to a landscape page, when the text of a new line is indented, or when the line-end coordinate is shorter than the actual coordinate;
the paragraph extraction module is used for extracting the text of the current page paragraph by paragraph according to the paragraph marks and counting the characters of the text;
the document end judgment module is used for judging whether the document has ended; if the document has not ended, the page end flag is set to no and control returns to the page starting module, and if the document has ended, the result output module is executed;
and the result output module is used for outputting the total number of characters and the total number of pages of the double-layer PDF file.
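The sorting, paragraph-splitting and counting steps recited in claim 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `TextLabel` class, the function names, the tolerance and threshold values, and the convention that `y` grows downward from the top of the page are all assumptions made for illustration, and each label is treated as one whole line of text.

```python
from dataclasses import dataclass


@dataclass
class TextLabel:
    """Hypothetical text label: one line of text plus its page coordinates."""
    text: str
    x: float      # left edge of the label
    y: float      # top edge; assumed to grow downward from the top of the page
    x_end: float  # right edge of the label


def sort_reading_order(labels, line_tolerance=2.0):
    """Order labels top to bottom, then left to right within a line.

    Labels whose y values differ by at most line_tolerance from the first
    label of a row are treated as belonging to that row (an assumption).
    """
    rows = []
    for label in sorted(labels, key=lambda item: item.y):
        if rows and abs(label.y - rows[-1][0].y) <= line_tolerance:
            rows[-1].append(label)
        else:
            rows.append([label])
    return [label for row in rows for label in sorted(row, key=lambda item: item.x)]


def split_paragraphs(lines, left_margin, right_margin, indent=10.0, slack=10.0):
    """Group sorted lines into paragraphs.

    A new paragraph starts when a line is indented past the left margin, and
    a paragraph ends when a line stops short of the right margin. Both
    thresholds are illustrative values, not taken from the patent.
    """
    paragraphs, current = [], []
    for line in lines:
        if current and line.x - left_margin >= indent:
            paragraphs.append("".join(current))
            current = []
        current.append(line.text)
        if line.x_end < right_margin - slack:
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


def count_characters(paragraph):
    """Count non-whitespace characters in a paragraph."""
    return sum(1 for ch in paragraph if not ch.isspace())
```

In this sketch a page would be processed by calling `sort_reading_order` on its cached labels, passing the result to `split_paragraphs` with the page's text margins, and summing `count_characters` over the returned paragraphs. PDF parsers often report coordinates with the origin at the bottom-left corner, in which case the `y` values would need to be flipped first.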
5. The apparatus of claim 4, wherein: the first font size interval comprises the ranges 35-55 and 20-30.
6. The apparatus of claim 4, wherein: the second font size interval comprises the ranges 19-25 and 12-14.5.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 3 when executing the program.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 3.
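The font size intervals recited in claims 2, 3, 5 and 6 amount to a simple range test. The sketch below is illustrative only: the constant and function names are hypothetical, the bounds are assumed to be inclusive, and the font sizes are assumed to be in whatever unit the PDF parser reports. The claims do not say how a size that falls in both intervals (for example 22) is resolved, so the sketch tests each interval independently and leaves that decision to the caller.

```python
# Ranges taken from the claims; inclusive bounds are an assumption.
FIRST_INTERVALS = ((35.0, 55.0), (20.0, 30.0))
SECOND_INTERVALS = ((19.0, 25.0), (12.0, 14.5))


def in_intervals(size, intervals):
    """Return True if size lies inside any (low, high) range."""
    return any(low <= size <= high for low, high in intervals)


def select_by_size(spans, intervals):
    """Keep the text of every (text, font_size) pair whose size matches."""
    return [text for text, size in spans if in_intervals(size, intervals)]
```

For example, given the pairs `("HEADER", 40.0)`, `("Title", 22.0)`, `("body", 16.0)` and `("note", 13.0)`, the first interval selects `HEADER` and `Title`, while the second selects `Title` and `note`.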
CN202011421689.7A 2020-12-08 2020-12-08 Double-layer PDF file style formatting output method, device, equipment and medium Active CN112685994B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011421689.7A | 2020-12-08 | 2020-12-08 | Double-layer PDF file style formatting output method, device, equipment and medium

Publications (2)

Publication Number | Publication Date
CN112685994A (en) | 2021-04-20
CN112685994B (en) | 2023-02-21

Family

ID=75446278

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011421689.7A, Active, CN112685994B (en) | Double-layer PDF file style formatting output method, device, equipment and medium | 2020-12-08 | 2020-12-08

Country Status (1)

Country Link
CN (1) CN112685994B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR100496981B1 (en) * | 2002-12-18 | 2005-06-28 | 삼성에스디에스 주식회사 | A PDF Document Providing Method Using XML
CN104063364A (en) * | 2013-03-19 | 2014-09-24 | 福建福昕软件开发股份有限公司北京分公司 | PDF document recognition method
CN107622230B (en) * | 2017-08-30 | 2019-12-06 | 中国科学院软件研究所 | PDF table data analysis method based on region identification and segmentation
CN110598189A (en) * | 2019-08-14 | 2019-12-20 | 中国平安财产保险股份有限公司 | Document processing method, device, equipment and readable storage medium
CN110598191B (en) * | 2019-11-18 | 2020-04-07 | 江苏联著实业股份有限公司 | Complex PDF structure analysis method and device based on neural network

Also Published As

Publication number | Publication date
CN112685994A (en) | 2021-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231101

Address after: 350000 building 20, area G, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province

Patentee after: FUJIAN YIRONG INFORMATION TECHNOLOGY Co.,Ltd.

Patentee after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Patentee after: STATE GRID INFO-TELECOM GREAT POWER SCIENCE AND TECHNOLOGY Co.,Ltd.

Patentee after: State Grid Siji Location Service Co.,Ltd.

Address before: 350000 building 20, area G, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province

Patentee before: FUJIAN YIRONG INFORMATION TECHNOLOGY Co.,Ltd.

Patentee before: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Patentee before: STATE GRID INFO-TELECOM GREAT POWER SCIENCE AND TECHNOLOGY Co.,Ltd.