CN117493712B - PDF document navigable directory extraction method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN117493712B
Authority
CN
China
Prior art keywords
page
catalog
pdf document
page number
directory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311852456.6A
Other languages
Chinese (zh)
Other versions
CN117493712A (en)
Inventor
邓新星
程斯静
顾丹鹏
谢世超
邬远祥
唐海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang East China Engineering Digital Technology Co ltd
PowerChina Huadong Engineering Corp Ltd
Original Assignee
Zhejiang East China Engineering Digital Technology Co ltd
PowerChina Huadong Engineering Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang East China Engineering Digital Technology Co ltd and PowerChina Huadong Engineering Corp Ltd
Priority to CN202311852456.6A
Publication of CN117493712A
Application granted
Publication of CN117493712B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/954 Navigation, e.g. using categorised browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

To extract a navigable directory from a PDF document, ensure the accuracy of the directory, and provide a quick jump function, the invention provides a method, a device, electronic equipment and a storage medium for extracting the navigable directory of a PDF document. The method comprises the following steps: searching for the catalog pages of the PDF document; extracting the catalog titles and catalog page numbers from the catalog pages; converting each page of the PDF document into a picture, sorting all pictures in page order, and taking each picture's sequence number as the navigation page number of that page; identifying the picture page number of every page picture; performing secondary verification and correction on the picture page numbers based on the page number difference; matching and associating the directory entries with the navigation page numbers to obtain all directory titles, directory page numbers and navigation page numbers; and outputting a navigable PDF document catalog. The technical scheme improves the accuracy of PDF document directory identification and helps the user quickly locate content through the directory.

Description

PDF document navigable directory extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of document data processing, in particular to a method and a device for extracting a navigable catalog of a PDF document, electronic equipment and a storage medium.
Background
Digital transformation is an important trend in current economic and social development. With the development of new-generation information technology and strong support from national policy, more and more enterprises are building digital platforms to improve production efficiency and quality. One of the core elements of digital transformation is the integration and use of data resources, which involves processing and using a large number of electronic documents. Examples include an electronic picture center that provides storage, retrieval and online viewing of engineering files and picture files, and a project management platform that provides integrated management of project documents and extraction of key information.
Currently, PDF (Portable Document Format) is one of the main formats for processing and transmitting electronic documents. Because PDF documents are produced in different ways, a large number of them do not support clicking a catalog entry to jump to the page of the corresponding content. For example, a PDF document whose catalog carries no content links cannot jump from the catalog, and for a PDF document produced by scanning paper material the catalog content cannot be read at all, so jumping is also impossible. In addition, a PDF document may contain pages that are not listed in the catalog, so the catalog page number (the page number printed next to a catalog title) is inconsistent with the navigation page number (the actual position, within the whole document, of the content page that the catalog title refers to). During manual lookup the reader therefore cannot go directly to the desired page using the catalog page number, and lookup efficiency is low.
Several methods for document catalog extraction and generation already exist, for example the Chinese invention patents with publication numbers CN201611028787.8, CN202211734526.3, CN202110638300.2 and CN201910973998.6. These technologies share a disadvantage: they extract the catalog from text data, or from text-version PDF documents whose text and formatting can be extracted, so they cannot process the large number of scanned PDF documents whose text is not readable, and their applicability is low.
Some methods also exist for extracting the catalog of a scanned PDF document. For example, the Chinese invention patents with publication numbers CN202111420845.2 and CN202010919554.X mainly rely on OCR technology and the visual attributes of the document content. Their drawback is a heavy dependence on document layout: because layouts differ between documents, accuracy is limited, and a large number of documents must be labeled and trained on in full text to improve it, which is costly. They also lack secondary verification, so recognition errors cannot be handled. Chinese patent CN202310291320.6 improves the OCR method but extracts only the document catalog, not the page numbers, and therefore cannot provide catalog navigation and jumping.
On a digital platform, document lookup is required extremely often, and a navigable PDF document catalog is gradually becoming an indispensable function. For catalog navigation, the Chinese patent with publication number CN202310265473.3 provides a method for generating a jumpable PDF file from a scanned physical document. That method extracts the catalog of a scanned PDF document and establishes jump links between the catalog and the content, but it has obvious defects. First, it has no method for finding the catalog pages: the pages containing the catalog must be determined manually before the catalog is identified, so the process cannot be automated. Second, the identification of the catalog and the content is not verified a second time, which affects the accuracy of the jump links.
Disclosure of Invention
In order to overcome the defects in the prior art, improve the reading efficiency of users and the utilization rate of electronic documents, the invention aims to establish a more accurate and automatic method, realize the directory extraction of navigable PDF documents, ensure the accuracy of directories and provide a rapid jump function.
Aiming at the problems of the existing method, the invention provides a method, a device and a storage medium for extracting the navigable catalog of the PDF document, improves the identification accuracy of the catalog of the PDF document, and helps a user to quickly locate the content of the catalog of the PDF document.
The first object of the present invention is to provide a method for extracting a navigable catalog of a PDF document, which comprises the following steps:
S101, searching for the catalog pages of the PDF document;
S102, extracting the catalog titles and catalog page numbers from the catalog pages;
S103, converting each page of the PDF document into a picture, sorting all pictures in page order, and taking each picture's sequence number as the navigation page number of that page;
S104, identifying the picture page numbers of all page pictures;
S105, performing secondary verification and correction on the picture page numbers based on the page number difference;
S106, matching and associating the directory entries with the navigation page numbers to obtain all directory titles, directory page numbers and navigation page numbers;
S107, outputting a navigable PDF document catalog.
Preferably, searching for the catalog pages of the PDF document uses either of the following methods:
1) If the PDF document is a text version: extracting all text of each page of the PDF document, or of each page in order starting from the first page, and then searching for the pages where the catalog of the text-version PDF document is located;
2) If the PDF document is a scanned version: using a directory identification model to examine each page of the PDF document, or each page in order starting from the first page, judging whether the page is a directory page, and finding all directory pages of the PDF document.
Preferably, the secondary verification and correction of the picture page numbers based on the page number difference includes the following steps:
1) Subtracting the picture page number from the navigation page number to obtain the page number difference of the PDF document;
2) Performing secondary verification on the picture page numbers, and correcting the wrongly identified picture page numbers by using the page number difference of the PDF document.
A second object of the present invention is to provide a PDF document navigable directory extracting apparatus, including:
The catalog page searching module is used for searching for the catalog pages of PDF documents, and uses different methods for different types of PDF document;
The catalog extraction module is used for extracting the catalog contents of PDF documents, and uses different methods to extract the titles and page numbers of all catalog items from the catalog pages of different types of PDF document;
The navigation page number extraction module is used for converting each page of the PDF document into a picture, sorting all pictures in page order, and taking each picture's sequence number as the navigation page number of that page;
The page number identification module is used for identifying the picture page numbers of all page pictures;
The page number verification and correction module is used for performing secondary verification and correction on the picture page numbers based on the page number difference;
The page number matching and associating module is used for matching and associating the catalog items with the navigation page numbers, finally obtaining all catalog titles, catalog page numbers and navigation page numbers;
The PDF document catalog output module is used for outputting the navigable PDF document catalog.
Preferably, for text-version PDF documents, the catalog page searching module includes:
The page text extraction sub-module, used for extracting all text of each page of the PDF document, or of each page in order starting from the first page;
The catalog page searching sub-module, used for searching for the pages where the catalog of the text-version PDF document is located.
Preferably, for scanned-version PDF documents, the catalog page searching module uses a catalog identification model to examine each page of the PDF document, or each page in order starting from the first page, judges whether the page is a catalog page, and finds all catalog pages of the PDF document.
Preferably, the page number verification and correction module includes:
1) The page number difference calculation sub-module, used for subtracting the picture page number from the navigation page number to obtain the page number difference of the PDF document;
2) The page number verification and correction sub-module, used for performing secondary verification on the picture page numbers and correcting the wrongly identified picture page numbers by using the page number difference of the PDF document.
A third object of the present invention is to propose an electronic device comprising:
a memory for storing a computer program;
and the processor is used for executing the program stored in the memory and realizing the steps of the method for extracting the navigable catalog of the PDF document.
A fourth object of the present invention is to provide a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the steps of any one of the aforementioned PDF document navigable directory extraction methods.
Compared with the prior art, the technical scheme of the invention has the following positive and beneficial effects:
1. The method is applicable both to text-version PDF documents, whose text can be parsed, and to scanned PDF documents, whose text cannot, so it has a wider application range and greater generality.
2. A secondary verification method based on the page number difference is provided, which checks and corrects the PDF page number extraction result and improves the accuracy of directory extraction.
3. The PDF catalog can be associated with navigation page numbers to generate navigable catalog data. The method can be widely applied in digital systems and greatly improves the efficiency of reading electronic documents.
Drawings
FIG. 1 is a flow chart of an embodiment of the method for extracting the navigable catalog of a PDF document;
FIG. 2 is a schematic diagram of the device for extracting the navigable catalog of a PDF document according to the present invention.
Detailed Description
For a further understanding of the present invention, preferred embodiments of the invention are described below in conjunction with the examples, but it should be understood that these descriptions are merely intended to illustrate further features and advantages of the invention, and are not limiting of the claims of the invention.
Interpretation of the terms
Directory title, denoted by title, refers to a text title in a directory page that indicates the structure of the document.
Directory page number, denoted by pdf_pn, refers to the page number that corresponds to a directory title in the directory page.
Navigation page number, denoted by page_num, refers to the sequential position of the page in the PDF document.
Picture page number, denoted by footer_int, refers to the page number displayed in the page number area (footer) of the page.
Jumping refers to going directly to the content page that corresponds to the selected directory title.
The embodiment of the invention provides a PDF document navigable directory extraction method and shows a device realized based on the method. The method and apparatus of the present invention will be further described with reference to the accompanying drawings and specific examples.
As shown in FIG. 1, the specific implementation steps of the embodiment of the navigable PDF document directory extraction method of the invention are as follows:
Step S101: searching for the catalog pages of the PDF document.
Specifically, for a given PDF document, it is first determined whether the document is a scanned version or a text version, according to whether the text within the PDF document can be extracted. Searching for the catalog pages of text-version and scanned-version PDF documents is described in detail below.
1. If the PDF document is a text version, the catalog pages are searched for as follows:
(1) Extract all text of each page of the PDF document, or of each page in order starting from the first page. Many open source libraries are available for PDF text extraction, such as pdfminer, pypdf and pdfplumber; this embodiment uses the pdfplumber library.
(2) Search for the pages where the catalog of the text-version PDF document is located. This embodiment uses the following method: for each page, calculate the proportion of English periods (".") in the total number of characters on the page and compare it with a set threshold; pages whose proportion exceeds the threshold are directory pages. This embodiment calls this the method based on the directory feature text ratio.
In this embodiment, the threshold is set to 50%. For example, English periods account for 79.55% of the total characters on the first page, which is greater than 50%, so the page is judged to be a catalog page; English periods account for 0.001% of the text on the second page, which is less than 50%, so the page is judged to be a non-catalog page.
It should be noted that the method based on the directory feature text ratio applies to most PDF documents. In the special case of a directory page without periods, the method for scanned PDF documents is used to find the directory pages.
2. If the PDF document is a scanned version, the catalog pages are searched for as follows:
Use a directory identification model to examine each page of the PDF document, or each page in order starting from the first page, judge whether the page is a directory page, and find all directory pages of the PDF document.
The directory identification model may be established in advance. The directory identification model of this embodiment is built as follows: directory pages and non-directory pages are labeled, and a general image classification model is trained with the labeled data to obtain the directory identification model. General image classification is currently a mature technology, and many open-source image classification algorithms are available on the Internet for frameworks such as PyTorch, TensorFlow and Baidu PaddlePaddle. In general, the page features of directory pages differ clearly from those of non-directory pages, so the labeling and training workload required by the directory identification model is small.
It should be noted that, for both kinds of PDF document, finding the directory pages does not require examining the entire PDF document. Pages may instead be examined in order: when the nth page is a directory page and the (n+1)th page is a non-directory page, identification stops, and the directory pages identified among pages 1 to n are the directory pages of the PDF document.
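The text-version search and the early stop described above might be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the page texts are made-up sample strings standing in for text already extracted by a library such as pdfplumber, and counting only non-whitespace characters as the "total text number" is an assumption.

```python
from typing import List


def period_ratio(text: str) -> float:
    """Share of English periods among the non-whitespace characters of a page.

    Ignoring whitespace in the total is an assumption of this sketch.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if c == ".") / len(chars)


def find_catalog_pages(page_texts: List[str], threshold: float = 0.5) -> List[int]:
    """Return 0-based indices of catalog pages, scanning from the first page.

    Scanning stops at the first non-catalog page that follows a catalog page,
    so the rest of the document is never examined.
    """
    found: List[int] = []
    for index, text in enumerate(page_texts):
        if period_ratio(text) > threshold:
            found.append(index)
        elif found:
            break
    return found


# Made-up sample pages: a cover, one catalog page with dot leaders, body text.
pages = [
    "Cover page without any page numbers",
    "1. Introduction" + "." * 60 + "1\n2. Unstructured data management" + "." * 60 + "2",
    "1. Introduction\nThis chapter describes the background.",
    "." * 100,  # never examined: scanning has already stopped
]
print(find_catalog_pages(pages))
```

The last sample page would pass the threshold on its own, which shows that the early stop, not the ratio, is what excludes it.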
Step S102: extracting the catalog titles and catalog page numbers from the catalog pages.
The following describes in detail how the directory titles and directory page numbers are extracted for text-version and scanned-version PDF documents respectively.
1. If the PDF document is a text version: first, split each line of the directory page into a title part and a page number part; then filter the non-digit characters out of the page number part, which yields the title part and page number part of each directory entry. In this embodiment, a regular expression is used to split the title and page number of each directory line; this is prior art and is not described further here.
2. If the PDF document is a scanned version: extract the titles and page numbers of the directory with a directory identification model.
The directory identification model may adopt an existing mature technology, such as an open-source algorithm like the Paddle layout analysis model, or an existing tool, such as the office document recognition tool of Baidu AI Cloud (see closed.baidu.com/product/ocr/doc_analysis_office).
In this embodiment, the extraction result of the directory page is as follows:
[
{
"title": "1. Introduction",
"pdf_pn": "1"
},
{
"title": "2. Unstructured data management",
"pdf_pn": "2"
},
{
"title": "2.1. Unstructured data definitions and features",
"pdf_pn": "2"
},
{
"title": "2.2. Unstructured data management development history",
"pdf_pn": "4"
},
{
"title": "3. Unstructured data management system",
"pdf_pn": "6"
},
……
]
Wherein each entry of the array is a directory entry, and each directory entry contains a directory title and a directory page number, wherein title represents the directory title and pdf_pn represents the directory page number.
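For the text version, the split of each directory line into a title and a page number might be sketched as below. The patent only states that a regular expression is used; the concrete pattern, the set of leader characters, and the function name parse_catalog_lines are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumed line shape: title, a run of leader characters (dots, middle dots,
# ellipses or spaces), then a page number token at the end of the line.
_ENTRY = re.compile(
    r"^(?P<title>.+?)[\s.·…]+(?P<page>[^\s.·…]*\d[^\s.·…]*)\s*$"
)


def parse_catalog_lines(lines: List[str]) -> List[Dict[str, str]]:
    """Turn raw directory lines into {"title", "pdf_pn"} entries.

    Lines that do not end in a page number (for example a "Contents"
    heading) are skipped.
    """
    entries: List[Dict[str, str]] = []
    for line in lines:
        match = _ENTRY.match(line.strip())
        if not match:
            continue
        # Filter non-digit characters out of the page number part.
        digits = re.sub(r"\D", "", match.group("page"))
        entries.append({"title": match.group("title").strip(), "pdf_pn": digits})
    return entries


sample_lines = [
    "Contents",
    "1. Introduction........1",
    "2.1. Unstructured data definitions and features .... 2",
]
for entry in parse_catalog_lines(sample_lines):
    print(entry)
```

The page number is kept as a string to mirror the extraction result shown above; a real directory with unusual leaders or multi-line titles would need a more tolerant pattern.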
Step S103: extracting the navigation page numbers, that is, converting each page of the PDF document into a picture, sorting all pictures in page order, and taking each picture's sequence number as the navigation page number of that page.
Specifically, each page of the PDF document is converted into a picture and stored in a temporary folder. All pictures are sorted in page order, the sequence number of each page picture is taken as the navigation page number of the page, and the sequence number is also used as the file name of the page picture. In this embodiment, navigation page number extraction yields 57 pictures, numbered starting from "0"; for example, the page whose sequence number in the folder is 2 has navigation page number 2.
Step S104: identifying the picture page numbers of all page pictures.
The picture page number is the page number recognized on each page picture after every page of the PDF document has been converted into a picture in step S103. It is an intermediate result needed to calculate the page number difference of the PDF document in the subsequent step S105. It should be noted that pages such as the cover, description pages, blank pages and catalog pages of a PDF document may carry no page number. These pages are generally not listed in the catalog of the PDF document, and the method of this embodiment only extracts the range covered by the catalog, so they need no processing. Specifically, picture page number recognition is performed on the pictures obtained in step S103 with a page number recognition model.
The page number recognition model may be established in advance. This embodiment builds it as follows: the page numbers of pages are labeled, and a general OCR model is trained with the labeled data to obtain the page number recognition model. General OCR is currently a mature technology, and many open-source projects, such as the Paddle layout analysis model, are available on the Internet. In general, page numbers are laid out in a regular way, so the labeling and training workload required by the page number recognition model is small.
It should be noted that, because of incomplete scans, occlusion, limited model accuracy and similar causes, the page number of a small number of pictures may not be recognized, or may be recognized incorrectly.
In the present embodiment, the partial result of the picture page number recognition is shown in the following code:
[
{
"page_num": 0,
"footer_int": null
},
{
"page_num": 1,
"footer_int": null
},
{
"page_num": 2,
"footer_int": null
},
{
"page_num": 3,
"footer_int": null
},
{
"page_num": 4,
"footer_int": 1
},
{
"page_num": 5,
"footer_int": 2
},
{
"page_num": 6,
"footer_int": 3
},
{
"page_num": 7,
"footer_int": 4
},
{
"page_num": 8,
"footer_int": null
},
……
]
Here page_num represents the navigation page number and footer_int represents the picture page number. The pictures with navigation page numbers 0 to 3 are the cover, copyright notice and catalog pages; these pages carry no page number, so their picture page number is null. Identification failed for the picture with navigation page number 8, so its picture page number is also null. The other pictures were identified successfully.
It should be noted that the accuracy of OCR-based page number recognition rarely reaches 100%, so recognition failures and recognition errors are common. The prior art rarely checks the page number recognition result and offers no automatic verification method. The embodiment of the invention provides a secondary verification method based on the page number difference, which is an important innovation of the embodiment. See step S105 below for details.
Step S105: performing secondary verification and correction on the picture page numbers based on the page number difference.
(1) Calculate the page number difference of the PDF document.
First, for each picture, subtract the picture page number from the navigation page number to obtain the page number difference of that picture.
Then, because OCR accuracy rarely reaches 100%, a few pictures will have a page number difference that differs from the rest, so the page number difference shared by the majority of pictures is taken as the page number difference of the PDF document. For example: count the page number differences of all pictures, sort them by frequency from high to low, and take the most frequent page number difference as the page number difference of the PDF document.
In this embodiment, one page has navigation page number 4 and picture page number 1, so its page number difference is 4-1=3. The page number difference of every picture is obtained in the same way. The picture with navigation page number 8 has a null picture page number, so its page number difference cannot be calculated and is null. All page number differences are then counted: 49 pictures have a page number difference of 3 and 8 pictures have a page number difference of null. Finally, the page number difference 3 shared by most pages is taken as the page number difference of the PDF document.
(2) Perform secondary verification on the picture page numbers, correct the wrongly identified picture page numbers by using the page number difference of the PDF document, and fill in the missing picture page numbers.
In this step, secondary verification is performed on the picture page number extraction result of S104. The following two cases count as picture page number identification errors: 1) pictures for which a page number difference cannot be calculated, for example pictures whose picture page number is identified as null; 2) pictures whose page number difference is not equal to that of the PDF document. For example, if the page number difference of the PDF document is 3 and the nth picture has navigation page number 19 and picture page number 18, its page number difference is 19-18=1; since 1 is not equal to 3, the picture page number was identified incorrectly.
In this embodiment, a wrongly identified picture page number is corrected with the page number difference of the PDF document as follows: picture page number = navigation page number - page number difference. In this embodiment, footer_int of the picture with navigation page number 8 is null, and the correction result is shown in the following code:
[
……
{
"page_num": 7,
"real_pn": 8,
"footer_int": 4,
"footer_patched": 4
},
{
"page_num": 8,
"real_pn": 9,
"footer_int": null,
"footer_patched": 5
},
{
"page_num": 9,
"real_pn": 10,
"footer_int": 6,
"footer_patched": 6
},
……
]
Here page_num represents the navigation page number, footer_int represents the picture page number, and footer_patched represents the picture page number after completion. For the second record in the code example, the completed picture page number is footer_patched=8-3=5.
It should be noted that real_pn represents the actual navigation page number. Because navigation page numbers are counted from 0 while the pages of a PDF document are actually counted from 1, the actual navigation page number is the navigation page number plus one. Either the navigation page number or the actual navigation page number may be used for navigation jumps in the PDF document, depending on the application implementation.
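The two parts of step S105 can be sketched together: take the most frequent difference between navigation page number and picture page number as the document's page number difference, then recompute every picture page number that is missing or disagrees with it. The function names are hypothetical, the sample records are a small excerpt rather than the 57 pictures of the embodiment, and setting footer_patched to null for pages whose computed page number would be below 1 (cover, catalog and similar pages) is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Optional

Page = Dict[str, Optional[int]]


def document_page_diff(pages: List[Page]) -> Optional[int]:
    """Most frequent (page_num - footer_int) among pages with a recognized footer."""
    diffs = [p["page_num"] - p["footer_int"] for p in pages if p["footer_int"] is not None]
    return Counter(diffs).most_common(1)[0][0] if diffs else None


def patch_footers(pages: List[Page], page_diff: int) -> List[Page]:
    """Verify each picture page number and correct it where needed."""
    patched: List[Page] = []
    for p in pages:
        footer = p["footer_int"]
        # Error cases: no difference can be computed, or it differs from the document's.
        if footer is None or p["page_num"] - footer != page_diff:
            footer = p["page_num"] - page_diff
        if footer < 1:  # assumption: front matter has no printed page number
            footer = None
        patched.append({
            "page_num": p["page_num"],
            "real_pn": p["page_num"] + 1,  # 1-based position in the document
            "footer_int": p["footer_int"],
            "footer_patched": footer,
        })
    return patched


recognized = [
    {"page_num": n, "footer_int": f}
    for n, f in [(0, None), (3, None), (4, 1), (5, 2), (6, 3), (7, 4),
                 (8, None), (9, 6), (19, 18)]
]
diff = document_page_diff(recognized)
for row in patch_footers(recognized, diff):
    print(row)
```

In this excerpt the page with navigation page number 8 is filled in as 5, and the misread page with navigation page number 19 is corrected from 18 to 16.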
It should be noted that steps S101-S102 and steps S103-S105 may be performed in parallel; there is no strict order between them.
Step S106: matching and associating the directory entries with the navigation page numbers.
Specifically, for each directory entry in the directory extraction result obtained in S102, which contains a directory title and a directory page number, all picture page numbers are searched. If a picture page number is identical to the directory page number, the directory entry is associated with the corresponding navigation page number, giving the triple directory title, navigation page number, directory page number.
In this embodiment, the directory extraction result of step S102 is matched against the corrected picture page number recognition result of step S105. For each directory entry in the directory extraction result, when footer_patched (the corrected picture page number) in the corrected recognition result equals pdf_pn (the directory page number) of the entry, the page_num (navigation page number) and real_pn (actual navigation page number) of that record are added to the directory entry. The matching and association results are shown in the following code:
[
  {
    "Title": "1. Introduction",
    "page_num": 4,
    "real_pn": 5,
    "pdf_pn": 1
  },
  {
    "Title": "2. Unstructured data management",
    "page_num": 5,
    "real_pn": 6,
    "pdf_pn": 2
  },
  {
    "Title": "2.1. Unstructured data definitions and features",
    "page_num": 5,
    "real_pn": 6,
    "pdf_pn": 2
  },
  {
    "Title": "2.2. Unstructured data management development history",
    "page_num": 7,
    "real_pn": 8,
    "pdf_pn": 4
  },
  {
    "Title": "3. Unstructured data management system",
    "page_num": 9,
    "real_pn": 10,
    "pdf_pn": 6
  },
  ……
]
For example, the association result of the fourth directory entry is: directory title "2.2. Unstructured data management development history", navigation page number 7, directory page number 4.
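The matching in step S106 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name associate_entries, the patched_key parameter, and the choice to skip entries with no matching page are hypothetical, while the field names follow the code examples in this description.

```python
from typing import Dict, List


def associate_entries(toc_entries: List[Dict], page_records: List[Dict],
                      patched_key: str = "folder_patched") -> List[Dict]:
    """Attach navigation page numbers to directory entries (sketch of step S106)."""
    # Index the page records by corrected picture page number.
    # Assumption: if several pages share a number, the first one wins.
    by_picture_page: Dict[int, Dict] = {}
    for record in page_records:
        number = record.get(patched_key)
        if number is not None:
            by_picture_page.setdefault(number, record)

    linked = []
    for entry in toc_entries:
        record = by_picture_page.get(entry["pdf_pn"])
        if record is None:
            continue  # assumption: entries without a matching page are skipped
        linked.append({**entry,
                       "page_num": record["page_num"],
                       "real_pn": record["real_pn"]})
    return linked


# Usage with values taken from the example above (page number difference of 3).
toc = [{"Title": "1. Introduction", "pdf_pn": 1},
       {"Title": "2.2. Unstructured data management development history", "pdf_pn": 4}]
pages = [{"page_num": n, "real_pn": n + 1, "folder_patched": n - 3} for n in range(3, 10)]
linked = associate_entries(toc, pages)
```

Building the index once keeps the lookup for each directory entry constant-time, which may matter for long documents; a plain nested search over all picture page numbers, as the text describes, gives the same result.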
Step S107, a navigable PDF document catalog is output.
Specifically, the association results of all directory entries are collected and output as a navigable PDF document directory. The application system can use this directory to link each directory entry to its page by looking up the navigation page number, thereby realizing the jump.
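As an illustration of step S107, the sketch below serializes the associated entries and resolves a jump target from a directory title. The JSON output format and the function names export_catalog and jump_target are assumptions made for this example; the description does not prescribe an output format.

```python
import json
from typing import Dict, List, Optional


def export_catalog(linked_entries: List[Dict]) -> str:
    """Serialize the navigable catalog; ensure_ascii=False keeps non-ASCII titles readable."""
    return json.dumps(linked_entries, ensure_ascii=False, indent=2)


def jump_target(catalog_json: str, title: str, one_based: bool = True) -> Optional[int]:
    """Return the page to jump to for a directory title, or None if the title is absent."""
    for entry in json.loads(catalog_json):
        if entry["Title"] == title:
            # real_pn counts pages from 1, page_num counts them from 0.
            return entry["real_pn"] if one_based else entry["page_num"]
    return None


catalog = export_catalog([
    {"Title": "1. Introduction", "page_num": 4, "real_pn": 5, "pdf_pn": 1},
])
```

Whether an application jumps with page_num or real_pn depends on whether its viewer indexes pages from 0 or from 1, as noted earlier in the description.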
The invention also provides an embodiment of a device for extracting the navigable catalog of the PDF document, the schematic diagram of the device is shown in figure 2, and the device comprises:
the catalog page searching module, used for searching for the catalog pages of the PDF document, with different methods adopted for different types of PDF documents;
the catalog extraction module, used for extracting the catalog contents of the PDF document, the titles and page numbers of all catalog items on the catalog pages being extracted by different methods for different types of PDF documents;
the navigation page number extraction module, used for converting each page of the PDF document into a picture, sorting all the pictures in page order, and extracting the picture sequence number as the navigation page number of the page;
the page number identification module, used for identifying the page numbers in all page pictures;
the page number verification and correction module, used for performing secondary verification and correction on the picture page numbers based on the page number difference, and completing missing picture page numbers by using the PDF document page number difference;
the page number matching and association module, used for matching and associating the catalog items with the navigation page numbers, finally obtaining all catalog titles, catalog page numbers and navigation page numbers;
and the PDF document catalog output module, used for outputting the navigable PDF document catalog.
The catalog page lookup module includes one of:
1) For text-version PDF documents, comprising:
the page text extraction sub-module, used for extracting all text from every page of the PDF document, or page by page starting from the first page;
the catalog page searching sub-module, used for searching for the pages on which the catalog of the text-version PDF document is located;
2) For a scanned version of a PDF document, comprising:
The catalog identification model building sub-module is used for marking the catalog pages and the non-catalog pages according to the general image classification model, and training the general image classification model by using marking data to obtain a catalog identification model;
and the catalog page searching sub-module, used for identifying, with the catalog identification model, every page of the PDF document or page by page starting from the first page, judging whether each page is a catalog page, and searching for all the catalog pages of the PDF document.
The page number recognition module includes:
1) The page number recognition model building sub-module is used for marking the page number of the page based on the universal OCR model, and training the universal OCR model by using marking data to obtain a page number recognition model;
2) And the page number identification sub-module is used for identifying the page number of the picture page by using the page number identification model.
The page number verification and correction module includes:
1) The page number difference calculation sub-module, used for subtracting the picture page number from the navigation page number to obtain the page number difference of the PDF document;
2) The page number verification and correction submodule is used for carrying out secondary verification on the picture page number and correcting the picture page number with the identification error by using the PDF document page number difference.
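The two sub-modules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the field names page_num, folder_int and folder_patched follow the code example in this description, the most frequent difference is taken as the PDF document page number difference as stated in the claims, and leaving pages whose computed number falls below 1 without a picture page number is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def document_page_diff(records: List[Dict]) -> Optional[int]:
    """Most frequent (navigation page number - picture page number) difference."""
    diffs = Counter(r["page_num"] - r["folder_int"]
                    for r in records if r.get("folder_int") is not None)
    return diffs.most_common(1)[0][0] if diffs else None


def patch_picture_pages(records: List[Dict]) -> List[Dict]:
    """Verify, correct and complete picture page numbers using the document difference."""
    diff = document_page_diff(records)
    patched = []
    for r in records:
        expected = None if diff is None else r["page_num"] - diff
        if expected is not None and expected < 1:
            expected = None  # assumption: pages before printed page 1 get no number
        patched.append({**r, "folder_patched": expected,
                        "real_pn": r["page_num"] + 1})
    return patched


# Page 6 is misrecognized as 8 and page 8 has no recognized number.
records = [{"page_num": 4, "folder_int": 1}, {"page_num": 5, "folder_int": 2},
           {"page_num": 6, "folder_int": 8}, {"page_num": 7, "folder_int": 4},
           {"page_num": 8, "folder_int": None}]
patched = patch_picture_pages(records)
```

Using the most frequent difference makes the correction tolerant of a minority of recognition errors, since a misread page number only contributes one outlying difference.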
The specific implementation manner of each module is the same as that of each step of the aforementioned PDF document navigable directory extraction method, and is not repeated here.
In this embodiment, the PDF document navigable directory extraction device provides a front-end page for an application system. A PDF document is uploaded to a server through an input module on the page, and the algorithm implemented according to the method is invoked. The algorithm comprises the catalog page searching module, the catalog extraction module, the navigation page number extraction module, the page number identification module, the page number verification and correction module, the page number matching and association module, and so on. It outputs a navigable PDF document catalog, which is connected to the application system to realize document navigation and jumping.
The embodiment of the invention also provides electronic equipment, which comprises:
a memory for storing a computer program;
And the processor is used for executing the program stored in the memory and realizing the steps of the embodiment of the PDF document navigable directory extraction method.
For specific implementation of each step of the method and related explanation content, reference may be made to the aforementioned embodiment of the PDF document navigable directory extraction method, which is not described herein.
The memory of the electronic device according to this embodiment may include a random access memory (Random Access Memory, RAM) or may include a non-volatile memory (Non-Volatile Memory, NVM), such as at least one magnetic disk memory.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the invention also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of the embodiment of the PDF document navigable directory extraction method are realized. For the specific implementation of each step of the method and the related explanations, reference may be made to the aforementioned embodiment of the PDF document navigable directory extraction method, which is not repeated here.
It should be noted that, in the present specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment is mainly described in a different point from other embodiments.
In particular, for apparatus, electronic devices, computer readable storage medium embodiments, since they are substantially similar to method embodiments, the description is relatively simple, and relevant references are made to the partial description of method embodiments.
The above description is only illustrative of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention, and any alterations and modifications made by those skilled in the art based on the above disclosure are within the scope of the present invention.

Claims (6)

1. The PDF document navigable directory extraction method is characterized by comprising the following steps:
S101, searching for the catalog pages of a PDF document;
the searching for the catalog pages of the PDF document comprises either of the following methods:
1) if the PDF document is a text version: extracting all text from every page of the PDF document, or page by page starting from the first page, and then searching for the pages on which the catalog of the text-version PDF document is located;
2) if the PDF document is a scanned version: identifying, by using a directory identification model, every page of the PDF document or page by page starting from the first page, judging whether each page is a directory page, and searching for all directory pages of the PDF document;
S102, extracting the catalog titles and catalog page numbers of all catalog items on the pages where the catalog is located;
S103, converting each page of the PDF document into a picture, sorting all the pictures in page order, and extracting the picture sequence numbers as the navigation page numbers of the pages;
S104, identifying the page numbers in all page pictures to obtain picture page numbers;
S105, calculating, for each picture, the page number difference between the navigation page number and the picture page number; counting the page number differences of all pictures and sorting them by frequency from high to low; taking the page number difference with the highest frequency as the page number difference of the PDF document; and performing secondary verification on the picture page numbers, correcting misrecognized picture page numbers and completing missing picture page numbers by using the page number difference of the PDF document;
S106, searching a picture page number which is the same as the directory page number in the directory entry, matching the navigation page number corresponding to the picture page number with the directory entry, and obtaining the navigation page number corresponding to the directory entry, wherein the directory entry comprises a directory title and a directory page number;
S107, outputting a navigable PDF document catalog composed of catalog titles, catalog page numbers and navigation page numbers, and linking the catalog with the page by retrieving the navigation page numbers to realize jumping.
2. The PDF document navigable directory extraction method of claim 1, wherein the using a directory identification model to find PDF document directory pages comprises the following method: labeling the catalog pages and the non-catalog pages based on a general image classification model, and training the general image classification model with the labeled data to obtain the catalog identification model.
3. A PDF document navigable directory extraction apparatus, comprising:
the catalog page searching module is used for searching catalog pages of PDF documents;
for text-version PDF documents, the catalog page searching module includes: the page text extraction sub-module, used for extracting all text from every page of the PDF document, or page by page starting from the first page; and the catalog page searching sub-module, used for searching for the pages on which the catalog of the text-version PDF document is located;
for scanned-version PDF documents, the catalog page searching module uses a catalog identification model to identify every page of the PDF document, or page by page starting from the first page, judges whether each page is a catalog page, and searches for all catalog pages of the PDF document;
the catalog extraction module is used for extracting catalog contents of PDF documents, and extracting titles and page numbers of all catalog items of catalog pages of different types of PDF documents by adopting different methods;
the navigation page number extraction module is used for converting each page of the PDF document into pictures, sequencing all the pictures according to page sequence, and extracting the picture sequence as the navigation page number of the page; the page number identification module is used for identifying the page numbers of all page pictures to obtain the picture page numbers;
the page number verification and correction module, used for calculating, for each picture, the page number difference between the navigation page number and the picture page number, counting the page number differences of all pictures and sorting them by frequency from high to low, taking the page number difference with the highest frequency as the page number difference of the PDF document, and performing secondary verification on the picture page numbers, correcting misrecognized picture page numbers and completing missing picture page numbers by using the page number difference of the PDF document;
the page number matching and associating module is used for searching a picture page number which is the same as the directory page number in the directory entry, matching the navigation page number corresponding to the picture page number with the directory entry, and finally obtaining the navigation page number corresponding to the directory entry; the catalog item comprises a catalog title and a catalog page number;
the PDF document catalog output module is used for outputting a navigable PDF document catalog composed of catalog titles, catalog page numbers and navigation page numbers, and the catalog is linked with the page by searching the navigation page numbers to realize the jump.
4. The PDF document navigable catalog extraction apparatus of claim 3 wherein the catalog page lookup module includes a catalog recognition model training sub-module for labeling catalog pages and non-catalog pages based on a generic image classification model, the generic image classification model being trained using labeling data to obtain the catalog recognition model.
5. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a program stored on a memory, implementing the method steps of any one of claims 1-2.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-2.
CN202311852456.6A 2023-12-29 2023-12-29 PDF document navigable directory extraction method and device, electronic equipment and storage medium Active CN117493712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311852456.6A CN117493712B (en) 2023-12-29 2023-12-29 PDF document navigable directory extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311852456.6A CN117493712B (en) 2023-12-29 2023-12-29 PDF document navigable directory extraction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117493712A CN117493712A (en) 2024-02-02
CN117493712B true CN117493712B (en) 2024-06-21

Family

ID=89669425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311852456.6A Active CN117493712B (en) 2023-12-29 2023-12-29 PDF document navigable directory extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117493712B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048908A (en) * 2022-06-29 2022-09-13 珠海豹好玩科技有限公司 Method and device for generating text directory

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06231127A (en) * 1993-02-01 1994-08-19 Hitachi Ltd Method for automatically sampling page number
JP2002024796A (en) * 2000-07-06 2002-01-25 Matsushita Electric Ind Co Ltd Character recognition device and method
CN103778141A (en) * 2012-10-23 2014-05-07 南开大学 Mixed PDF book catalogue automatic extracting algorithm
CN105095285B (en) * 2014-05-14 2019-03-26 北大方正集团有限公司 Digital publication guide to visitors catalogue treating method and apparatus
CN107291682B (en) * 2016-03-30 2020-12-08 同方知网(北京)技术有限公司 Multi-electronic-document segmentation algorithm based on skip processing and double verification
CN106250830B (en) * 2016-07-22 2019-05-24 浙江大学 Digital book structured analysis processing method
US10956731B1 (en) * 2019-10-09 2021-03-23 Adobe Inc. Heading identification and classification for a digital document
CN110837788B (en) * 2019-10-31 2022-10-28 北京深度制耀科技有限公司 PDF document processing method and device
CN111753500B (en) * 2020-07-07 2021-05-04 江苏中威科技软件系统有限公司 Method for merging and displaying formatted electronic form and OFD (office file format) and generating catalog
CN112016273B (en) * 2020-09-03 2024-03-12 平安科技(深圳)有限公司 Document catalog generation method, device, electronic equipment and readable storage medium
CN112818647A (en) * 2021-01-14 2021-05-18 史朝斌 System manuscript examining method based on image recognition comparison and artificial intelligence automatic comparison
CN116092108A (en) * 2023-03-20 2023-05-09 四川竺信档案数字科技有限责任公司 Method, system and storage medium for generating PDF file by scanning entity document

Also Published As

Publication number Publication date
CN117493712A (en) 2024-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant