CN110705223A - Footnote recognition and extraction method for multi-page layout document - Google Patents

Info

Publication number
CN110705223A
Authority
CN
China
Prior art keywords
footmark
line
page
footnote
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910743304.XA
Other languages
Chinese (zh)
Inventor
徐剑波
张诗玉
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongxin Boya Technology Co Ltd
Original Assignee
Beijing Zhongxin Boya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongxin Boya Technology Co Ltd
Priority to CN201910743304.XA
Publication of CN110705223A
Legal status: Pending

Abstract

The invention relates to a footnote recognition and extraction method for multi-page layout documents and belongs to the field of information extraction. The method extracts the text blocks and line segments in a layout document; selects, as footnote text candidates, the text blocks on each page whose font size, alignment, and other features meet given conditions; compares small-font text in the body with text in the footnote area to improve the accuracy of the candidate footnote references; extracts footnote lines (the horizontal rules that separate the footnotes from the body), distinguishing starting footnote lines from continuation footnote lines; cross-verifies the footnote lines against the footnote text to eliminate interfering text and interfering line segments; and finally confirms the footnote area. The method automatically extracts footnotes from multi-page layout documents, adapts to different typesetting styles, handles footnotes located either at the bottom of a page or at the end of an article, and maintains high accuracy.

Description

Footnote recognition and extraction method for multi-page layout document
Technical Field
The invention relates to the field of information extraction from layout documents, and in particular to a footnote recognition and extraction method for multi-page layout documents.
Background
A layout document format is an electronic document format with a fixed layout. Its rendering is independent of the device, so the layout looks the same whether the document is read on screen or printed on different equipment. Layout documents are mainly used for publishing, distributing, and archiving documents after they have been finalized. Common layout document formats are PDF, CEBX, and OFD. Such a format defines the rendering data of each page, including the position, color, font size, and similar properties of every object on the page (text, images, graphics, and so on), so that a parser or reader can present the document page by page for people to read.
Within a document, cross-references (for example footnotes, bibliographic references, and charts) are used to label and further explain their corresponding entities. A cross-reference has two parts: the reference and the entity. For a footnote, the reference is the marker in the body, usually set as a superscript, and the entity is the text at the bottom of the page or later in the document that further explains the passage carrying the marker. Because footnotes are one kind of cross-reference, recognizing them plays an important role in the core tasks of document structure understanding and page element type annotation. The content of the footnote itself and the link between the reference and the entity also help in understanding the document. However, a layout document does not record this reference relationship, so when the document is converted to a structured form the relationship must be restored; that is, the footnotes must be recognized and their entities extracted.
In the prior art there is little research on extracting footnotes from documents. Anjewierden describes a document analysis system named AIDAS that extracts the logical structure of a document (including footnotes, figure captions, and so on) incrementally and bottom-up, but the grammar used by the system depends on the particular document type. Marinai et al. extracted footnotes while converting PDF to EPUB. They treat a number as a footnote reference when its font size is smaller than a certain proportion of the body font size. Under this assumption, non-numeric references cannot be identified, while formula superscripts may be misidentified as footnote references. They also treat paragraphs that begin with a number and use a font size smaller than the body font size as footnote entities. However, not every footnote uses a font size smaller than the body font size, and the method may misidentify ordered lists and tables of contents as footnotes. The patent document "A method for identifying footnotes in layout documents and a method for associating footnotes with footnote citations" (application number CN 102015000342271) describes a footnote recognition method based on feature clustering. It obtains the style features of the footnotes in a document by clustering, so that recognition can adapt to documents of different styles without depending on specific features and rules, and then associates the footnotes with the footnote citations. However, because typesetting styles are diverse and lists can interfere with footnotes, the clustering may fail, and the footnote regions then cannot be recognized.
In addition, that invention assumes that footnotes are located at the bottom of the page, whereas in practice many documents place their footnotes at the end of the whole article.
Therefore, those skilled in the art need a method that extracts footnotes from layout documents automatically, adapts to different typesetting styles, and handles footnote areas located either at the bottom of a page or at the end of an article, so as to improve the accuracy of footnote extraction.
Disclosure of Invention
To solve the problems in the prior art and achieve the above purposes, the invention provides a footnote recognition and extraction method for multi-page layout documents.
The technical scheme adopted by the invention is a footnote recognition and extraction method for a multi-page layout document, comprising the following steps:
Step one: parse the layout document and obtain, page by page, the page information and the text and path information within each page.
Step two: preprocess. Identify headers and footers and remove them from each page; count the distribution of font sizes on each page and take the most frequent font size as the body font size; collect the coordinate information of the text lines on each page.
Step three: traverse page by page and extract the footnote text blocks on each page. Traverse the text blocks on the page and add each block that meets one of the following conditions to the corresponding set. If the font size of the block is smaller than a given proportion of the body font size (the proportion coefficient is set to 0.8) and the block is aligned with the left boundary of the page's type area, add the block to the set matched_objs. If the font size of the block is smaller than that proportion of the body font size and the block is not aligned with the left boundary of the type area, add the block to the set inline_objs. If the font size of the block does not meet the requirement but the block is aligned with the left boundary of the type area, add the block to the set candi_lines. If inline_objs is not empty and matched_objs is empty, match inline_objs against candi_lines pairwise: if the leading string of a text block in candi_lines (denoted line) matches a text block in inline_objs (denoted obj), add line to matched_objs and remove obj from inline_objs; continue until inline_objs is empty or all of candi_lines has been compared. Take matched_objs as the footnote text block set of the page, denoted smallobjs.
Step four: extract the footnote line of the page. Starting from the bottom of the page, search for horizontal lines of the following two kinds and collect them as the footnote line candidate set: a line that is aligned with the left boundary of the type area and has a length close to 144 is marked as a starting footnote line; a line whose length equals the width of the type area is marked as a continuation footnote line. Select one line from the candidate set as the footnote line according to the following rules. If the candidate set is empty, there is no footnote line. If the footnote text block set smallobjs is empty, take the first line in the candidate set as the footnote line. If both the candidate set and smallobjs are non-empty, check for conflicts in order to remove interference. The check proceeds as follows, where candi_count denotes the number of candidate footnote lines: traverse the candidate footnote lines one by one and, taking each candidate as a dividing line, split the footnote text blocks into an upper set above_obj and a lower set below_obj; stop the traversal once above_obj is empty. This yields n segmentation schemes in total. If n is 1, there is only one valid segmentation, and the candidate footnote line corresponding to it is selected; if n is greater than 1, the candidate footnote line corresponding to the last segmentation scheme is selected.
Step five: verify and confirm the footnote area. If no footnote line is found, the page has no footnotes. If a footnote line is found, take the area below it as the candidate footnote area and check it further. Extract the font size distribution of the text in the candidate area and take the most frequent font size as the footnote font size. If smallobjs is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, reject the candidate and return as if there were no footnote. If smallobjs is not empty, the current footnote line is a starting footnote line, and none of the text blocks below it belongs to smallobjs, reject the candidate and return as if there were no footnote. Once these checks pass, the candidate area is confirmed as the footnote area.
The advantages of the method are as follows. It extracts the text blocks and line segments in the layout document, selects as footnote text candidates the text blocks whose font size, alignment, and other features meet given conditions, and compares small-font text in the body with text in the footnote area to improve the accuracy of the candidate footnote references. At the same time it extracts the footnote lines, distinguishes starting footnote lines from continuation footnote lines, cross-verifies the footnote lines against the footnote text to eliminate interfering text and interfering line segments, and finally confirms the footnote area. The method automatically extracts footnotes from multi-page layout documents, adapts to different typesetting styles, handles footnotes located either at the bottom of a page or at the end of an article, and maintains high accuracy.
Drawings
The drawings are provided for a better understanding of the invention and do not unduly limit it. FIG. 1 is a schematic diagram of the steps of the footnote recognition and extraction method for a multi-page layout document.
Detailed Description
Exemplary embodiments of the invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and are to be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the invention. For clarity and conciseness, descriptions of well-known functions and structures are omitted. A footnote recognition and extraction method for a multi-page layout document includes the following steps:
1. Parse the layout document and obtain, page by page, the page information and the text and path information within each page, where:
a) The page information includes the page size.
b) The text block information includes, for the characters in the block, the code, color, position (bounding rectangle), font size, italic and bold attributes, and so on; the set of text blocks is denoted raw_obj.
c) The original output sequence number of each text block in the layout document is recorded (denoted idx).
d) The path information consists of the paths, or line segments, drawn by instructions in the layout document.
2. Preprocess:
a) Identify headers and footers and remove them from each page.
b) Count the distribution of font sizes on each page and take the most frequent font size as the body font size.
c) Collect the coordinate information of the text lines on each page.
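The body font size estimate in step 2 b) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `TextBlock` class and its field names are hypothetical, and weighting each font size by character count (rather than by block count) is an assumption, since the text only says that the most frequent font size is taken.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block; the field names are assumptions."""
    text: str
    font_size: float


def body_font_size(blocks):
    """Return the font size that covers the most characters on the page."""
    counts = Counter()
    for b in blocks:
        # Assumption: weight by character count so that one long paragraph
        # outweighs many short labels or footnote markers.
        counts[round(b.font_size, 1)] += len(b.text)
    if not counts:
        return None  # empty page: no body font size can be estimated
    return counts.most_common(1)[0][0]
```

Rounding the font size to one decimal place is a further assumption that keeps nearly identical sizes in the same bucket.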
3. Traverse page by page and extract the footnote text blocks (smallobjs) on each page. The specific steps are:
a) Traverse the text blocks on the page and add each block that meets one of the following conditions to the corresponding set:
i. If the font size of the block is smaller than a certain proportion of the body font size and the block is aligned with the left boundary of the page's type area, add the block to the set matched_objs. The proportion coefficient can be set empirically to 0.8.
ii. If the font size of the block is smaller than a certain proportion of the body font size (again 0.8, set empirically) and the block is not aligned with the left boundary of the type area, add the block to the set inline_objs.
iii. If the font size of the block does not meet the above requirement but the block is aligned with the left boundary of the type area, add the block to the set candi_lines.
b) If inline_objs is not empty and matched_objs is empty, match inline_objs against candi_lines pairwise:
i. If the leading string of a text block (line) in candi_lines matches a text block (obj) in inline_objs, add line to matched_objs and remove obj from inline_objs.
ii. Continue until inline_objs is empty or all of candi_lines has been compared.
c) Take matched_objs as the footnote text block set of the page (denoted smallobjs).
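Step 3 can be sketched roughly as below. The set names `matched_objs`, `inline_objs`, `candi_lines`, and `smallobjs` come from the description; everything else is an assumption made for illustration: the `TextBlock` class, the `left_margin` parameter (the left boundary of the type area), the alignment tolerance `tol`, and the use of a simple prefix test for "the leading string matches".

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block; the field names are assumptions."""
    text: str
    font_size: float
    x0: float  # left edge of the bounding rectangle


def extract_smallobjs(blocks, body_size, left_margin, ratio=0.8, tol=1.0):
    """Return the footnote text block set (smallobjs) of one page."""
    matched_objs, inline_objs, candi_lines = [], [], []
    for b in blocks:
        small = b.font_size < ratio * body_size
        aligned = abs(b.x0 - left_margin) <= tol  # tol is an assumed tolerance
        if small and aligned:
            matched_objs.append(b)      # 3 a) i: small and left-aligned
        elif small:
            inline_objs.append(b)       # 3 a) ii: small, inside the text
        elif aligned:
            candi_lines.append(b)       # 3 a) iii: normal size, left-aligned

    # 3 b): only when no small left-aligned block was found, pair in-text
    # markers with left-aligned lines that begin with the same string.
    if inline_objs and not matched_objs:
        for line in candi_lines:
            if not inline_objs:
                break
            for obj in inline_objs:
                marker = obj.text.strip()
                if marker and line.text.lstrip().startswith(marker):
                    matched_objs.append(line)
                    inline_objs.remove(obj)
                    break  # safe: we leave the loop right after removing
    return matched_objs  # 3 c): this is smallobjs
```

The fallback in 3 b) is what lets the sketch handle footnotes set in the body font size: the small in-text marker "1" is paired with a left-aligned line that starts with "1".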
4. Extract the footnote line of the page. The specific steps are:
a) Starting from the bottom of the page, search for horizontal lines that satisfy one of the following conditions and collect them as the footnote line candidate set (an ordered set):
i. The line is aligned with the left boundary of the page's type area and its length is close to 144 (marked as a starting footnote line);
ii. The length of the line equals the width of the type area (marked as a continuation footnote line);
b) Select one line from the candidate set as the footnote line according to the following rules:
i. If the footnote line candidate set is empty, there is no footnote line;
ii. If the footnote text block set (smallobjs) is empty, take the first line in the candidate set as the footnote line;
iii. If both the footnote line candidate set and the footnote text block set are non-empty, check for conflicts in order to remove interference. The check is as follows (let the number of candidate footnote lines be candi_count):
1. Traverse the candidate footnote lines one by one and, taking each candidate as a dividing line, split the footnote text blocks into an upper set and a lower set (above_obj, below_obj). Stop the traversal once above_obj is empty. This yields n segmentation schemes in total;
2. If n is 1, there is only one valid segmentation, and the candidate footnote line corresponding to it is selected;
3. If n is greater than 1, the candidate footnote line corresponding to the last segmentation scheme is selected;
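The candidate search and the selection rule of step 4 might look like the sketch below. It rests on several assumptions that the text does not settle: the `HLine` and `TextBlock` classes are hypothetical, y coordinates are taken to grow downward, `tol` is an invented tolerance for "close to", and the segmentation that empties `above_obj` is counted as the last scheme.

```python
from dataclasses import dataclass


@dataclass
class HLine:
    """Hypothetical horizontal line segment taken from the path information."""
    x0: float
    x1: float
    y: float        # assumption: y grows downward, larger y is lower on the page
    kind: str = ""  # "start" or "continuation", set by find_candidates


@dataclass
class TextBlock:
    text: str
    y0: float  # top edge of the bounding rectangle


def find_candidates(lines, left, width, start_len=144.0, tol=2.0):
    """Collect candidate footnote lines, ordered from the page bottom upward."""
    candidates = []
    for ln in sorted(lines, key=lambda l: l.y, reverse=True):
        length = ln.x1 - ln.x0
        if abs(length - width) <= tol:
            candidates.append(HLine(ln.x0, ln.x1, ln.y, "continuation"))
        elif abs(ln.x0 - left) <= tol and abs(length - start_len) <= tol:
            candidates.append(HLine(ln.x0, ln.x1, ln.y, "start"))
    return candidates


def select_footnote_line(candidates, smallobjs):
    """Apply rules 4 b) i to iii and return the chosen line, or None."""
    if not candidates:
        return None            # rule i: no footnote line
    if not smallobjs:
        return candidates[0]   # rule ii: first (lowest) candidate
    selected = None
    for cand in candidates:
        above_obj = [b for b in smallobjs if b.y0 < cand.y]
        selected = cand        # each pass is one segmentation scheme
        if not above_obj:
            break              # every footnote block now lies below the line
    return selected            # the line of the last scheme
```

Under these assumptions the rule picks the lowest candidate that has every footnote text block beneath it, so a shorter rule drawn inside the footnote area does not displace the real separator.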
5. Verify and confirm the footnote area. The specific steps are:
a) If no footnote line is found, the page has no footnotes;
b) If a footnote line is found, take the area below it as the candidate footnote area and check it further:
i. Extract the font size distribution of the text in the candidate area and take the most frequent font size as the footnote font size;
ii. If the footnote text block set (smallobjs) is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, reject the candidate and return as if there were no footnote;
iii. If the footnote text block set (smallobjs) is not empty, the current footnote line is a starting footnote line, and none of the text blocks below the current footnote line belongs to smallobjs, reject the candidate and return as if there were no footnote;
c) Once these checks pass, the candidate area is confirmed as the footnote area.
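The checks of step 5 can be sketched as below. The function name, the `TextBlock` fields, and the threshold `max_diff` are all hypothetical: the description only says that the font size difference is "larger", without giving a value. Returning no footnote when nothing lies below the line is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block; the field names are assumptions."""
    text: str
    font_size: float
    y0: float  # top edge; assumption: y grows downward


def confirm_footnote_region(line_y, line_kind, blocks, smallobjs,
                            prev_footnote_size=None, max_diff=0.5):
    """Return (region_blocks, footnote_size), or None for 'no footnote'."""
    if line_y is None:
        return None  # 5 a): no footnote line was found
    region = [b for b in blocks if b.y0 > line_y]  # 5 b): area below the line
    sizes = Counter(round(b.font_size, 1) for b in region)
    if not sizes:
        return None  # assumption: an empty area is treated as no footnote
    footnote_size = sizes.most_common(1)[0][0]  # 5 b) i

    if not smallobjs:
        # 5 b) ii: a continuation line whose text size departs from the
        # previous page's footnote size is rejected.
        if (line_kind == "continuation" and prev_footnote_size is not None
                and abs(footnote_size - prev_footnote_size) > max_diff):
            return None
    else:
        # 5 b) iii: a starting line with no footnote text block below it
        # is rejected.
        if line_kind == "start" and not any(b in smallobjs for b in region):
            return None
    return region, footnote_size  # 5 c): confirmed footnote area
```

A caller would pass the y coordinate and kind of the line selected in step 4, together with the footnote font size confirmed on the previous page, if any.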

Claims (4)

1. A footnote recognition and extraction method for a multi-page layout document, characterized by comprising the following steps: step one, parsing the layout document and obtaining, page by page, the page information and the text and path information within each page; step two, preprocessing: identifying headers and footers and removing them from each page, counting the distribution of font sizes on each page and taking the most frequent font size as the body font size, and collecting the coordinate information of the text lines on each page; step three, traversing page by page and extracting the footnote text blocks on each page; step four, extracting the footnote line of the page; and step five, verifying and confirming the footnote area.
2. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step three specifically comprises: adding each text block that meets one of the following conditions to the corresponding set: if the font size of the text block is smaller than the body font size and the text block is aligned with the left boundary of the page's type area, adding the text block to the set matched_objs; if the font size of the text block is smaller than the body font size and the text block is not aligned with the left boundary of the type area, adding the text block to the set inline_objs; if the font size of the text block does not meet the requirement but the text block is aligned with the left boundary of the type area, adding the text block to the set candi_lines; if inline_objs is not empty and matched_objs is empty, matching inline_objs against candi_lines pairwise: if the leading string of a text block in candi_lines, denoted line, matches a text block obj in inline_objs, adding line to matched_objs and removing obj from inline_objs, until inline_objs is empty or all of candi_lines has been compared; and taking matched_objs as the footnote text block set of the page, denoted smallobjs.
3. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step four specifically comprises: starting from the bottom of the page, searching for horizontal lines that meet the conditions and collecting them as the footnote line candidate set, a line that is aligned with the left boundary of the page's type area and has a length close to 144 being marked as a starting footnote line, and a line whose length equals the width of the type area being marked as a continuation footnote line; selecting one line from the footnote line candidate set as the footnote line according to the following rules: if the footnote line candidate set is empty, there is no footnote line; if the footnote text block set smallobjs is empty, the first line in the footnote line candidate set is taken as the footnote line; if both the footnote line candidate set and the footnote text block set are non-empty, checking for conflicts in order to remove interference; the check being: letting the number of candidate footnote lines be candi_count, traversing the candidate footnote lines one by one, taking each candidate footnote line as a dividing line, splitting the footnote text blocks into an upper set above_obj and a lower set below_obj, and stopping the traversal once above_obj is empty, thereby obtaining n segmentation schemes in total; if n is 1, there is only one valid segmentation and the candidate footnote line corresponding to it is selected; and if n is greater than 1, the candidate footnote line corresponding to the last segmentation scheme is selected.
4. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step five specifically comprises: if no footnote line is found, the page has no footnotes; if a footnote line is found, taking the area below the footnote line as the candidate footnote area and checking it further: extracting the font size distribution of the text in the candidate area and taking the most frequent font size as the footnote font size; if the footnote text block set smallobjs is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, rejecting the candidate and returning as if there were no footnote; if the footnote text block set smallobjs is not empty, the current footnote line is a starting footnote line, and none of the text blocks below the current footnote line belongs to smallobjs, rejecting the candidate and returning as if there were no footnote; and, once these checks pass, confirming the candidate area as the footnote area.
CN201910743304.XA 2019-08-13 2019-08-13 Footnote recognition and extraction method for multi-page layout document Pending CN110705223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910743304.XA CN110705223A (en) 2019-08-13 2019-08-13 Footnote recognition and extraction method for multi-page layout document

Publications (1)

Publication Number Publication Date
CN110705223A (en) 2020-01-17

Family

ID=69193388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910743304.XA Pending CN110705223A (en) 2019-08-13 2019-08-13 Footnote recognition and extraction method for multi-page layout document

Country Status (1)

Country Link
CN (1) CN110705223A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11763321B2 (en) 2018-09-07 2023-09-19 Moore And Gasperecz Global, Inc. Systems and methods for extracting requirements from regulatory content
CN111382561A (en) * 2020-03-13 2020-07-07 北大方正集团有限公司 File verification method, device, equipment and storage medium
CN111382561B (en) * 2020-03-13 2022-11-01 北大方正集团有限公司 File verification method, device, equipment and storage medium
CN111626036A (en) * 2020-05-27 2020-09-04 南京蓝鲸人网络科技有限公司 Novel image-text typesetting processing method
CN111626036B (en) * 2020-05-27 2021-04-30 南京蓝鲸人网络科技有限公司 Image-text typesetting processing method
US10956673B1 (en) 2020-09-10 2021-03-23 Moore & Gasperecz Global Inc. Method and system for identifying citations within regulatory content
US11232358B1 (en) 2020-11-09 2022-01-25 Moore & Gasperecz Global Inc. Task specific processing of regulatory content
US11314922B1 (en) 2020-11-27 2022-04-26 Moore & Gasperecz Global Inc. System and method for generating regulatory content requirement descriptions
CN113128195A (en) * 2021-04-23 2021-07-16 达而观信息科技(上海)有限公司 Method and device for automatically searching local difference points based on document structure in financial industry
US11823477B1 (en) 2022-08-30 2023-11-21 Moore And Gasperecz Global, Inc. Method and system for extracting data from tables within regulatory content
CN117473980A (en) * 2023-11-10 2024-01-30 中国医学科学院医学信息研究所 Structured analysis method of portable document format file and related products

Similar Documents

Publication Publication Date Title
CN110705223A (en) Footnote recognition and extraction method for multi-page layout document
Clark et al. Pdffigures 2.0: Mining figures from research papers
Kleber et al. Cvl-database: An off-line database for writer retrieval, writer identification and word spotting
CN106250830B (en) Digital book structured analysis processing method
JP3282860B2 (en) Apparatus for processing digital images of text on documents
CN109933796B (en) Method and device for extracting key information of bulletin text
US5748805A (en) Method and apparatus for supplementing significant portions of a document selected without document image decoding with retrieved information
JP3292388B2 (en) Method and apparatus for summarizing a document without decoding the document image
EP0544433B1 (en) Method and apparatus for document image processing
CN110704570A (en) Continuous page layout document structured information extraction method
EP1907946B1 (en) A method for finding text reading order in a document
US9008425B2 (en) Detection of numbered captions
Ma et al. Adaptive Hindi OCR using generalized Hausdorff image comparison
Bai et al. Keyword spotting in document images through word shape coding
Nurminen Algorithmic extraction of data in tables in PDF documents
JP2007122403A (en) Device, method, and program for automatically extracting document title and relevant information
CN112395851A (en) Text comparison method and device, computer equipment and readable storage medium
Huang et al. Associating text and graphics for scientific chart understanding
CN112487293A (en) Method, device and medium for extracting safety accident case structured information
CN111291535A (en) Script processing method and device, electronic equipment and computer readable storage medium
Marinai Text retrieval from early printed books
CN110705224A (en) Plate center identification and alignment method for multi-page layout document
Kumar et al. Line based robust script identification for Indian languages
JPH11232439A (en) Document picture structure analysis method
CN116324910A (en) Method and system for performing image-to-text conversion on a device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200117