CN110705223A

CN110705223A - Footnote recognition and extraction method for multi-page layout document

Info

Publication number: CN110705223A
Application number: CN201910743304.XA
Authority: CN
Inventors: 徐剑波; 张诗玉; 王磊
Original assignee: Beijing Zhongxin Boya Technology Co Ltd
Current assignee: Beijing Zhongxin Boya Technology Co Ltd
Priority date: 2019-08-13
Filing date: 2019-08-13
Publication date: 2020-01-17

Abstract

The invention relates to a footnote recognition and extraction method for multi-page layout documents and belongs to the field of information extraction. The method extracts the text blocks and line segments in the layout file, takes the text blocks in a page whose font size, typesetting, and other features meet given conditions as candidate footnote text, and compares small text in the body with text in the footnote area to improve the accuracy of the candidate footnote references. It also extracts footnote lines, distinguishes initial footnote lines from continuation footnote lines, cross-verifies the footnote lines against the footnote text to eliminate interfering text and interfering line segments, and finally confirms the footnote area. The method extracts footnotes automatically from multi-page layout documents, adapts to different typesetting styles, handles footnotes located at the bottom of a page or at the end of an article, and ensures high accuracy.

Description

Footnote recognition and extraction method for multi-page layout document

Technical Field

The invention relates to the field of information extraction from layout documents, and in particular to a footnote recognition and extraction method for multi-page layout documents.

Background

A layout document format is an electronic document format with a fixed page presentation. Its rendering is independent of the device, so the result is the same whether the document is read on screen, printed, or sent to press on different equipment. Layout documents are mainly used for publishing, distributing, and archiving documents once they are finalized. Common layout document formats include PDF, CEBX, and OFD. A layout document format defines the presentation data of multiple pages, including the position, color, font size, and similar attributes of each object (text, images, graphics, and so on) on each page, so that a parser or reader can render the document page by page for people to read.

Within a document, cross-references (for example footnotes, bibliographic references, and figures and tables) label and further explain their corresponding entities. A cross-reference has two parts: the reference and the entity. For a footnote, the reference is the marker in the body text, usually set as a superscript, and the entity is the text at the bottom of the page or later in the document that further explains the passage the marker is attached to. Identifying footnotes, as one kind of cross-reference, is important for the core tasks of document structure understanding and page element type annotation. The content of the footnote itself and the link between reference and entity also help in understanding the document. A layout document, however, does not record this reference relationship, so when the document is structured the relationship must be restored, that is, the footnotes must be identified and their entities extracted.

Little prior work addresses the extraction of footnotes from documents. Anjewierden describes a document analysis system named AIDAS that extracts the logical structure of a document (including footnotes, figure captions, and so on) incrementally and bottom-up, but the grammar used by the system depends on the particular document type. Marinai et al. extract footnotes while converting PDF to EPUB. They treat a number whose font size is below a certain proportion of the body font size as a footnote reference. Under this assumption non-numeric references cannot be identified, and formula superscripts may be mistaken for footnote references. They also treat paragraphs that begin with a number and are set in a font size smaller than that of the document body as footnote entities. However, not every footnote is set smaller than the body text, and the method may mistake ordered lists and tables of contents for footnotes. The patent document "A method for identifying footnotes in layout documents and a method for associating footnotes with footnote citations" (application number CN 102015000342271) describes footnote identification based on feature clustering: it obtains the style features of the footnotes in a document by clustering, so that identification adapts to documents of different styles without relying on specific features and rules, and then associates the footnotes with their citations. However, because typesetting styles are diverse, lists may interfere with footnotes, and when effective clustering fails the footnote regions cannot be identified. In addition, that invention assumes that footnotes are located at the bottom of the page, whereas in practice many documents place their footnotes at the end of the whole article.

Therefore, those skilled in the art need a method that extracts footnotes from layout documents automatically, adapts to different typesetting styles, handles footnote areas located either at the bottom of a page or at the end of an article, and improves the accuracy of footnote extraction.

Disclosure of Invention

To solve the problems in the prior art and achieve the above objectives, the invention provides a footnote recognition and extraction method for multi-page layout documents.

The technical scheme adopted by the invention is as follows: a footnote recognition and extraction method for a multi-page layout document, comprising the following steps:

step one, parsing the layout document and obtaining, page by page, the page information of the layout document, the text in each page, and the path information;

step two, preprocessing: identifying headers and footers and removing them from each page; counting the distribution of font sizes in each page and taking the most frequent font size as the body font size; collecting the coordinate information of the text lines in each page;

step three, traversing page by page and extracting the footnote text blocks in each page, as follows: traverse the text blocks in the page and add those meeting the following conditions to the corresponding set. If the font size of a text block is smaller than a certain proportion of the body font size (the scaling factor is set to 0.8) and the block is aligned with the left boundary of the type area (the region of the page occupied by the body text), add it to the set matched_objs. If the font size is smaller than that proportion of the body font size and the block is not aligned with the left boundary of the type area, add it to the set inline_objs. If the font size does not meet the requirement but the block is aligned with the left boundary of the type area, add it to the set candi_lines. If inline_objs is not empty and matched_objs is empty, match inline_objs and candi_lines pairwise: if the leading string of a text block in candi_lines (denoted line) is identical to a text block in inline_objs (denoted obj), add line to matched_objs and remove obj from inline_objs; continue until inline_objs is empty or all of candi_lines has been compared. Take matched_objs as the footnote text block set of the page;

step four, extracting the footnote line of the page: starting from the bottom of the page, search for horizontal lines that meet the following conditions and take them as the footnote line candidate set. A line that is aligned with the left boundary of the type area and whose length is close to 144 is marked as an initial footnote line; a line whose length equals the width of the type area is marked as a continuation footnote line. Select one line from the candidate set as the footnote line according to the following rules: if the candidate set is empty, there is no footnote line; if the footnote text block set smallobjs is empty, take the first line of the candidate set as the footnote line; if both the candidate set and the footnote text block set are non-empty, check for conflicts in order to remove interference. The check is as follows: let the number of candidate footnote lines be candi_count; traverse the candidate footnote lines one by one and, taking each candidate line as a dividing line, split the footnote text blocks into an upper set above_obj and a lower set below_obj. The traversal stops once above_obj is empty, giving n segmentation schemes in total. If n is 1, there is only one valid segmentation and its candidate footnote line is taken as the selected footnote line; if n is greater than 1, the candidate footnote line of the last segmentation scheme is taken as the selected footnote line;

step five, verifying and confirming the footnote area, as follows: if no footnote line is found, the page has no footnotes. If a footnote line is found, take the area below it as the candidate footnote area and check it further: extract the font size distribution of the text in the candidate area and take the most frequent font size as the footnote font size. If smallobjs is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, reject the footnote and return as if the page had no footnote. If smallobjs is not empty, the current footnote line is an initial footnote line, and no text block in smallobjs lies below the current footnote line, reject the footnote and return as if the page had no footnote. After these checks, the candidate area is confirmed as the footnote area.

The method has the following advantages. It extracts the text blocks and line segments in the layout file, takes the text blocks in a page whose font size, typesetting, and other features meet given conditions as candidate footnote text, and compares small text in the body with text in the footnote area to improve the accuracy of the candidate footnote references. At the same time it extracts footnote lines, distinguishes initial footnote lines from continuation footnote lines, cross-verifies the footnote lines against the footnote text to eliminate interfering text and interfering line segments, and finally confirms the footnote area. The method extracts footnotes automatically from multi-page layout documents, adapts to different typesetting styles, handles footnotes located at the bottom of a page or at the end of an article, and ensures high accuracy.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting it. FIG. 1 is a schematic diagram of the steps of the footnote recognition and extraction method for a multi-page layout document.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to assist understanding and are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. For clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

A footnote recognition and extraction method for a multi-page layout document includes the following steps:

1. Parse the layout document and obtain, page by page, the page information and the text and path information in each page, where:

a) The page information includes the page size

b) The text block information includes the character codes, color, position (enclosing rectangle), font size, italic and bold flags, and similar attributes of the characters in the block; the set of text blocks is denoted raw_obj

c) The original output sequence number of each text block in the layout document (denoted idx)

d) The path information (paths, or line segments, drawn by the drawing instructions in the layout document)
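As an illustration of the data gathered in step 1, the following is a minimal sketch of containers for one parsed page. Only the names raw_obj and idx come from the text above; the class names, the other field names, and the tolerance value are assumptions made for this sketch, and no particular parsing library is implied.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical containers; only raw_obj and idx are names used in the source.

@dataclass
class TextBlock:
    idx: int                                  # original output order in the document
    text: str                                 # decoded characters of the block
    bbox: Tuple[float, float, float, float]   # enclosing rectangle (x0, y0, x1, y1)
    font_size: float
    color: str = "#000000"
    bold: bool = False
    italic: bool = False

@dataclass
class PathSegment:
    start: Tuple[float, float]
    end: Tuple[float, float]

    def is_horizontal(self, tol: float = 0.5) -> bool:
        # A segment counts as horizontal when both end points share (almost) the same y.
        return abs(self.start[1] - self.end[1]) <= tol

    def length(self) -> float:
        return math.hypot(self.end[0] - self.start[0], self.end[1] - self.start[1])

@dataclass
class Page:
    width: float
    height: float
    raw_obj: List[TextBlock] = field(default_factory=list)   # text blocks of the page
    paths: List[PathSegment] = field(default_factory=list)   # drawn line segments
```

Keeping the horizontal test and the length on the segment itself lets the later footnote line search filter paths without knowing how they were parsed.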

2. Preprocess

a) Identify headers and footers and remove them from each page

b) Count the distribution of font sizes in each page and take the most frequent font size as the body font size

c) Collect the coordinate information of the text lines in each page
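The body font size in step 2 b) is simply the mode of the font sizes found on the page. A minimal sketch, assuming one font size entry per character and rounding to one decimal place so that nearly equal sizes fall into the same bucket (the function name and the rounding are assumptions, not part of the source):

```python
from collections import Counter
from typing import Iterable

def body_font_size(font_sizes: Iterable[float], ndigits: int = 1) -> float:
    """Return the most frequent font size, taken as the body font size."""
    counts = Counter(round(size, ndigits) for size in font_sizes)
    if not counts:
        raise ValueError("page contains no text")
    return counts.most_common(1)[0][0]

# Six characters on a page: three at 10.5, two at 7.5, one heading character at 14.0
sizes = [10.5, 10.5, 10.5, 7.5, 7.5, 14.0]
print(body_font_size(sizes))
```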

3. Traverse page by page and extract the footnote text blocks (smallobjs) in each page, as follows:

a) Traverse the text blocks in the page and add those meeting the following conditions to the corresponding set

i. If the font size of the block is smaller than a certain proportion of the body font size and the block is aligned with the left boundary of the type area (the region of the page occupied by the body text), add the block to the set matched_objs. The scaling factor can be set empirically to 0.8

ii. If the font size of the block is smaller than that proportion of the body font size (the scaling factor can be set empirically to 0.8) and the block is not aligned with the left boundary of the type area, add the block to the set inline_objs

iii. If the font size of the block does not meet the above requirement but the block is aligned with the left boundary of the type area, add the block to the set candi_lines

b) If inline_objs is not empty and matched_objs is empty, match inline_objs and candi_lines pairwise

i. If the leading string of a text block (line) in candi_lines is identical to a text block (obj) in inline_objs, add line to matched_objs and remove obj from inline_objs

ii. Continue until inline_objs is empty or all of candi_lines has been compared

c) Take matched_objs as the footnote text block set of the page (denoted smallobjs)
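The classification and matching in step 3 can be sketched as follows. The set names and the 0.8 factor come from the text; the Block class, the alignment tolerance tol, and the use of a simple prefix comparison for "leading string is identical" are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    text: str
    x0: float          # left edge of the block
    font_size: float

def extract_smallobjs(blocks: List[Block], body_size: float, left_margin: float,
                      ratio: float = 0.8, tol: float = 1.0) -> List[Block]:
    matched_objs: List[Block] = []
    inline_objs: List[Block] = []
    candi_lines: List[Block] = []
    for b in blocks:
        small = b.font_size < ratio * body_size
        aligned = abs(b.x0 - left_margin) <= tol   # tol is a hypothetical tolerance
        if small and aligned:
            matched_objs.append(b)     # small text starting at the left boundary
        elif small:
            inline_objs.append(b)      # small text inside a line, e.g. a superscript marker
        elif aligned:
            candi_lines.append(b)      # body-size line that may still be a footnote
    # Fallback for footnotes set at body size: match in-line markers to line starts.
    if inline_objs and not matched_objs:
        for line in candi_lines:
            if not inline_objs:
                break
            for obj in inline_objs:
                if line.text.startswith(obj.text):
                    matched_objs.append(line)
                    inline_objs.remove(obj)
                    break              # safe: we leave the loop right after removing
    return matched_objs                # this is smallobjs for the page

blocks = [
    Block("Body text with a marker", 72.0, 10.5),
    Block("1", 200.0, 6.0),                       # superscript reference in the body
    Block("1 Footnote set at body size", 72.0, 10.5),
]
smallobjs = extract_smallobjs(blocks, body_size=10.5, left_margin=72.0)
print([b.text for b in smallobjs])
```

The fallback branch is what lets the method cope with footnotes that are not set smaller than the body text, the case the Background section names as a weakness of size-only heuristics.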

4. Extract the footnote line of the page, as follows:

a) Starting from the bottom of the page, search for horizontal lines satisfying the following conditions and collect them as the footnote line candidate set (an ordered set)

i. The line is aligned with the left boundary of the type area and its length is close to 144 (marked as an initial footnote line);

ii. The length of the line equals the width of the type area (marked as a continuation footnote line);

b) Select one line from the candidate set as the footnote line according to the following rules;

i. If the footnote line candidate set is empty, there is no footnote line;

ii. If the footnote text block set (smallobjs) is empty, take the first line of the candidate set as the footnote line;

iii. If neither the footnote line candidate set nor the footnote text block set is empty, check for conflicts in order to remove interference. The check is as follows (let the number of candidate footnote lines be candi_count):

1. Traverse the candidate footnote lines one by one and, taking each candidate line as a dividing line, split the footnote text blocks into an upper set and a lower set (above_obj, below_obj). The traversal stops once above_obj is empty. This gives n segmentation schemes in total;

2. If n is 1, there is only one valid segmentation, and the candidate footnote line corresponding to it is taken as the selected footnote line;

3. If n is greater than 1, the candidate footnote line corresponding to the last segmentation scheme is taken as the selected footnote line;
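The candidate search and the selection rules of step 4 can be sketched as below. Several points are assumptions of this sketch rather than statements of the source: the y coordinate grows downward, the tolerance tol decides what "close to 144" means, and the split that leaves above_obj empty is counted as the last segmentation scheme before the traversal stops. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HLine:
    x0: float
    x1: float
    y: float          # assumption: y grows downward, so larger y is lower on the page
    kind: str = ""    # "initial" or "continuation", set by find_candidates

@dataclass
class Block:
    text: str
    y: float

def find_candidates(lines: List[HLine], left: float, width: float,
                    start_len: float = 144.0, tol: float = 2.0) -> List[HLine]:
    """Collect candidate footnote lines, ordered from the page bottom upward."""
    candidates = []
    for ln in sorted(lines, key=lambda l: l.y, reverse=True):
        length = ln.x1 - ln.x0
        if abs(ln.x0 - left) <= tol and abs(length - start_len) <= tol:
            candidates.append(HLine(ln.x0, ln.x1, ln.y, "initial"))
        elif abs(length - width) <= tol:
            candidates.append(HLine(ln.x0, ln.x1, ln.y, "continuation"))
    return candidates

def select_footnote_line(candidates: List[HLine],
                         smallobjs: List[Block]) -> Optional[HLine]:
    if not candidates:
        return None                    # rule i: no footnote line
    if not smallobjs:
        return candidates[0]           # rule ii: first candidate from the bottom
    schemes = []                       # rule iii: one scheme per dividing line
    for cand in candidates:
        above_obj = [b for b in smallobjs if b.y < cand.y]
        schemes.append(cand)
        if not above_obj:              # every footnote block is below this line: stop
            break
    return schemes[-1]                 # n == 1 and n > 1 both pick the last scheme

lines = [HLine(72, 216, 700), HLine(72, 216, 400), HLine(100, 130, 650)]
cands = find_candidates(lines, left=72, width=451)
notes = [Block("1 note", 420), Block("2 note", 440)]
print(select_footnote_line(cands, notes))
```

In the usage example the short rule at y=700 sits below both footnote blocks, so it is passed over as interference and the line at y=400 is selected.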

5. Verify and confirm the footnote area, as follows:

a) If no footnote line is found, the page has no footnotes;

b) If a footnote line is found, take the area below the footnote line as the candidate footnote area and check it further;

i. Extract the font size distribution of the text in the candidate area and take the most frequent font size as the footnote font size;

ii. If the footnote text block set (smallobjs) is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, reject the footnote and return as if the page had no footnote;

iii. If the footnote text block set (smallobjs) is not empty, the current footnote line is an initial footnote line, and no text block in smallobjs lies below the current footnote line, reject the footnote and return as if the page had no footnote;

c) After these checks, the candidate area is confirmed as the footnote area.
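The checks of step 5 reduce to a small decision function. This sketch rests on several assumptions that the source leaves open: the threshold max_diff for a "considerable" font size difference, y growing downward, and treating a candidate area without any text as having no footnote. The function name and parameters are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def verify_footnote_area(line_kind: Optional[str], line_y: float,
                         region_font_sizes: List[float],
                         smallobj_ys: List[float],
                         prev_footnote_size: Optional[float],
                         max_diff: float = 0.5) -> bool:
    """Return True when the area below the footnote line is accepted."""
    if line_kind is None:
        return False                   # a) no footnote line, so no footnote
    if not region_font_sizes:
        return False                   # assumption: an empty area holds no footnote
    # b) i. the most frequent font size in the candidate area
    footnote_size = Counter(region_font_sizes).most_common(1)[0][0]
    if not smallobj_ys:
        # b) ii. a continuation line must carry on the previous page's footnote size
        if (line_kind == "continuation" and prev_footnote_size is not None
                and abs(footnote_size - prev_footnote_size) > max_diff):
            return False
    else:
        # b) iii. an initial line needs at least one footnote block below it
        if line_kind == "initial" and not any(y > line_y for y in smallobj_ys):
            return False
    return True                        # c) candidate area confirmed

# A continuation line whose area is set at body size, unlike the previous page's footnotes
print(verify_footnote_area("continuation", 400.0, [10.5, 10.5], [], 7.5))
```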

Claims

1. A footnote recognition and extraction method for a multi-page layout document, characterized by comprising the following steps: step one, parsing the layout document and obtaining, page by page, the page information of the layout document, the text in each page, and the path information; step two, preprocessing: identifying headers and footers and removing them from each page; counting the distribution of font sizes in each page and taking the most frequent font size as the body font size; collecting the coordinate information of the text lines in each page; step three, traversing page by page and extracting the footnote text blocks in each page; step four, extracting the footnote line of the page; and step five, verifying and confirming the footnote area.

2. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step three comprises the following specific steps: adding the text blocks that meet the conditions to the corresponding set; if the font size of a text block is smaller than the body font size and the text block is aligned with the left boundary of the type area, adding the text block to the set matched_objs; if the font size of a text block is smaller than the body font size and the text block is not aligned with the left boundary of the type area, adding the text block to the set inline_objs; if the font size of a text block does not meet the requirement but the text block is aligned with the left boundary of the type area, adding the text block to the set candi_lines; if inline_objs is not empty and matched_objs is empty, matching inline_objs and candi_lines pairwise; if the leading string of a text block in candi_lines, denoted line, is identical to a text block obj in inline_objs, adding line to matched_objs and removing obj from inline_objs, until inline_objs is empty or all of candi_lines has been compared; and taking matched_objs as the footnote text block set of the page, denoted smallobjs.

3. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step four comprises the following specific steps: starting from the bottom of the page, searching for horizontal lines that meet the following conditions as the footnote line candidate set; marking a line that is aligned with the left boundary of the type area and whose length is close to 144 as an initial footnote line, and marking a line whose length equals the width of the type area as a continuation footnote line; selecting one line from the footnote line candidate set as the footnote line according to the following rules: if the footnote line candidate set is empty, there is no footnote line; if the footnote text block set is empty, the first line of the footnote line candidate set is taken as the footnote line; and if neither the footnote line candidate set nor the footnote text block set is empty, checking for conflicts in order to remove interference; the check being as follows: letting the number of candidate footnote lines be candi_count, traversing the candidate footnote lines one by one, taking each candidate footnote line as a dividing line, splitting the footnote text blocks into an upper set above_obj and a lower set below_obj, and stopping the traversal once above_obj is empty, thereby obtaining n segmentation schemes in total; if n is 1, there is only one valid segmentation and the candidate footnote line corresponding to it is taken as the selected footnote line; and if n is greater than 1, the candidate footnote line corresponding to the last segmentation scheme is taken as the selected footnote line.

4. The footnote recognition and extraction method for a multi-page layout document according to claim 1, wherein step five comprises the following specific steps: if no footnote line is found, the page has no footnotes; if a footnote line is found, taking the area below the footnote line as the candidate footnote area and checking it further; extracting the font size distribution of the text in the candidate area and taking the most frequent font size as the footnote font size; if the footnote text block set smallobjs is empty, the current footnote line is a continuation footnote line, and the footnote font size differs considerably from the footnote font size of the previous page, rejecting the footnote and returning as if the page had no footnote; if the footnote text block set smallobjs is not empty, the current footnote line is an initial footnote line, and no text block in smallobjs lies below the current footnote line, rejecting the footnote and returning as if the page had no footnote; and after these checks, confirming the candidate area as the footnote area.