Document segmentation method and device based on text line
Technical Field
The invention relates to the field of text processing, in particular to a text line-based document segmentation method and device.
Background
As technology develops, more and more text processing is carried out automatically by machine. Existing document formats include PDF and PDF-like HTML documents, in which the text is stored as individual lines rather than being combined directly into paragraphs; the paragraph structure that a person perceives when reading is conveyed only by visual style. A feasible solution is proposed so that a computer can automatically merge the text lines of such documents into text paragraphs, which facilitates further processing of the text content in paragraph units.
In existing PDF and HTML text extraction, much of the text is output directly line by line without segmentation, or is segmented by looking for extra blank lines. The latter approach is not generally applicable and handles tables in the document poorly.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, a document segmentation method and a document segmentation device based on text lines are provided. Whether text line units are merged into the same paragraph is judged from a merging score calculated for each text line unit; when the score of a text line does not satisfy the merging condition, merging of the current paragraph ends and a new paragraph starts. The method solves the problems in the prior art simply and effectively.
The technical scheme adopted by the invention is as follows:
A document segmentation method based on text lines, characterized by comprising the following steps:
step 1: parsing a document whose format is composed of text lines, extracting the pages and the document data structures, and extracting text line information from the document data structure corresponding to each text line; traversing every document data structure of the full text that contains a text line, and calculating the full-text context and the page context information respectively from the text line information list formed by the text line information of the document data structures;
step 2: segmenting the text line units in each page according to a segmentation algorithm, based on the list of n text line unit structures in each page and in combination with the full-text context and the page context information.
Further, the step 2 specifically includes:
step 21: skipping text line units without content;
step 22: setting a merging score with an initial value of 0 for each text line unit, and merging the text line unit with the previous text line unit if the merging score is greater than 0 when the calculation ends;
step 23: setting a segment merging buffer merge_buffer, in which the text line units determined to be merged are temporarily stored until merging of the whole paragraph is determined to be finished, and storing the font formats of these text lines in a set;
step 24: when the merging score of a new text line is detected to be less than or equal to 0, ending the merging of the current paragraph, clearing the segment merging buffer and starting a new paragraph.
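The buffering logic of steps 21 to 24 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the dictionary record with "text" and "font" keys, the function name segment_page, the pluggable score_fn argument, the toy scorer and the choice to join merged lines with a space are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set

Line = Dict[str, str]  # hypothetical minimal record: {"text": ..., "font": ...}


def segment_page(
    lines: List[Line],
    score_fn: Callable[[Line, List[Line], Set[str]], float],
) -> List[str]:
    """Group the text line units of one page into paragraphs (steps 21-24)."""
    paragraphs: List[str] = []
    merge_buffer: List[Line] = []   # lines of the paragraph being built
    font_set: Set[str] = set()      # font formats seen in merge_buffer

    def flush() -> None:
        # Emit the buffered paragraph (if any) and reset buffer and font set.
        if merge_buffer:
            paragraphs.append(" ".join(u["text"].strip() for u in merge_buffer))
            merge_buffer.clear()
            font_set.clear()

    for line in lines:
        if not line["text"].strip():          # step 21: skip units without content
            continue
        # step 22: the score starts at 0; with an empty buffer there is
        # nothing to merge with, so the line simply opens a paragraph.
        score = score_fn(line, merge_buffer, font_set) if merge_buffer else 0
        if score <= 0:                        # step 24: close the current paragraph
            flush()
        merge_buffer.append(line)             # step 23: buffer the line and its font
        font_set.add(line["font"])
    flush()
    return paragraphs


# Toy scorer (an assumption for the demo): merge only when the font was seen before.
def toy_score(line: Line, buffer: List[Line], fonts: Set[str]) -> float:
    return 5 if line["font"] in fonts else -1


demo_lines = [
    {"text": "A", "font": "f1"},
    {"text": "   ", "font": "f1"},
    {"text": "B", "font": "f1"},
    {"text": "C", "font": "f2"},
]
paragraphs = segment_page(demo_lines, toy_score)
```

Passing the scorer in as a parameter keeps the buffering loop independent of how the merging score is computed; for text without word spacing, the join separator would be an empty string instead.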
Further, the merging score calculating process is:
step 31: if the font format of the current line exists in the font format set of the segment merging buffer, increasing the merging score by 5;
step 32: judging whether two adjacent text units are in the same line according to curr_line.gap_top, the line spacing of the current line relative to the previous line: if
-1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8)
then the two adjacent text units are in the same line, curr_line.merge_score += 10, go to step 32; otherwise, the two adjacent text units are not in the same line, and step 33 is executed;
step 33: detecting the line spacing between the texts: if |curr_line.gap_top - prev_line.gap_top| > gap_top_std, the line spacing of the text line unit is too large, which indicates that the two lines are unlikely to belong to the same paragraph, curr_line.merge_score -= 10, and step 34 and step 35 are executed; otherwise, the value of curr_line.merge_score is unchanged;
step 34: traversing backwards through merge_buffer to find the front-most text line unit prev_line_start, and judging whether the left indents of the two lines are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
step 35: a word count check, based on the fact that within one paragraph the next line does not contain significantly more words than the previous line: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged.
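The score calculation of steps 31 to 35 can be sketched in Python as follows. The thresholds and increments are taken from the steps above; the dataclass names Line and Context, the function name merge_score and the sample numbers are assumptions. The control flow is one possible reading of the text: a same-line hit returns immediately, and the indentation and word count checks are applied to every unit that is not on the same line.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Line:
    gap_top: float    # line spacing relative to the previous line
    gap_left: float   # indentation difference relative to the previous line
    line_len: int     # word count of the line
    line_font: str    # font format of the line


@dataclass
class Context:
    gap_top_avg: float
    gap_top_std: float
    gap_left_avg: float
    gap_left_std: float
    len_std: float


def merge_score(curr: Line, prev: Line, prev_line_start: Line,
                font_set: Set[str], ctx: Context) -> int:
    score = 0
    # step 31: the font format was already seen in the current paragraph
    if curr.line_font in font_set:
        score += 5
    # step 32: units on the same visual line (e.g. table cells)
    if -1 <= curr.gap_top < min(ctx.gap_top_avg - ctx.gap_top_std, 8):
        return score + 10
    # step 33: line spacing changed by more than one standard deviation
    if abs(curr.gap_top - prev.gap_top) > ctx.gap_top_std:
        score -= 10
    # step 34: left indentation compared with the first line of the paragraph
    diff = prev_line_start.gap_left - curr.gap_left
    if -5 <= diff < ctx.gap_left_avg - ctx.gap_left_std:
        score += 1
    else:
        score -= 1
    # step 35: a following line should not be much longer than the previous one
    if curr.line_len - prev.line_len > 2 * ctx.len_std:
        score -= 5
    return score


# Illustrative numbers only (assumed, not from the specification).
ctx = Context(gap_top_avg=20, gap_top_std=5, gap_left_avg=10, gap_left_std=4, len_std=6)
prev = Line(gap_top=18, gap_left=0, line_len=40, line_font="f1")
same_row = merge_score(Line(0, 50, 5, "f1"), prev, prev, {"f1"}, ctx)
next_row = merge_score(Line(18, 0, 38, "f1"), prev, prev, {"f1"}, ctx)
new_block = merge_score(Line(40, 0, 60, "f2"), prev, prev, {"f1"}, ctx)
```

With these sample values the first two lines score above 0 and would be merged, while the third scores below 0 and would start a new paragraph.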
A text line-based document segmentation apparatus comprising:
the text line information acquisition module: extracting a page and a document data structure, and extracting the text line information from the document data structure corresponding to each text line; traversing each document data structure containing text lines of the full text, and respectively calculating full text context and page context information according to a text line information list formed by the text line information of the document data structures;
a segmentation and combination module: segmenting the text line units in each page according to a segmentation algorithm, based on the list of n text line unit structures in each page acquired by the text line information acquisition module and in combination with the context information.
The segmentation and combination module specifically comprises:
a no content unit processing module: skipping over text line units without content;
text line unit merging module: first, a segment merging buffer is set, in which the text line units determined to be merged are temporarily stored until merging of the whole paragraph is determined to be finished, and the font formats of these text lines are stored in a set. Then a merging score with an initial value of 0 is set for each text line unit; if the merging score of a new text line is greater than 0 when the calculation ends, the text line unit is merged with the previous text line unit and put into the segment merging buffer; if the merging score of the new text line is detected to be less than or equal to 0, the merging of the current paragraph ends, the segment merging buffer is cleared and a new paragraph starts.
Further, the merging score calculating process is:
the font format judging module: if curr_line.line_font exists in the font set font_set, then curr_line.merge_score += 5;
the same-line text line unit judgment module: judging whether two adjacent text units are in the same line through curr_line.gap_top: if
-1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8),
then the two adjacent text units are in the same line, curr_line.merge_score += 10, and the text line unit merging module is executed; otherwise, the two adjacent text units are not in the same line, and the same paragraph text line unit judgment module is executed;
the same paragraph text line unit judgment module: detecting the line spacing between the texts: if |curr_line.gap_top - prev_line.gap_top| > gap_top_std, the line spacing of the text line unit is too large, which indicates that the two lines are probably not in the same paragraph, curr_line.merge_score -= 10, and the text line unit indentation judging module and the word count checking module are executed; otherwise, the value of curr_line.merge_score is unchanged;
the text line unit indentation judging module: traversing backwards through merge_buffer to find the front-most text line unit prev_line_start, and judging whether the left indents of the two text units are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
the word count checking module: a word count check, based on the fact that within one paragraph the next line does not contain significantly more words than the previous line: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. Compared with segmenting directly on blank lines, the method is more accurate and more adaptable.
2. Compared with manual segmentation, performance is greatly improved, and the correctness is sufficient for most practical requirements.
3. Besides ordinary paragraphs, document structures such as lists and tables are also well supported.
4. The method and the device take text density, appearance style, text content, context and other information into account together when segmenting a document composed of text lines, and the segmentation accuracy can reach more than 80%.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
The terms used in the invention are explained as follows:
1. no content refers to text line units that contain only spaces, carriage returns or other non-content characters;
2. curr_line.gap_top refers to the line spacing between the current text line unit and the adjacent text line unit;
3. prev_line.gap_top refers to the line spacing between the previous text line unit and the adjacent text line unit;
4. prev_line_start refers to the front-most text line unit in merge_buffer;
5. gap_left of a previous text line unit refers to the indentation difference between that unit and the adjacent text line;
6. curr_line.gap_left refers to the indentation difference between the current text line unit and the adjacent text line;
7. curr_line.line_len refers to the word count of the current text line unit;
8. prev_line.line_len refers to the word count of the previous text line unit;
9. curr_line.line_font refers to the font format of the current text line unit;
10. curr_line.merge_score refers to the score for merging the current text line unit with the preceding one.
The invention is implemented as follows:
1. reading a PDF document and processing it with a PDF-to-HTML conversion tool; parsing the HTML document, extracting the pages and the document data structures, and extracting text line information from the document data structure corresponding to each text line. The values are generally obtained from the attributes of a p or span tag: for example, top corresponds to line_off_top, left corresponds to line_off_left, and the font style uses the class attribute of the tag. The text line information extracted from the document data structure corresponding to each text line comprises:
a) the word count line_len;
b) the offset line_off_left from the left side of the page;
c) the offset line_off_top from the top of the page;
d) the line spacing gap_top relative to the previous line, i.e. the line_off_top of the current line minus that of the previous line;
e) the indentation difference gap_left relative to the previous line, i.e. the line_off_left of the current line minus that of the previous line;
f) the font format line_font;
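The extraction of fields a) to f) can be sketched with the Python standard library as follows. The sketch assumes that each text line is a p tag whose style attribute carries pixel values for top and left and whose class attribute names the font style; actual converter output may differ. Counting characters for line_len and setting both gaps of the first line to 0 are further assumptions, and the class name LineExtractor is hypothetical.

```python
import re
from html.parser import HTMLParser

# Match "top:123" / "left:45" but not "margin-top:..." (assumed inline-style layout).
TOP_RE = re.compile(r"(?<![\w-])top:\s*(-?\d+)")
LEFT_RE = re.compile(r"(?<![\w-])left:\s*(-?\d+)")


class LineExtractor(HTMLParser):
    """Collect one record per positioned <p> tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        attr = dict(attrs)
        style = attr.get("style") or ""
        top, left = TOP_RE.search(style), LEFT_RE.search(style)
        if top and left:
            self._current = {
                "line_off_top": int(top.group(1)),
                "line_off_left": int(left.group(1)),
                "line_font": attr.get("class") or "",
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "p" or self._current is None:
            return
        line, prev = self._current, (self.lines[-1] if self.lines else None)
        line["line_len"] = len(line["text"].strip())  # character count (assumption)
        # gap_* = current offset minus previous offset; 0 for the first line (assumption)
        line["gap_top"] = line["line_off_top"] - prev["line_off_top"] if prev else 0
        line["gap_left"] = line["line_off_left"] - prev["line_off_left"] if prev else 0
        self.lines.append(line)
        self._current = None


demo_html = (
    '<p style="position:absolute;top:100px;left:50px" class="ft1">Hello world</p>'
    '<p style="position:absolute;top:118px;left:30px" class="ft1">next</p>'
)
extractor = LineExtractor()
extractor.feed(demo_html)
```

After feeding the sample markup, extractor.lines holds one dictionary per text line with the six fields listed above plus the raw text.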
2. traversing every text line data structure of the full text that contains a text line unit, and computing the context information of the full text and of each page from the text line unit structure list formed by the text line information of the document data structures, comprising:
a) full-text context: traversing the text line units of the full text and calculating, respectively:
i. offset from the left side of the page
A. average off_left_avg
B. standard deviation off_left_std
ii. offset from the top of the page
A. average off_top_avg
B. standard deviation off_top_std
iii. word count of the text lines
A. average len_avg
B. standard deviation len_std
iv. difference in word count between two adjacent lines
A. average len_diff_avg
B. standard deviation len_diff_std
v. line spacing between adjacent text lines, i.e. gap_top
A. average gap_top_avg
B. standard deviation gap_top_std
vi. indentation difference between adjacent text lines, i.e. gap_left
A. average gap_left_avg
B. standard deviation gap_left_std
b) page context:
i. page width page_width
ii. page height page_height
iii. list page_line_list of the text line unit structures in the page
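The full-text statistics listed in a) can be computed with the Python statistics module, as in the sketch below. The text does not say whether the standard deviation is the population or the sample form; the population form (pstdev) is used here as an assumption, as are the signed word count difference, the dictionary-based line records and the function name full_text_context.

```python
from statistics import mean, pstdev


def full_text_context(lines):
    """Return the *_avg / *_std values for a list of text line records."""

    def avg_std(values):
        # Empty input yields zeros so that callers never divide or compare on None.
        return (mean(values), pstdev(values)) if values else (0.0, 0.0)

    # context name prefix -> field of the line record
    fields = {
        "off_left": "line_off_left",
        "off_top": "line_off_top",
        "len": "line_len",
        "gap_top": "gap_top",
        "gap_left": "gap_left",
    }
    ctx = {}
    for name, key in fields.items():
        ctx[name + "_avg"], ctx[name + "_std"] = avg_std([ln[key] for ln in lines])

    # Word count difference between each pair of adjacent lines (signed, assumption).
    diffs = [b["line_len"] - a["line_len"] for a, b in zip(lines, lines[1:])]
    ctx["len_diff_avg"], ctx["len_diff_std"] = avg_std(diffs)
    return ctx


demo_lines = [
    {"line_off_left": 50, "line_off_top": 100, "line_len": 10, "gap_top": 0, "gap_left": 0},
    {"line_off_left": 50, "line_off_top": 118, "line_len": 20, "gap_top": 18, "gap_left": 0},
    {"line_off_left": 50, "line_off_top": 136, "line_len": 30, "gap_top": 18, "gap_left": 0},
]
ctx = full_text_context(demo_lines)
```

If the sample standard deviation is preferred, pstdev can be replaced with stdev from the same module, which requires at least two values.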
3. For each page, segmentation is executed on the text line unit structure list of the page, in combination with the page context information, according to the following algorithm:
a) skipping text line units without content, which is equivalent to directly enlarging the line spacing between the two content-bearing text lines on either side, thereby improving merging accuracy;
b) setting a segment merging score merge_score with an initial value of 0 for each text line unit; when the calculation ends, if merge_score is greater than 0, the text line unit is merged with the previous text line;
c) setting a segment merging buffer merge_buffer, in which the text line units determined to be merged are temporarily stored until merging of the whole paragraph is determined to be finished, and storing the font formats of these text lines in a set font_set;
d) when the merge_score of a new text line is detected to be less than or equal to 0, ending the merging of the current paragraph, clearing the segment merging buffer and starting a new paragraph;
e) the score is calculated by combining the context information with the text line unit information in the segment merging buffer; with the current text line unit denoted curr_line and the previous text line unit denoted prev_line, the merging score is calculated as follows:
step e1): judging whether two adjacent text units are in the same line through curr_line.gap_top: if -1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8), then the two adjacent text units are in the same line, curr_line.merge_score += 10, and step e2 is executed; this judgment is added because, in layouts such as tables, what is visually one line may actually be composed of several short text units; otherwise, the two adjacent text units are not in the same line, and step e3 is executed;
step e2): if curr_line.line_font is in font_set, then curr_line.merge_score += 5; this indicates that the format of the new line is consistent with the preceding lines and no style switch has occurred, so the new line is more likely to belong to text of the same format;
step e3): if the adjacent text units are not in the same line, the line spacing between the texts is detected; an excessive line spacing indicates that the two lines are probably not in the same paragraph: if
|curr_line.gap_top - prev_line.gap_top| > gap_top_std, then curr_line.merge_score -= 10, and step e4 and step e5 are executed;
step e4): traversing backwards through merge_buffer to find the front-most text line unit prev_line_start, and judging whether the left indents of the two lines are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
step e5): a word count check, based on the fact that within one paragraph the next line does not contain significantly more words than the previous line: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise, the value of curr_line.merge_score is unchanged.
Before segmentation, content unrelated to the body text, such as headers and footers, can be cleaned. The cleaning relies mainly on the following characteristics of header and footer content: it lies within the top or bottom 15% of a page, which is detected from the ratio between line_top and page_height; it appears repeatedly on different pages; and its style is usually clearly different from that of the body text.
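A minimal Python sketch of such cleaning is given below. It applies only two of the cues named above, position within the outer 15% of the page and repetition across pages, and leaves out the style comparison. The page record layout, the function name strip_headers_footers, the min_pages parameter and the exact-text matching rule are assumptions; page numbers that change from page to page would need a looser match.

```python
from collections import Counter


def strip_headers_footers(pages, margin=0.15, min_pages=2):
    """Drop lines in the top/bottom margin whose text repeats on several pages.

    pages: list of {"page_height": number,
                    "lines": [{"line_off_top": number, "text": str}, ...]}
    """

    def in_margin(line, page):
        ratio = line["line_off_top"] / page["page_height"]
        return ratio <= margin or ratio >= 1 - margin

    # Count on how many pages each margin text occurs (once per page).
    counts = Counter()
    for page in pages:
        counts.update({ln["text"].strip() for ln in page["lines"] if in_margin(ln, page)})
    repeated = {text for text, n in counts.items() if n >= min_pages}

    return [
        [ln for ln in page["lines"]
         if not (in_margin(ln, page) and ln["text"].strip() in repeated)]
        for page in pages
    ]


demo_pages = [
    {"page_height": 1000, "lines": [
        {"line_off_top": 50, "text": "Annual Report"},
        {"line_off_top": 500, "text": "Body one"},
    ]},
    {"page_height": 1000, "lines": [
        {"line_off_top": 60, "text": "Annual Report"},
        {"line_off_top": 500, "text": "Annual Report"},
        {"line_off_top": 600, "text": "Body two"},
    ]},
]
cleaned = strip_headers_footers(demo_pages)
```

In the sample, the repeated title is removed only where it sits in the page margin; the same words in the middle of the second page are kept as body text.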
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.