CN107391457B - Document segmentation method and device based on text line - Google Patents

Document segmentation method and device based on text line

Info

Publication number
CN107391457B
Authority
CN
China
Prior art keywords
line
text
gap
text line
merge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710616443.7A
Other languages
Chinese (zh)
Other versions
CN107391457A (en)
Inventor
林康
罗鹰
张鑫阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kelai Network Technology Co.,Ltd.
Original Assignee
Colasoft Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Colasoft Co ltd
Priority to CN201710616443.7A
Publication of CN107391457A
Application granted
Publication of CN107391457B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Abstract

The invention relates to the field of text processing and, in view of the problems in the prior art, provides a document segmentation method and device based on text lines. A merging score is calculated for each text line unit to decide whether it is merged into the same paragraph as the preceding units; when the score of a text line does not satisfy the merging condition, the merging of the current paragraph ends and a new paragraph begins. The method solves the problems in the prior art simply and effectively. The method comprises: extracting the page and document data structures, and extracting text line information from the document data structure corresponding to each text line; traversing every document data structure containing text lines in the full text, and computing the full-text context and page context information from the list formed by the text line information of those data structures; and segmenting the text line units of each page according to a segmentation algorithm, using the list of n text line unit structures in the page together with the other text line information.

Description

Document segmentation method and device based on text line
Technical Field
The invention relates to the field of text processing, in particular to a text line-based document segmentation method and device.
Background
As technology develops, more and more text processing is performed automatically by machines. Existing document formats such as PDF, and PDF-like HTML documents, store text as individual lines rather than as paragraphs; the paragraph structure that a reader perceives is conveyed only by visual styling. A feasible scheme is therefore proposed that lets a computer automatically merge the text lines of such documents into paragraphs, so that the text content can subsequently be processed paragraph by paragraph.
Existing PDF and HTML text extraction mostly outputs text line by line without segmentation, or segments only where extra empty lines are found. Such empty lines are not common, and this approach handles tables in the document poorly.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, a document segmentation method and a document segmentation device based on text lines are provided. A merging score is calculated for each text line unit to decide whether it is merged into the same paragraph as the preceding units; when the score of a text line does not satisfy the merging condition, the merging of the current paragraph ends and a new paragraph begins. The method solves the problems in the prior art simply and effectively.
The technical scheme adopted by the invention is as follows:
a document segmentation method based on text lines is characterized by comprising the following steps:
step 1: analyzing data of a document format formed by text lines, extracting pages and document data structures, and extracting the text line information from the document data structure corresponding to each text line; traversing each document data structure containing text lines of the full text, and respectively calculating full text context and page context information according to a text line information list formed by the text line information of the document data structures;
step 2: and segmenting the text line units in each page according to a segmentation algorithm by combining the full text context and the page context information according to the n text line unit structure lists in each page.
Further, the step 2 specifically includes:
step 21: skipping over text line units without content;
step 22: setting, for each text line unit, a merging score with an initial value of 0, and merging the text line unit with the previous text line unit if the merging score is greater than 0 when the calculation ends;
step 23: setting a segment merging buffer merge_buffer, temporarily storing in the buffer the text line units determined to be merged until the merging of the whole segment is determined to be finished, and storing the font formats of those text lines in a set;
step 24: when the merging score of a new text line is detected to be less than or equal to 0, ending the merging of the current paragraph, clearing the segment merging buffer, and starting the segmentation of a new paragraph.
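Steps 21 to 24 can be read as a single pass over the text line units of a page. The sketch below is one possible reading, not the implementation described by the patent: the names `Line`, `segment` and `toy_score` are hypothetical, and the scoring callback is a stand-in that only looks at the font format.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Line:
    text: str
    font: str  # font format of the line (hypothetical field name)


# Hypothetical callback type: (merge_buffer, font_set, current line) -> score
ScoreFn = Callable[[List[Line], Set[str], Line], int]


def segment(lines: List[Line], score_fn: ScoreFn) -> List[List[Line]]:
    """Group text line units into paragraphs following steps 21-24."""
    paragraphs: List[List[Line]] = []
    merge_buffer: List[Line] = []   # step 23: units of the paragraph being built
    font_set: Set[str] = set()      # step 23: font formats seen in the buffer
    for line in lines:
        if not line.text.strip():   # step 21: skip units without content
            continue
        # step 22: the score starts at 0; score_fn adjusts it
        merge_score = score_fn(merge_buffer, font_set, line) if merge_buffer else 0
        if merge_buffer and merge_score <= 0:
            # step 24: close the current paragraph and clear the buffer
            paragraphs.append(merge_buffer)
            merge_buffer, font_set = [], set()
        merge_buffer.append(line)
        font_set.add(line.font)
    if merge_buffer:
        paragraphs.append(merge_buffer)
    return paragraphs


def toy_score(buffer: List[Line], fonts: Set[str], line: Line) -> int:
    """Stand-in score: +5 when the font format was already seen, else 0."""
    return 5 if line.font in fonts else 0
```

Passing the score function in as a parameter keeps the buffering logic separate from the scoring rules, which makes each part easier to test on its own.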
Further, the merging score calculating process is:
step 31: if the font format of the current line exists in the font format set of the segment merging buffer, increasing the segment merging score by 5;
step 32: judging, from the line spacing curr_line.gap_top of the current line, whether two adjacent text units are in the same line: if
-1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8)
then the two adjacent text units are in the same line, curr_line.merge_score += 10, go to step 32; otherwise, the two adjacent text units are not in the same line, and step 33 is executed;
step 33: detecting the line spacing between texts: if |curr_line.gap_top - prev_line.gap_top| > gap_top_std, the line spacing of the text line unit is too large, which indicates that the two lines probably do not belong to the same paragraph, so curr_line.merge_score -= 10, and step 34 and step 35 are executed; otherwise, the value of curr_line.merge_score is unchanged;
step 34: traversing merge_buffer backwards from the previous line to find the front-most text line unit prev_line_start, and judging whether the left indents of the two lines are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
step 35: a word count check, based on the fact that, within a paragraph, a line does not contain significantly more words than the line before it: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged.
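The scoring rules of steps 31 to 35 can be sketched as one function. This is an illustration under stated assumptions: the class names `TextLine` and `Stats` and the function name `merge_score` are hypothetical, a unit judged to be in the same line returns immediately after the +10, and steps 34 and 35 are applied whether or not the line-spacing penalty of step 33 fires. The text leaves that control flow open to more than one reading.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class TextLine:
    gap_top: float    # line spacing relative to the previous unit
    gap_left: float   # indentation difference relative to the previous unit
    line_len: int     # word count of the unit
    line_font: str    # font format of the unit


@dataclass
class Stats:
    gap_top_avg: float
    gap_top_std: float
    gap_left_avg: float
    gap_left_std: float
    len_std: float


def merge_score(curr: TextLine, prev: TextLine, prev_line_start: TextLine,
                font_set: Set[str], s: Stats) -> int:
    score = 0
    # step 31: font format already present in the merge buffer
    if curr.line_font in font_set:
        score += 5
    # step 32: same visual line (e.g. table cells split into short units)
    if -1 <= curr.gap_top < min(s.gap_top_avg - s.gap_top_std, 8):
        return score + 10  # assumption: no further checks in this case
    # step 33: line spacing changed by more than one standard deviation
    if abs(curr.gap_top - prev.gap_top) > s.gap_top_std:
        score -= 10
    # step 34: left indentation compared with the first unit of the paragraph
    if -5 <= prev_line_start.gap_left - curr.gap_left < s.gap_left_avg - s.gap_left_std:
        score += 1
    else:
        score -= 1
    # step 35: the current line is much longer than the previous one
    if curr.line_len - prev.line_len > 2 * s.len_std:
        score -= 5
    return score
```

With made-up statistics (gap_top_avg 20, gap_top_std 4, gap_left_avg 6, gap_left_std 2, len_std 5), a continuation line with the same font and unchanged spacing scores 5 + 1 = 6 and is merged, while a longer line in a new font after a doubled gap scores -10 + 1 - 5 = -14 and starts a new paragraph.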
A text line-based document segmentation apparatus comprising:
the text line information acquisition module: extracting a page and a document data structure, and extracting the text line information from the document data structure corresponding to each text line; traversing each document data structure containing text lines of the full text, and respectively calculating full text context and page context information according to a text line information list formed by the text line information of the document data structures;
a segmentation and combination module: and segmenting the text line units in each page according to a segmentation algorithm by combining the context information according to the n text line unit structure lists in each page acquired by the text line information acquisition module.
The segmentation and combination module specifically comprises:
a no content unit processing module: skipping over text line units without content;
text line unit merging module: first, a segment merging buffer is set; until the merging of the whole segment is determined to be finished, the text line units determined to be merged are temporarily stored in the buffer, and the font formats of those text lines are stored in a set. Then a segment merging score with an initial value of 0 is set for each text line unit; if the merging score of the new text line is greater than 0 when the calculation ends, the text line unit is merged with the previous text line unit and put into the segment merging buffer; if the merging score of the new text line is detected to be less than or equal to 0, the merging of the current paragraph ends, the segment merging buffer is cleared, and the segmentation of a new paragraph starts.
Further, the merging score calculating process is:
the font format judging module: if curr_line.line_font exists in the set font_set, then curr_line.merge_score += 5;
the same-line text line unit judgment module: judging whether two adjacent text units are in the same line through curr_line.gap_top: if
-1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8),
then the two adjacent text units are in the same line, curr_line.merge_score += 10, and the text line unit merging module in the same line is executed; otherwise, the two adjacent text units are not in the same line, and the same paragraph text line unit judgment module is executed;
the same paragraph text line unit judgment module: detecting the line spacing between texts: if |curr_line.gap_top - prev_line.gap_top| > gap_top_std, the line spacing of the text line unit is too large, which indicates that the two lines are probably not in the same paragraph, so curr_line.merge_score -= 10, and the text line unit indentation judging module and the font checking module are executed; otherwise, the value of curr_line.merge_score is unchanged;
text line unit indentation judging module: traversing merge_buffer backwards from the previous line to find the front-most text block prev_line_start, and judging whether the left indents of the two text units are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
the font checking module: a word count check, based on the fact that, within a paragraph, a line does not contain significantly more words than the line before it: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. compared with segmenting directly on empty lines, the method is more accurate and more adaptable;
2. compared with manual segmentation, processing speed is greatly improved, and the accuracy is sufficient for most practical needs.
3. Besides general paragraphs, document structures such as lists and tables can be well supported.
4. The method and the device simultaneously consider information such as text density, appearance style, text content, context and the like to perform segmentation processing on the document consisting of the text lines, and the accuracy of segmentation can reach more than 80%.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
The terms used in the invention are explained as follows:
1. "no content" refers to text line units that contain only spaces, carriage returns, or other non-content characters;
2. curr_line.gap_top refers to the line spacing between the current text line unit and the adjacent text line unit;
3. prev_line.gap_top refers to the line spacing between the previous text line unit and the adjacent text line unit;
4. prev_line_start refers to the front-most text line unit in merge_buffer;
5. prev_line_start.gap_left refers to the indentation difference between the previous text line unit and the adjacent text line;
6. curr_line.gap_left refers to the indentation difference between the current text line unit and the adjacent text line;
7. curr_line.line_len refers to the word count of the current text line unit;
8. prev_line.line_len refers to the word count of the previous text line unit;
9. line_font refers to the font format of the current text line unit;
10. merge_score refers to the merging score of the current text line unit.
The invention has the following implementation process:
1. reading a PDF document and processing it with a PDF-to-HTML conversion tool; parsing the HTML document, extracting the page and document data structures, and extracting text line information from the document data structure corresponding to each text line. Generally, the values are obtained from the attributes of the p or span tags: for example, top corresponds to line_off_top, left corresponds to line_off_left, and the font style is taken from the class attribute of the tag. The text line information extracted for each text line comprises:
a) the word count line_len;
b) the offset line_off_left from the left side of the page;
c) the offset line_off_top from the top of the page;
d) the line spacing gap_top relative to the previous line, i.e. the line_off_top of the current line minus the line_off_top of the previous line;
e) the indentation difference gap_left relative to the previous line, i.e. the line_off_left of the current line minus the line_off_left of the previous line;
f) the font format line_font;
2. traversing every text line data structure containing text line units in the full text, and computing the context information of the full text and of each page from the text line unit structure list formed by the text line information of the document data structures, comprising:
a) full-text context: traversing the text line units of the full text and computing:
i. the offset from the left side of the page
A. Average off_left_avg
B. Standard deviation off_left_std
ii. the offset from the top of the page
A. Average off_top_avg
B. Standard deviation off_top_std
iii. the word count of the text lines
A. Average len_avg
B. Standard deviation len_std
iv. the difference in word count between two adjacent lines
A. Average len_diff_avg
B. Standard deviation len_diff_std
v. the line spacing between adjacent text lines, i.e. gap_top
A. Average gap_top_avg
B. Standard deviation gap_top_std
vi. the indentation difference between adjacent text lines, i.e. gap_left
A. Average gap_left_avg
B. Standard deviation gap_left_std
b) page context:
i. page width page_width
ii. page height page_height
iii. list page_line_list of the text line unit structures in the page
3. For each page, according to the text line unit structure list in the page and the page context information, the segmentation is executed according to the following algorithm:
a) skipping over a text line unit without content, which is equivalent to directly enlarging the line spacing between two text lines with content, thereby improving the merging accuracy;
b) setting, for each text line unit, a segment merging score merge_score with an initial value of 0; when the calculation ends, if merge_score is greater than 0, merging the text line unit with the previous text line;
c) setting a segment merging buffer merge_buffer, temporarily storing in the buffer the text line units determined to be merged until the merging of the whole segment is determined to be finished, and storing the font formats of those text lines in a set font_set;
d) when the merge_score of a new text line is detected to be less than or equal to 0, ending the merging of the current paragraph, clearing the segment merging buffer, and starting a new paragraph.
e) The score is calculated from the context information and the text line unit information in the segment merging buffer. With the current text line unit denoted curr_line and the previous text line unit denoted prev_line, the merging score is calculated as follows:
step e1): judging from curr_line.gap_top whether two adjacent text units are in the same line: if -1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8), the two adjacent text units are in the same line and curr_line.merge_score += 10 (this check is added because, in styles such as tables, what is visually a single line may actually consist of several short text units), and step e2 is executed; otherwise, the two adjacent text units are not in the same line, and step e3 is executed;
step e2): if curr_line.line_font is in font_set, then curr_line.merge_score += 5; this indicates that the format of the new line is consistent with the preceding lines and no style switch has occurred, so the new line is more likely to belong to the same run of identically formatted text;
step e3): if the adjacent text units are not in the same line, detecting the line spacing between the texts; a line spacing that is too large indicates that the two lines are probably not in the same paragraph: if
|curr_line.gap_top - prev_line.gap_top| > gap_top_std, then curr_line.merge_score -= 10, and step e4 and step e5 are executed;
step e4): traversing merge_buffer backwards from the previous line to find the front-most text block prev_line_start, and judging whether the left indents of the two lines are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
step e5): a word count check, based on the fact that, within a paragraph, a line does not contain significantly more words than the line before it: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged.
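Items 1 and 2 of the implementation process, the per-line information and the full-text statistics, can be sketched as follows. This is a minimal illustration with several assumptions not stated in the text: the input is already a list of (text, left, top, font class) tuples in reading order rather than parsed HTML, the word count is taken as the character count, the first line gets a gap of 0, the word count difference is signed, and the population standard deviation is used. The names `TextLine`, `build_lines`, `summarize` and `full_text_context` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


@dataclass
class TextLine:
    text: str
    line_len: int         # a) word count (here: number of characters)
    line_off_left: float  # b) offset from the left side of the page
    line_off_top: float   # c) offset from the top of the page
    gap_top: float        # d) line_off_top minus that of the previous line
    gap_left: float       # e) line_off_left minus that of the previous line
    line_font: str        # f) font format (e.g. the tag's class attribute)


def build_lines(raw: Sequence[Tuple[str, float, float, str]]) -> List[TextLine]:
    """Derive gap_top and gap_left from consecutive offsets."""
    lines: List[TextLine] = []
    prev_left = prev_top = None
    for text, left, top, font in raw:
        # Assumption: the first line has no predecessor, so its gaps are 0.
        gap_top = 0.0 if prev_top is None else top - prev_top
        gap_left = 0.0 if prev_left is None else left - prev_left
        lines.append(TextLine(text, len(text), left, top, gap_top, gap_left, font))
        prev_left, prev_top = left, top
    return lines


def summarize(name: str, values: Sequence[float]) -> Dict[str, float]:
    """Average and (population) standard deviation under the names used above."""
    return {f"{name}_avg": mean(values), f"{name}_std": pstdev(values)}


def full_text_context(lines: Sequence[TextLine]) -> Dict[str, float]:
    """Compute the six pairs of full-text statistics; expects a non-empty list."""
    lens = [l.line_len for l in lines]
    len_diffs = [b - a for a, b in zip(lens, lens[1:])] or [0]
    ctx: Dict[str, float] = {}
    for name, values in (
        ("off_left", [l.line_off_left for l in lines]),
        ("off_top", [l.line_off_top for l in lines]),
        ("len", lens),
        ("len_diff", len_diffs),
        ("gap_top", [l.gap_top for l in lines]),    # includes the first line's 0
        ("gap_left", [l.gap_left for l in lines]),  # includes the first line's 0
    ):
        ctx.update(summarize(name, values))
    return ctx
```

Whether the first line's placeholder gap should be excluded from the statistics, and whether a sample rather than population standard deviation is intended, is not specified in the text; either choice shifts the thresholds used by the score.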
Before segmentation, content unrelated to the body text, such as headers and footers, can be cleaned. The cleaning relies on three characteristics of headers and footers: they lie within the top or bottom 15% of the page, which is detected from the ratio between line_top and page_height; they appear repeatedly on different pages; and their style usually differs markedly from that of the body text.
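The header and footer cleaning can be sketched as a filter over pages. The sketch assumes that line_top denotes each line's offset from the top of the page, that "repeated" means the exact same text appears in a margin on at least two pages, and it leaves out the style comparison. The name `strip_headers_footers` and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Line = Tuple[str, float]  # (text, offset from the top of the page)


def strip_headers_footers(pages: List[List[Line]], page_height: float,
                          margin: float = 0.15, min_pages: int = 2) -> List[List[Line]]:
    """Drop lines in the top/bottom margin whose text repeats across pages."""

    def in_margin(top: float) -> bool:
        ratio = top / page_height
        return ratio <= margin or ratio >= 1 - margin

    # Count on how many pages each margin text occurs (once per page).
    counts: Counter = Counter()
    for page in pages:
        counts.update({text for text, top in page if in_margin(top)})

    return [
        [(text, top) for text, top in page
         if not (in_margin(top) and counts[text] >= min_pages)]
        for page in pages
    ]
```

Exact text matching does not catch footers that change from page to page, such as page numbers; handling those would need a looser comparison, which the text does not describe.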
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.

Claims (2)

1. A document segmentation method based on text lines is characterized by comprising the following steps:
step 1: analyzing data of a document format formed by text lines, extracting pages and document data structures, and extracting the text line information from the document data structure corresponding to each text line; traversing each document data structure containing text lines of the full text, and respectively calculating full text context and page context information according to a text line information list formed by the text line information of the document data structures;
step 2: segmenting text line units in each page according to a segmentation algorithm by combining full text context and page context information according to n text line information lists in each page; the step 2 specifically comprises:
step 21: skipping over text line units without content;
step 22: setting a segment merging score with an initial value of 0 for each text line unit, and merging the text line unit with a previous text line unit if the segment merging score is greater than 0 when the process is ended;
step 23: setting a segment merging buffer, temporarily storing the text line unit determined to be merged in the buffer before determining that the merging of the whole segment is finished, and storing the font format of the text line in a set;
step 24: when the segment merging score of a new text line is detected to be less than or equal to 0, ending the merging of the current segment, clearing the segment merging buffer, and starting the segmentation of a new segment; the segment merging score calculation process is as follows:
step 31: if the font format of the current line exists in the font format set of the segment merging buffer, increasing the segment merging score by 5;
step 32: judging whether two adjacent text units are in the same line according to the line spacing curr_line.gap_top of the current line: if -1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8),
then the two adjacent text units are in the same line, curr_line.merge_score += 10, go to step 32; otherwise, the two adjacent text units are not in the same line, and step 33 is executed;
step 33: detecting the line spacing between texts: if the line spacing of the text line unit is too large, the two lines probably do not belong to the same paragraph, curr_line.merge_score -= 10, and step 34 and step 35 are performed; otherwise, the value of curr_line.merge_score is unchanged, and step 34 is performed;
step 34: traversing merge_buffer backwards from the previous line to find the front-most text line unit prev_line_start, and judging whether the left indents of the two lines are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
step 35: a word count check, based on the fact that, within a paragraph, a line does not contain significantly more words than the line before it: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged;
wherein gap_top_avg represents the average of the line spacing between adjacent text lines; gap_top_std represents the standard deviation of the line spacing between adjacent text lines; merge_score refers to the segment merging score of the current text line unit; prev_line.gap_top refers to the line spacing between the previous text line unit and the adjacent text line unit; merge_buffer refers to the segment merging buffer; prev_line_start.gap_left refers to the indentation difference between the previous text line unit and the adjacent text line; curr_line.gap_left refers to the indentation difference between the current text line unit and the adjacent text line; gap_left_avg refers to the average of the indentation differences between adjacent text lines; gap_left_std refers to the standard deviation of the indentation differences between adjacent text lines; curr_line.line_len refers to the word count of the current text line unit; len_std refers to the standard deviation of the word counts of the text lines; prev_line.line_len refers to the word count of the previous text line unit.
2. A document segmentation apparatus based on text lines, comprising:
the text line information acquisition module: extracting a page and a document data structure, and extracting the text line information from the document data structure corresponding to each text line; traversing each document data structure containing text lines of the full text, and respectively calculating full text context and page context information according to a text line information list formed by the text line information of the document data structures;
a segmentation and combination module: segmenting text line units in each page according to a segmentation algorithm by combining context information according to n text line information lists in each page acquired by a text line information acquisition module;
the segmentation and combination module specifically comprises:
a no content unit processing module: skipping over text line units without content;
text line unit merging module: firstly, setting a segment merging buffer, temporarily storing a text line unit which is determined to be merged in the buffer before the completion of the merging of the whole segment is determined, and storing the font format of the text line in a set; then setting a segment merging score with an initial value of 0 for each text line unit, merging the text line unit with the previous text line unit if the segment merging score of the new text line is greater than 0 when the process is ended, and putting the text line unit into a segment merging cache; if the segment merging score of the new text line is detected to be less than or equal to 0, ending the current segment merging, clearing the segment merging cache, and starting the new segment segmentation;
the segment merging score calculation process is as follows:
the font format judging module: if curr_line.line_font exists in the set font_set, then curr_line.merge_score += 5;
the same-line text line unit judgment module: judging whether two adjacent text units are in the same line through curr_line.gap_top: if -1 ≤ curr_line.gap_top < min(gap_top_avg - gap_top_std, 8),
then the two adjacent text units are in the same line, curr_line.merge_score += 10, and the same-line text line unit judgment module is executed; otherwise, the two adjacent text units are not in the same line, and the same paragraph text line unit judgment module is executed; curr_line.gap_top refers to the line spacing between the current text line unit and the adjacent text line unit;
the same paragraph text line unit judgment module: detecting the line spacing between texts: if |curr_line.gap_top - prev_line.gap_top| > gap_top_std, the line spacing of the text line unit is too large, which indicates that the two lines are probably not in the same paragraph, so curr_line.merge_score -= 10, and the text line unit indentation judgment module and the font checking module are executed; otherwise, the value of curr_line.merge_score is unchanged, and the text line unit indentation judgment module is executed;
text line unit indentation judging module: traversing merge_buffer backwards from the previous line to find the front-most text block prev_line_start, and judging whether the left indents of the two text units are consistent: if
-5 ≤ prev_line_start.gap_left - curr_line.gap_left < gap_left_avg - gap_left_std
then the indentation is consistent and curr_line.merge_score += 1; otherwise curr_line.merge_score -= 1;
the font checking module: a word count check, based on the fact that, within a paragraph, a line does not contain significantly more words than the line before it: if curr_line.line_len - prev_line.line_len > 2 × len_std, then curr_line.merge_score -= 5; otherwise the value of curr_line.merge_score is unchanged; prev_line.line_len refers to the word count of the previous text line unit;
wherein gap_top_avg represents the average of the line spacing between adjacent text lines; gap_top_std represents the standard deviation of the line spacing between adjacent text lines; merge_score refers to the segment merging score of the current text line unit; prev_line.gap_top refers to the line spacing between the previous text line unit and the adjacent text line unit; merge_buffer refers to the segment merging buffer; prev_line_start.gap_left refers to the indentation difference between the previous text line unit and the adjacent text line; curr_line.gap_left refers to the indentation difference between the current text line unit and the adjacent text line; gap_left_avg refers to the average of the indentation differences between adjacent text lines; gap_left_std refers to the standard deviation of the indentation differences between adjacent text lines; curr_line.line_len refers to the word count of the current text line unit; len_std refers to the standard deviation of the word counts of the text lines; line_font refers to the font format of the current text line unit; font_set represents the set in which the font formats are stored.
CN201710616443.7A 2017-07-26 2017-07-26 Document segmentation method and device based on text line Active CN107391457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710616443.7A CN107391457B (en) 2017-07-26 2017-07-26 Document segmentation method and device based on text line

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710616443.7A CN107391457B (en) 2017-07-26 2017-07-26 Document segmentation method and device based on text line

Publications (2)

Publication Number Publication Date
CN107391457A (en) 2017-11-24
CN107391457B (en) 2020-10-27

Family

ID=60341043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710616443.7A Active CN107391457B (en) 2017-07-26 2017-07-26 Document segmentation method and device based on text line

Country Status (1)

Country Link
CN (1) CN107391457B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009151B (en) * 2017-11-29 2021-04-16 深圳中泓在线股份有限公司 News text automatic segmentation method and device, server and readable storage medium
CN109948518B (en) * 2019-03-18 2023-06-09 武汉汉王大数据技术有限公司 Neural network-based PDF document content text paragraph aggregation method
CN110956019B (en) * 2019-11-27 2021-10-26 北大方正集团有限公司 List processing system, method, device and computer readable storage medium
CN113392653A (en) * 2020-03-13 2021-09-14 华为技术有限公司 Translation method, related device, equipment and computer readable storage medium
US11347928B2 (en) 2020-07-27 2022-05-31 International Business Machines Corporation Detecting and processing sections spanning processed document partitions
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876967A (en) * 2010-03-25 2010-11-03 深圳市万兴软件有限公司 Method for generating PDF text paragraphs
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN104317786A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for segmenting text paragraphs
CN106326854A (en) * 2016-08-19 2017-01-11 掌阅科技股份有限公司 Open fixed-layout document paragraph identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
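Layout Analysis of Book Pages; Richard Green et al.; 2013 28th International Conference on Image and Vision Computing New Zealand; 20131127; pp. 118-123 *
Offline handwritten text line segmentation based on higher-order correlation clustering; Yin Yalin et al.; Journal of Central China Normal University (Natural Sciences); 20170228; Vol. 51 (No. 1); pp. 18-22, 34 *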

Also Published As

Publication number Publication date
CN107391457A (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN107391457B (en) Document segmentation method and device based on text line
CN104298982B (en) Character recognition method and device
CN106250830B (en) Digital book structured analysis processing method
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
CN101719142B (en) Method for detecting picture characters by sparse representation based on classifying dictionary
CN104881458B (en) Method and device for annotating web page topics
US20150095769A1 (en) Layout Analysis Method And System
US9183636B1 (en) Line segmentation method
JP5664174B2 (en) Apparatus and method for extracting circumscribed rectangle of character from portable electronic file
CN112183511A (en) Method, system, storage medium and equipment for deriving table from image
CN111177445A (en) Standard primitive determining method, primitive identifying method and device and electronic equipment
US8924846B2 (en) Apparatus and method for text extraction
CN111368695A (en) Table structure extraction method
CN110765739A (en) Method for extracting table data and chapter structure from PDF document
WO2019041442A1 (en) Method and system for structural extraction of figure data, electronic device, and computer readable storage medium
US8768061B2 (en) Post optical character recognition determination of font size
CN115953797A (en) Form recognition method, document acquisition method, and storage medium
CN110795933B (en) Webpage text recognition processing method and device
CN109472020A (en) Feature-alignment Chinese word segmentation method
CN107203509B (en) Title generation method and device
CN101615255A (en) Method for multi-frame fusion of video text
CN113127595B (en) Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
KR20140031269A (en) Method and device for determining font
CN115983198A (en) Method, device and storage medium for extracting header or footer from PDF document
JPH06203020A (en) Method and device for recognizing and generating text format

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210115

Address after: 41401-41406, 14th floor, unit 1, building 4, No. 966, north section of Tianfu Avenue, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu hi tech Zone, Sichuan 610041

Patentee after: Chengdu Kelai Network Technology Co., Ltd

Address before: 13/F and 14/F, unit 1, building 4, No. 966, north section of Tianfu Avenue, high tech Zone, Chengdu, Sichuan 610041

Patentee before: COLASOFT Co.,Ltd.

CP03 Change of name, title or address

Address after: 610041 12th, 13th and 14th floors, unit 1, building 4, No. 966, north section of Tianfu Avenue, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan

Patentee after: Kelai Network Technology Co.,Ltd.

Address before: 41401-41406, 14th floor, unit 1, building 4, No. 966, north section of Tianfu Avenue, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu hi tech Zone, Sichuan 610041

Patentee before: Chengdu Kelai Network Technology Co.,Ltd.