CN109284495B - Method and device for performing table-free line table cutting on text - Google Patents
- Publication number
- CN109284495B (application CN201811304121.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- line
- cutting
- training data
- lines
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method for performing table-free line table cutting on a text, that is, for cutting tables without table lines out of a text, which comprises the following steps: cutting the text into lines, and obtaining line characteristic information of each text line and line content semantic information of a first text line; obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information; and cutting out the tables without table lines in the text by means of the table cutting model. The device for implementing the method comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model. The invention can replace rule-based methods and perform the task of cutting tables without table lines more conveniently and accurately; its effect is not degraded by changes in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency of the task while reducing its cost.
Description
Technical Field
The invention relates to a text processing method, in particular to a method and a device for performing table-free line table cutting on a text.
Background
At present, the range of a table with table lines can easily be determined from the information of its line frame. For a table without table lines, however, deciding whether a piece of text belongs to a table requires modeling both the image (table shape) and the semantics (text content). Such judgments are difficult to capture completely with a set of hand-written rules.
Disclosure of Invention
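In view of the above shortcomings, the invention provides a method and a device for performing table-free line table cutting on a text, which can obtain an accurate range for a table without table lines.
In order to achieve the above object, the present invention provides a method for performing table-free line table cutting on a text, comprising the following steps:
step 1, cutting lines of the text, and obtaining line characteristic information of each text line and line content semantic information of a first text line;
step 2, obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information;
and step 3, cutting out a table without table lines in the text through the table cutting model.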
In the above method for performing table-free line table cutting on a text, step 1 includes the following sub-steps:
step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines;
and step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of the first text line.
In the above method, PDF analysis is performed on the text to obtain the character coordinates in the text.
In the above method, the line characteristic information includes the spacing between adjacent text lines and the alignment relationship between each text line and the lines above and below it;
the line content semantic information includes the header of a text line and subject-related semantic text.
In the above method, in step 2, the line characteristic information is cleaned and preprocessed to generate the training data of the table cutting model.
In the above method for performing table-free line table cutting on a text, step 3 includes the following sub-steps:
step 31, cutting lines of the text identified by the character coordinates;
step 32, judging the category of each text line through the table cutting model;
step 33, merging the text lines through a classification rule to obtain the range where the table without table lines is located;
and step 34, cutting out the table without the table line according to the range.
The invention also provides a device for table-free line table cutting of the text, which comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;
the character coordinate acquisition module is used for acquiring character coordinates in the text;
the text line cutting module is used for cutting the text line according to the character coordinates to form a plurality of text lines;
the text line analysis module is used for analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line;
the training data acquisition module is used for acquiring training data of the table cutting model according to the line characteristic information and the line content semantic information;
and the table cutting model is used for cutting out a table without table lines in the text.
In the above device, the line characteristic information includes the spacing between adjacent text lines and the alignment relationship between each text line and the lines above and below it;
the line content semantic information includes the header of a text line and subject-related semantic text.
In the above device, the training data acquisition module cleans and preprocesses the line characteristic information to generate the training data of the table cutting model.
Compared with the prior art, the invention has the following advantages:
The invention can replace rule-based methods and perform the task of cutting tables without table lines more conveniently and accurately; its effect is not degraded by changes in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency of the task while reducing its cost.
Drawings
FIG. 1 is a flow chart of the method part of the present invention;
FIG. 2 is a block diagram showing the structure of the device part of the present invention.
The main reference numerals are explained below:
1 - character coordinate acquisition module; 2 - text line cutting module; 3 - text line analysis module; 4 - training data acquisition module; 5 - table cutting model
Detailed Description
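As shown in FIG. 1, the present invention provides a method for performing table-free line table cutting on a text, which comprises the following steps:
Step 1, cutting lines of the text, and obtaining line characteristic information of each text line and line content semantic information of a first text line.
Wherein, step 1 includes the following sub-steps: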
step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines;
and step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of the first text line.
PDF analysis is performed on the text to obtain the character coordinates in the text.
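Step 11 can be illustrated with a minimal Python sketch. The patent does not say how character coordinates are grouped into lines, so the `Char` record, the `cut_lines` function and the `tolerance` threshold below are all assumptions made for illustration; the coordinates themselves would come from whatever PDF parser is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """One character with its bounding box (hypothetical record layout)."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    top: float     # distance from the top of the page
    bottom: float


def cut_lines(chars: List[Char], tolerance: float = 2.0) -> List[List[Char]]:
    """Group characters into text lines by vertical position.

    Assumed rule: a character joins the current line when its top edge is
    within `tolerance` of the top edge of the line's first character.
    """
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: (c.top, c.x0)):
        if lines and abs(ch.top - lines[-1][0].top) <= tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Restore left-to-right reading order inside each line.
    return [sorted(line, key=lambda c: c.x0) for line in lines]
```

A fixed tolerance is the simplest choice; a real document with mixed font sizes would probably need a threshold derived from character height.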
The line characteristic information comprises the spacing between adjacent text lines and the alignment relationship between each text line and the lines above and below it;
the line content semantic information comprises the header of a text line and subject-related semantic text.
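The line characteristic information described here (spacing and up/down alignment) could be computed roughly as follows. This is only a sketch: the `TextLine` record, the feature names and the alignment tolerance are assumptions, and the semantic information (header and subject text) is left out because the patent does not describe how it is encoded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """A text line after line cutting (hypothetical record layout)."""
    top: float
    bottom: float
    cell_lefts: List[float]  # left x-coordinate of each text chunk in the line
    text: str


def aligned_count(a: List[float], b: List[float], tol: float = 1.5) -> int:
    """Count chunks in `a` whose left edge matches some chunk in `b`."""
    return sum(1 for x in a if any(abs(x - y) <= tol for y in b))


def line_features(lines: List[TextLine]) -> List[dict]:
    """Spacing to the neighbouring lines and left-edge alignment with them."""
    feats = []
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i > 0 else None
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        feats.append({
            "gap_above": line.top - prev.bottom if prev else None,
            "gap_below": nxt.top - line.bottom if nxt else None,
            "aligned_above": aligned_count(line.cell_lefts, prev.cell_lefts) if prev else 0,
            "aligned_below": aligned_count(line.cell_lefts, nxt.cell_lefts) if nxt else 0,
            "n_chunks": len(line.cell_lefts),
        })
    return feats
```

Lines of a table without table lines tend to share several left edges with their neighbours, which is why the alignment counts are kept separately for the line above and the line below.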
Step 2, obtaining training data of the table cutting model according to the line characteristic information and the line content semantic information.
The line characteristic information is cleaned and preprocessed to generate the training data of the table cutting model.
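The patent does not specify what cleaning and preprocessing involve. As one plausible reading, the sketch below drops unlabeled lines, fills missing spacing values and min-max scales each feature column; the feature names, the `build_training_data` function and the label convention (1 for a table line, 0 otherwise) are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical feature columns; the patent does not name them.
FEATURES = ["gap_above", "gap_below", "aligned_above", "aligned_below"]


def build_training_data(
    rows: List[Dict[str, Optional[float]]],
    labels: List[Optional[int]],
) -> Tuple[List[List[float]], List[int]]:
    """Turn per-line feature dicts into a cleaned, scaled training set."""
    # Cleaning: drop lines without a label, fill missing values with 0.0.
    kept = [(r, y) for r, y in zip(rows, labels) if y is not None]
    X = [[float(r.get(f) or 0.0) for f in FEATURES] for r, _ in kept]
    y = [label for _, label in kept]
    if not X:
        return X, y
    # Preprocessing: min-max scale every column to [0, 1].
    for j in range(len(FEATURES)):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        span = hi - lo
        for row in X:
            row[j] = (row[j] - lo) / span if span else 0.0
    return X, y
```

The resulting `X` and `y` could then be passed to any line-level classifier; the patent does not state which model family the table cutting model uses.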
Step 3, cutting out a table without table lines in the text through the table cutting model.
Wherein, in step 3, the following substeps are included:
step 31, cutting lines of the text identified by the character coordinates;
step 32, judging the category of each text line through a table cutting model;
step 33, merging the text lines through a classification rule to obtain the range where the table without table lines is located;
and step 34, cutting out the table without the table line according to the range.
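Steps 32 to 34 amount to turning per-line categories into contiguous line ranges. The patent does not disclose the classification rule used for merging, so the sketch below assumes a simple one: lines labeled 1 (table) are merged when at most `max_gap` non-table lines separate them, and ranges shorter than `min_lines` are discarded. Both parameters and the function name are hypothetical.

```python
from typing import List, Tuple


def merge_table_lines(
    labels: List[int], max_gap: int = 1, min_lines: int = 2
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) line-index ranges of detected tables."""
    ranges: List[Tuple[int, int]] = []
    start = end = None
    for i, label in enumerate(labels):
        if label != 1:
            continue
        if start is None:
            start = end = i                  # first table line of a new range
        elif i - end - 1 <= max_gap:
            end = i                          # close enough: extend the range
        else:
            ranges.append((start, end))      # too far: close and start anew
            start = end = i
    if start is not None:
        ranges.append((start, end))
    # Drop isolated lines that are unlikely to form a table on their own.
    return [(s, e) for s, e in ranges if e - s + 1 >= min_lines]
```

Given the text lines from step 31, a range `(s, e)` then selects `lines[s:e + 1]` as the table to cut out in step 34.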
As shown in FIG. 2, the present invention further provides a device for performing table-free line table cutting on a text, which includes a character coordinate acquisition module 1, a text line cutting module 2, a text line analysis module 3, a training data acquisition module 4, and a table cutting model 5.
The character coordinate acquisition module 1 is used for acquiring character coordinates in a text.
The character coordinate acquisition module 1 is a PDF analysis module and is used for performing PDF analysis on the text to acquire the character coordinates in the text.
The text line cutting module 2 is used for cutting the text into lines according to the character coordinates to form a plurality of text lines.
The text line analysis module 3 is configured to analyze each text line to obtain line characteristic information of each text line and line content semantic information of a first text line.
The line characteristic information comprises the spacing between adjacent text lines and the alignment relationship between each text line and the lines above and below it;
the line content semantic information comprises the header of a text line and subject-related semantic text.
The training data acquisition module 4 is configured to obtain training data of the table cutting model according to the line characteristic information and the line content semantic information.
The training data acquisition module 4 cleans and preprocesses the line characteristic information to generate the training data of the table cutting model.
The table cutting model 5 is used to cut out a table without table lines in the text.
The table cutting model 5 is implemented as follows:
cutting lines of the text identified by the character coordinates;
judging the category of each text line through the table cutting model;
merging the text lines through a classification rule to obtain the range where the table without table lines is located;
cutting out the table without table lines according to the range.
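The way the modules hand data to one another at cutting time can be sketched as a small pipeline. This is an architectural illustration only: the class name, field names and callable signatures are assumptions, and the training data acquisition module 4 does not appear because it is used when training the model, not when cutting.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class TableCuttingDevice:
    """Wires the modules together; every callable is a pluggable stand-in."""
    get_char_coords: Callable[[Any], Sequence[Any]]       # module 1
    cut_lines: Callable[[Sequence[Any]], List[Any]]       # module 2
    parse_lines: Callable[[List[Any]], List[Any]]         # module 3
    classify_line: Callable[[Any], int]                   # trained model 5
    merge: Callable[[List[int]], List[Tuple[int, int]]]   # merging rule

    def cut_tables(self, document: Any) -> List[List[Any]]:
        chars = self.get_char_coords(document)
        lines = self.cut_lines(chars)
        feats = self.parse_lines(lines)
        labels = [self.classify_line(f) for f in feats]
        # Each inclusive (start, end) range selects the lines of one table.
        return [lines[s:e + 1] for s, e in self.merge(labels)]
```

Keeping each module behind a plain callable makes it possible to swap the PDF parser or the classifier without touching the rest of the pipeline.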
The foregoing is merely a preferred embodiment of this invention, which is intended to be illustrative, and not limiting. It will be understood by those skilled in the art that various changes, modifications and equivalents may be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (4)
1. A method for performing table-free line table cutting on a text comprises the following steps:
step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line; the step 1 comprises the following substeps:
step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines; performing PDF analysis on the text to obtain character coordinates in the text;
step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;
step 2, obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information;
step 3, cutting out a table without table lines from the text through a table cutting model; in the step 3, the following substeps are included:
step 31, cutting lines of the text identified by the character coordinates;
step 32, judging the category of each text line through a table cutting model;
step 33, merging the text lines through a classification rule to obtain the range where the table without table lines is located;
and step 34, cutting out the table without the table line according to the range.
2. The method of claim 1, wherein in step 2, the line characteristic information is cleaned and preprocessed to generate the training data of the table cutting model.
3. The device for implementing the method for performing the table-free line table cutting on the text according to claim 1 is characterized by comprising a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;
the character coordinate acquisition module is used for acquiring character coordinates in the text;
the text line cutting module is used for cutting the text line according to the character coordinates to form a plurality of text lines;
the text line analysis module is used for analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance interval between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;
the training data acquisition module is used for acquiring training data of the table cutting model according to the line characteristic information and the line content semantic information;
and the table cutting model is used for cutting out a table without table lines in the text.
4. The device of claim 3, wherein the training data acquisition module cleans and preprocesses the line characteristic information to generate the training data of the table cutting model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811304121.XA CN109284495B (en) | 2018-11-03 | 2018-11-03 | Method and device for performing table-free line table cutting on text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109284495A (en) | 2019-01-29 |
CN109284495B (en) | 2023-02-07 |
Family
ID=65175391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811304121.XA Active CN109284495B (en) | 2018-11-03 | 2018-11-03 | Method and device for performing table-free line table cutting on text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284495B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110032718B (en) * | 2019-04-12 | 2023-04-18 | 广州广燃设计有限公司 | Table conversion method, system and storage medium |
CN110210440B (en) * | 2019-06-11 | 2021-04-27 | 中国农业银行股份有限公司 | Table image layout analysis method and system |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08185475A (en) * | 1994-12-28 | 1996-07-16 | Matsushita Electric Ind Co Ltd | Picture recognition device |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN102722475A (en) * | 2012-05-09 | 2012-10-10 | 深圳市万兴软件有限公司 | Method for converting form in portable document format (PDF) document into Excel form |
CN103632388A (en) * | 2013-12-19 | 2014-03-12 | 百度在线网络技术(北京)有限公司 | Semantic annotation method, device and client for image |
CN104094282A (en) * | 2012-01-23 | 2014-10-08 | 微软公司 | Borderless table detection engine |
CN104268545A (en) * | 2014-09-15 | 2015-01-07 | 同方知网(北京)技术有限公司 | Method for table area recognition and content rasterization in electronic document layout files |
CN104517112A (en) * | 2013-09-29 | 2015-04-15 | 北大方正集团有限公司 | Table recognition method and system |
CN105045769A (en) * | 2015-06-01 | 2015-11-11 | 中国人民解放军装备学院 | Structure recognition based Web table information extraction method |
CN105426834A (en) * | 2015-11-17 | 2016-03-23 | 中国传媒大学 | Projection feature and structure feature based form image detection method |
CN105512611A (en) * | 2015-11-25 | 2016-04-20 | 成都数联铭品科技有限公司 | Detection and identification method for form image |
CN105589841A (en) * | 2016-01-15 | 2016-05-18 | 同方知网(北京)技术有限公司 | Portable document format (PDF) document form identification method |
CN105988979A (en) * | 2015-02-16 | 2016-10-05 | 北京邮电大学 | Form extraction method and device based on PDF (Portable Document Format) file |
CN106156239A (en) * | 2015-04-27 | 2016-11-23 | 中国移动通信集团公司 | A kind of form abstracting method and device |
CN107679024A (en) * | 2017-09-11 | 2018-02-09 | 畅捷通信息技术股份有限公司 | The method of identification form, system, computer equipment, readable storage medium storing program for executing |
CN108416279A (en) * | 2018-02-26 | 2018-08-17 | 阿博茨德(北京)科技有限公司 | Form analysis method and device in file and picture |
CN108470021A (en) * | 2018-03-26 | 2018-08-31 | 阿博茨德(北京)科技有限公司 | The localization method and device of table in PDF document |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6757870B1 (en) * | 2000-03-22 | 2004-06-29 | Hewlett-Packard Development Company, L.P. | Automatic table detection method and system |
US10706218B2 (en) * | 2016-05-16 | 2020-07-07 | Linguamatics Ltd. | Extracting information from tables embedded within documents |
Non-Patent Citations (3)
Title |
---|
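Ermelinda Oro; Massimo Ruffolo. PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents. 2009 10th International Conference on Document Analysis and Recognition. 2009. * |
Fang Jing; Gao Liangcai; Qiu Ruiheng; Tang Zhi. Automatic table detection and performance evaluation for fixed-layout electronic documents. Acta Scientiarum Naturalium Universitatis Pekinensis. 2012-10-26, Vol. 49 (No. 1), pp. 45-53. * |
Bu Feiyu et al. Discrimination of tables and graphics in layout analysis. Computer Engineering and Applications. 2004-12-01 (No. 12), pp. 86-90. * |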
Also Published As
Publication number | Publication date |
---|---|
CN109284495A (en) | 2019-01-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||