CN109284495B - Method and device for performing table-free line table cutting on text - Google Patents

Method and device for performing table-free line table cutting on text

Info

Publication number
CN109284495B
Authority
CN
China
Prior art keywords
text
line
cutting
training data
lines
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811304121.XA
Other languages
Chinese (zh)
Other versions
CN109284495A (en)
Inventor
李鹏辉
竺晨曦
邱锡鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Alphainsight Technology Co ltd
Original Assignee
Shanghai Alphainsight Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Alphainsight Technology Co ltd
Priority to CN201811304121.XA
Publication of CN109284495A
Application granted
Publication of CN109284495B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for performing table-free line table cutting on a text, comprising the following steps: cutting the text into lines, and obtaining line feature information of each text line and line content semantic information of a first text line; obtaining training data for a table cutting model from the line feature information and the line content semantic information; and cutting out the tables without table lines from the text by means of the table cutting model. The device implementing the method comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model. The invention can replace rule-based methods and perform the cutting of tables without table lines more conveniently and accurately; its effect is not degraded by changes in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency of the task while reducing its cost.

Description

Method and device for performing table-free line table cutting on text
Technical Field
The invention relates to a text processing method, in particular to a method and a device for performing table-free line table cutting on a text.
Background
At present, the extent of a table that has table lines can easily be determined from its wire-frame information. For a table without table lines, however, whether a region of text belongs to a table must be determined by modeling both the image (the shape of the table) and the semantics (the text content). Such judgments are difficult to capture completely in a fixed set of rules.
Disclosure of Invention
To address the above shortcomings, the invention provides a method and a device for performing table-free line table cutting on a text, which can obtain an accurate range for a table without table lines.
In order to achieve the above object, the present invention provides a method for performing table-free line table cutting on a text, comprising the following steps:
step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line;
step 2, obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information;
and step 3, cutting out the tables without table lines from the text by means of the table cutting model.
In the above method for performing table-free line table cutting on a text, step 1 includes the following sub-steps:
step 11, obtaining character coordinates in the text, and cutting the text into lines according to the character coordinates to form a plurality of text lines;
and step 12, analyzing each text line to obtain line feature information of each text line and line content semantic information of the first text line.
In the above method, the character coordinates in the text are obtained by performing PDF parsing on the text.
In the above method, the line feature information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;
the line content semantic information includes the header of a text line and subject-related semantic text.
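As a rough illustration of what line content semantic information might look like in practice, the sketch below derives a few semantic indicators from the text of one line. The vocabularies `HEADER_TERMS` and `SUBJECT_TERMS`, the function `semantic_flags`, and the numeric-ratio indicator are all hypothetical; the source does not specify how the header or subject text is encoded.

```python
# Hypothetical vocabularies: the source does not list concrete header or subject terms.
HEADER_TERMS = {"item", "amount", "year", "total", "unit"}
SUBJECT_TERMS = {"revenue", "assets", "liabilities", "profit", "cost"}


def semantic_flags(line_text: str) -> dict:
    """Return simple semantic indicators for one text line."""
    # Lower-case whitespace-separated tokens and strip surrounding punctuation.
    tokens = [t.strip(".,:;()").lower() for t in line_text.split()]
    tokens = [t for t in tokens if t]
    # A token counts as numeric if it is digits with optional thousands
    # separators and at most one decimal point.
    numeric = [
        t for t in tokens
        if t.replace(",", "").replace(".", "", 1).isdigit()
    ]
    return {
        "has_header_term": any(t in HEADER_TERMS for t in tokens),
        "has_subject_term": any(t in SUBJECT_TERMS for t in tokens),
        "numeric_ratio": len(numeric) / len(tokens) if tokens else 0.0,
    }
```

Indicators of this kind could be appended to the layout features of each line before training; a real system would more likely use a learned text representation than fixed keyword sets.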
In the above method, in step 2, the line feature information is cleaned and preprocessed to generate the training data of the table cutting model.
In the above method, step 3 includes the following sub-steps:
step 31, cutting into lines the text identified by the character coordinates;
step 32, judging the category of each text line by means of the table cutting model;
step 33, merging the text lines according to classification rules to obtain the range in which the table without table lines is located;
and step 34, cutting out the table without table lines according to the range.
The invention also provides a device for performing table-free line table cutting on a text, which comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;
the character coordinate acquisition module is used for acquiring character coordinates in the text;
the text line cutting module is used for cutting the text into lines according to the character coordinates to form a plurality of text lines;
the text line analysis module is used for analyzing each text line to obtain line feature information of each text line and line content semantic information of a first text line;
the training data acquisition module is used for acquiring training data of the table cutting model according to the line feature information and the line content semantic information;
and the table cutting model is used for cutting out the tables without table lines from the text.
In the above device, the line feature information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;
the line content semantic information includes the header of a text line and subject-related semantic text.
In the above device, the training data acquisition module cleans and preprocesses the line feature information to generate the training data of the table cutting model.
Compared with the prior art, the invention has the following advantages:
the invention can replace rule-based methods and perform the cutting of tables without table lines more conveniently and accurately; its effect is not degraded by changes in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency of the task while reducing its cost.
Drawings
FIG. 1 is a flow chart of the method part of the present invention;
FIG. 2 is a block diagram showing the structure of the device part of the present invention.
The main reference numerals are explained below:
1-character coordinate acquisition module; 2-text line cutting module; 3-text line analysis module; 4-training data acquisition module; 5-table cutting model
Detailed Description
As shown in fig. 1, the present invention provides a method for performing table-free line table cutting on a text, which comprises the following steps:
step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line.
Wherein, in step 1, the following substeps are included:
step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines;
and step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of the first text line.
The character coordinates in the text are obtained by performing PDF parsing on the text.
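Step 11 can be sketched as follows, assuming each character already comes with a bounding box from a PDF parser. The `Char` record, the function `cut_lines`, and the tolerance `y_tolerance` are illustrative assumptions, not details given in the source; the grouping rule (characters whose vertical centres lie within a tolerance belong to the same line) is one simple possibility.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF convention: y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def cut_lines(chars, y_tolerance=2.0):
    """Group characters into text lines by the vertical position of their centres."""
    lines = []
    # Visit characters from the top of the page to the bottom, then left to right.
    for ch in sorted(chars, key=lambda c: (-(c.y0 + c.y1) / 2, c.x0)):
        centre = (ch.y0 + ch.y1) / 2
        if lines and abs(lines[-1]["centre"] - centre) <= y_tolerance:
            # Close enough to the current line's reference centre: same line.
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"centre": centre, "chars": [ch]})
    # Restore left-to-right reading order inside each line.
    return [sorted(line["chars"], key=lambda c: c.x0) for line in lines]
```

The tolerance would need tuning per font size; superscripts or mixed font sizes on one line may call for an overlap-based rule instead of a centre-based one.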
The line feature information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;
the line content semantic information includes the header of a text line and subject-related semantic text.
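The two kinds of line feature information named above, spacing and vertical alignment, can be computed along these lines. This is a minimal sketch under stated assumptions: each line is a dictionary with a bottom edge `y0`, a top edge `y1` (PDF convention, y grows upward) and a list `starts` holding the left x coordinate of each text chunk; these names, the function `line_features`, and the tolerance `x_tolerance` are hypothetical.

```python
def _aligned_share(line, other, tol):
    """Share of chunk starts in `line` that match a chunk start in `other`."""
    if other is None or not line["starts"]:
        return 0.0
    hits = sum(
        1 for x in line["starts"]
        if any(abs(x - o) <= tol for o in other["starts"])
    )
    return hits / len(line["starts"])


def line_features(lines, x_tolerance=1.5):
    """Compute spacing and alignment features for lines ordered top to bottom."""
    features = []
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i > 0 else None
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        features.append({
            # Vertical gap to the neighbouring lines (0.0 at the page edges).
            "gap_above": prev["y0"] - line["y1"] if prev else 0.0,
            "gap_below": line["y0"] - nxt["y1"] if nxt else 0.0,
            # How well this line's columns line up with its neighbours.
            "aligned_above": _aligned_share(line, prev, x_tolerance),
            "aligned_below": _aligned_share(line, nxt, x_tolerance),
        })
    return features
```

Rows of a table without table lines tend to show small, regular gaps and high alignment shares with their neighbours, whereas body paragraphs usually align only at the left margin.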
Step 2, obtaining training data of the table cutting model according to the line feature information and the line content semantic information.
The line feature information is cleaned and preprocessed to generate the training data of the table cutting model.
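The source does not say what the cleaning and preprocessing consist of. One plausible reading, sketched below, is to discard lines with missing labels or unusable feature values and then scale each feature to a common range. The feature names in `FEATURE_KEYS`, the 0/1 label convention, and the function `build_training_data` are assumptions made for illustration.

```python
import math

# Hypothetical feature names; the source only mentions spacing and alignment.
FEATURE_KEYS = ["gap_above", "gap_below", "aligned_above", "aligned_below"]


def _is_clean(features):
    """True if every expected feature is present and is a finite number."""
    return all(
        isinstance(features.get(k), (int, float)) and math.isfinite(features[k])
        for k in FEATURE_KEYS
    )


def _scale(value, low, high):
    """Min-max scale to [0, 1]; constant features map to 0.0."""
    return 0.0 if high == low else (value - low) / (high - low)


def build_training_data(rows):
    """rows: (feature_dict, label) pairs, label 1 for table lines and 0 otherwise."""
    # Cleaning: keep only rows with a valid label and valid feature values.
    cleaned = [(f, y) for f, y in rows if y in (0, 1) and _is_clean(f)]
    if not cleaned:
        return [], []
    # Preprocessing: scale each feature using the range seen in the cleaned data.
    lows = {k: min(f[k] for f, _ in cleaned) for k in FEATURE_KEYS}
    highs = {k: max(f[k] for f, _ in cleaned) for k in FEATURE_KEYS}
    X = [[_scale(f[k], lows[k], highs[k]) for k in FEATURE_KEYS] for f, _ in cleaned]
    y = [label for _, label in cleaned]
    return X, y
```

The resulting `X` and `y` lists are in the shape most classifier libraries expect; the scaling ranges would have to be stored and reused when the trained model is applied to new documents.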
Step 3, cutting out the tables without table lines from the text by means of the table cutting model.
Wherein, step 3 includes the following sub-steps:
step 31, cutting into lines the text identified by the character coordinates;
step 32, judging the category of each text line by means of the table cutting model;
step 33, merging the text lines according to classification rules to obtain the range in which the table without table lines is located;
and step 34, cutting out the table without table lines according to the range.
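Steps 33 and 34 can be sketched as below, taking the per-line categories from step 32 as given. The source does not state the classification rules, so the rule used here is an assumption: runs of lines labelled `"table"` are merged, up to `max_gap` non-table lines inside a run are tolerated, and runs shorter than `min_lines` are dropped. The label strings and function names are hypothetical.

```python
def merge_table_ranges(labels, max_gap=1, min_lines=2):
    """Merge per-line labels into inclusive (start, end) index ranges of tables."""
    ranges = []
    start = last = None
    for i, label in enumerate(labels):
        if label != "table":
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            # Few enough non-table lines since the last table line: same table.
            last = i
        else:
            ranges.append((start, last))
            start = last = i
    if start is not None:
        ranges.append((start, last))
    # Drop runs that are too short to be a plausible table.
    return [(s, e) for s, e in ranges if e - s + 1 >= min_lines]


def cut_tables(lines, labels, **kwargs):
    """Return the text lines of each detected table, one list per table."""
    return [lines[s:e + 1] for s, e in merge_table_ranges(labels, **kwargs)]
```

Tolerating a small gap lets a single misclassified row stay inside its table, at the cost of occasionally merging two tables separated by one line of text.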
As shown in FIG. 2, the present invention further provides a device for performing table-free line table cutting on a text, which includes a character coordinate acquisition module 1, a text line cutting module 2, a text line analysis module 3, a training data acquisition module 4, and a table cutting model 5.
The character coordinate acquiring module 1 is used for acquiring character coordinates in a text.
The character coordinate acquisition module is a PDF analysis module and is used for performing PDF analysis on the text to acquire character coordinates in the text.
The text line cutting module 2 is used for cutting the text into lines according to the character coordinates to form a plurality of text lines.
The text line parsing module 3 is configured to parse each text line to obtain line feature information of each text line and line content semantic information of a first text line.
The line feature information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;
the line content semantic information includes the header of a text line and subject-related semantic text.
The training data obtaining module 4 is configured to obtain training data of the table-cutting model according to the line feature information and the line content semantic information.
The training data acquisition module 4 cleans and preprocesses the line feature information to generate the training data of the table cutting model.
The table cutting model 5 is used for cutting out the tables without table lines from the text.
The table cutting model 5 is implemented as follows:
cutting into lines the text identified by the character coordinates;
judging the category of each text line by means of the table cutting model;
merging the text lines according to classification rules to obtain the range in which the table without table lines is located;
cutting out the table without table lines according to the range.
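The way the five parts of the device fit together can be sketched as a small container whose fields stand for modules 1 to 4 and model 5. This is only an illustration of the data flow: the class `TableCuttingDevice`, its field and method names, and the `"table"` label are hypothetical, and each module is represented by a plain callable supplied by the caller.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Any, Callable


@dataclass
class TableCuttingDevice:
    get_char_coords: Callable[[Any], list]            # module 1
    cut_lines: Callable[[list], list]                 # module 2
    parse_lines: Callable[[list], list]               # module 3
    build_training_data: Callable[[list, list], Any]  # module 4
    model: Callable[[list], list]                     # model 5: one label per line

    def _features(self, document):
        """Run modules 1 to 3 and return the lines with their features."""
        lines = self.cut_lines(self.get_char_coords(document))
        return lines, self.parse_lines(lines)

    def training_data(self, document, labels):
        """Training path: features plus human labels go to module 4."""
        _, features = self._features(document)
        return self.build_training_data(features, labels)

    def cut(self, document):
        """Inference path: label each line, then cut out runs of table lines."""
        lines, features = self._features(document)
        labels = self.model(features)
        tables, i = [], 0
        for label, group in groupby(labels):
            n = len(list(group))
            if label == "table":
                tables.append(lines[i:i + n])
            i += n
        return tables
```

Keeping the modules as injected callables means the PDF parser or the classifier can be swapped without touching the rest of the pipeline.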
The foregoing is merely a preferred embodiment of this invention, which is intended to be illustrative, and not limiting. It will be understood by those skilled in the art that various changes, modifications and equivalents may be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A method for performing table-free line table cutting on a text comprises the following steps:
step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line; the step 1 comprises the following substeps:
step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines; performing PDF analysis on the text to obtain character coordinates in the text;
step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;
step 2, obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information;
step 3, cutting out a table without table lines from the text through a table cutting model; in the step 3, the following substeps are included:
step 31, cutting lines of the text identified by the character coordinates;
step 32, judging the category of each text line through a table cutting model;
step 33, merging the text lines according to classification rules to obtain the range in which the table without table lines is located;
and step 34, cutting out the table without the table line according to the range.
2. The method of claim 1, wherein in step 2, the line feature information is cleaned and preprocessed to generate training data of the table-cutting model.
3. The device for implementing the method for performing the table-free line table cutting on the text according to claim 1 is characterized by comprising a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;
the character coordinate acquisition module is used for acquiring character coordinates in the text;
the text line cutting module is used for cutting the text into lines according to the character coordinates to form a plurality of text lines;
the text line analysis module is used for analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance interval between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;
the training data acquisition module is used for acquiring training data of the table cutting model according to the line characteristic information and the line content semantic information;
and the table cutting model is used for cutting out a table without table lines in the text.
4. The device of claim 3, wherein the training data acquisition module cleans and preprocesses the line feature information to generate training data for the table cutting model.
CN201811304121.XA 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text Active CN109284495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811304121.XA CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811304121.XA CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Publications (2)

Publication Number Publication Date
CN109284495A (en) 2019-01-29
CN109284495B (en) 2023-02-07

Family

ID=65175391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811304121.XA Active CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Country Status (1)

Country Link
CN (1) CN109284495B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032718B (en) * 2019-04-12 2023-04-18 广州广燃设计有限公司 Table conversion method, system and storage medium
CN110210440B (en) * 2019-06-11 2021-04-27 中国农业银行股份有限公司 Table image layout analysis method and system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08185475A (en) * 1994-12-28 1996-07-16 Matsushita Electric Ind Co Ltd Picture recognition device
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN103632388A (en) * 2013-12-19 2014-03-12 百度在线网络技术(北京)有限公司 Semantic annotation method, device and client for image
CN104094282A (en) * 2012-01-23 2014-10-08 微软公司 Borderless table detection engine
CN104268545A (en) * 2014-09-15 2015-01-07 同方知网(北京)技术有限公司 Method for table area recognition and content rasterization in electronic document layout files
CN104517112A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Table recognition method and system
CN105045769A (en) * 2015-06-01 2015-11-11 中国人民解放军装备学院 Structure recognition based Web table information extraction method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
CN105512611A (en) * 2015-11-25 2016-04-20 成都数联铭品科技有限公司 Detection and identification method for form image
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device
CN107679024A (en) * 2017-09-11 2018-02-09 畅捷通信息技术股份有限公司 The method of identification form, system, computer equipment, readable storage medium storing program for executing
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757870B1 (en) * 2000-03-22 2004-06-29 Hewlett-Packard Development Company, L.P. Automatic table detection method and system
US10706218B2 (en) * 2016-05-16 2020-07-07 Linguamatics Ltd. Extracting information from tables embedded within documents

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08185475A (en) * 1994-12-28 1996-07-16 Matsushita Electric Ind Co Ltd Picture recognition device
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN104094282A (en) * 2012-01-23 2014-10-08 微软公司 Borderless table detection engine
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN104517112A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Table recognition method and system
CN103632388A (en) * 2013-12-19 2014-03-12 百度在线网络技术(北京)有限公司 Semantic annotation method, device and client for image
CN104268545A (en) * 2014-09-15 2015-01-07 同方知网(北京)技术有限公司 Method for table area recognition and content rasterization in electronic document layout files
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device
CN105045769A (en) * 2015-06-01 2015-11-11 中国人民解放军装备学院 Structure recognition based Web table information extraction method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
CN105512611A (en) * 2015-11-25 2016-04-20 成都数联铭品科技有限公司 Detection and identification method for form image
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN107679024A (en) * 2017-09-11 2018-02-09 畅捷通信息技术股份有限公司 The method of identification form, system, computer equipment, readable storage medium storing program for executing
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ermelinda Oro ; Massimo Ruffolo.PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents.《2009 10th International Conference on Document Analysis and Recognition》.2009, *
版式电子文档表格自动检测与性能评估;房婧; 高良才; 仇睿恒; 汤帜;《北京大学学报(自然科学版) 》;20121026;第49卷(第1期);第45-53页 *
版面分析中表格与图形的鉴别;卜飞宇等;《计算机工程与应用》;20041201(第12期);第86-90页 *

Also Published As

Publication number Publication date
CN109284495A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
Li et al. Rhythmic brushstrokes distinguish van Gogh from his contemporaries: findings via automated brushstroke extraction
US7843450B2 (en) System and method for filtering point clouds
TWI536277B (en) Form identification method and device
KR20190026641A (en) Method of character recognition of claims document, apparatus, server and storage medium
CN110503054B (en) Text image processing method and device
CN109284495B (en) Method and device for performing table-free line table cutting on text
US20130034302A1 (en) Character recognition apparatus, character recognition method and program
CN103646247A (en) Music score recognition method
WO2019041442A1 (en) Method and system for structural extraction of figure data, electronic device, and computer readable storage medium
CN110472054B (en) Data processing method and device
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN111797772B (en) Invoice image automatic classification method, system and device
CN100492403C (en) Character image line selecting method and device and character image identifying method and device
CN110348714A (en) Based on code log to the method for the output level evaluation of research staff
CN111597805B (en) Method and device for auditing short message text links based on deep learning
CN111241897A (en) Industrial checklist digitization by inferring visual relationships
CN111427996B (en) Method and device for extracting date and time from man-machine interaction text
US20220172458A1 (en) Individual identification information generation method, individual identification information generation device, and program
CN116596921A (en) Method and system for sorting incinerator slag
CN116091481A (en) Spike counting method, device, equipment and storage medium
CN112084103A (en) Interface test method, device, equipment and medium
CN113191351B (en) Reading identification method and device of digital electric meter and model training method and device
CN114490929A (en) Bidding information acquisition method and device, storage medium and terminal equipment
CN110135426B (en) Sample labeling method and computer storage medium
CN113936692A (en) Big data quality inspection method of customer service voice text based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant