CN109284495A - Method and device for cutting tables without table lines from text - Google Patents

Method and device for cutting tables without table lines from text

Info

Publication number
CN109284495A
Authority
CN
China
Prior art keywords
text
line
row
cutting
cut
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811304121.XA
Other languages
Chinese (zh)
Other versions
CN109284495B (en)
Inventor
李鹏辉
竺晨曦
邱锡鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Rhinoceros Technology Co Ltd
Original Assignee
Shanghai Rhinoceros Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Rhinoceros Technology Co Ltd
Priority to CN201811304121.XA
Publication of CN109284495A
Application granted
Publication of CN109284495B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/177: Editing, e.g. inserting or deleting of tables; using ruled lines

Abstract

The present invention provides a method for cutting tables without table lines from text, comprising: performing line cutting on the text, and obtaining the line feature information of each text line and the line-content semantic information of the first text line; obtaining training data for a table-cutting model from the line feature information and the line-content semantic information; and cutting the tables without table lines out of the text with the table-cutting model. A device implementing the method comprises a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module, and a table-cutting model. The invention can replace rule-based methods and perform the table-cutting task for tables without table lines more conveniently and accurately. Its results are not affected by changes in the style of such tables, so it is widely applicable and can greatly improve the accuracy, cost, and efficiency of the task.

Description

Method and device for cutting tables without table lines from text
Technical field
The present invention relates to a text processing method, and in particular to a method and device for cutting tables without table lines (borderless tables) from text.
Background
At present, the range of a table that has table lines can easily be determined from its frame lines. For a table without table lines, however, whether a region belongs to a table must be judged by modelling two aspects jointly: the image (the form of the table) and the semantics (the text content). It is difficult to capture these judgments completely in a set of hand-written rules.
Summary of the invention
To address the shortcomings described above, the present invention provides a method and device for cutting tables without table lines from text that can obtain the range of such a table accurately.
To achieve the above object, the present invention provides a method for cutting tables without table lines from text, comprising the following steps:
Step 1: perform line cutting on the text, and obtain the line feature information of each text line and the line-content semantic information of the first text line;
Step 2: obtain the training data for a table-cutting model from the line feature information and the line-content semantic information;
Step 3: cut the tables without table lines out of the text with the table-cutting model.
In the above method, step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form a plurality of text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line.
In the above method, PDF parsing is performed on the text to obtain the text coordinates in the text.
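Step 11 can be illustrated with a minimal sketch. It assumes that a PDF parser has already produced one bounding box per word; the `Word` fields, the `cut_lines` function, and the `y_tolerance` threshold are hypothetical names chosen for this illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word with its coordinates (hypothetical layout, origin at top-left)."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    top: float     # distance of the upper edge from the top of the page
    bottom: float  # distance of the lower edge from the top of the page


def cut_lines(words, y_tolerance=2.0):
    """Group words into text lines by their vertical position.

    y_tolerance is an assumed threshold: a word joins the current line
    when its top edge is within this distance of the line's first word.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if lines and abs(word.top - lines[-1][0].top) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w.x0) for line in lines]


words = [
    Word("Revenue", 50, 95, 100.0, 110.0),
    Word("2017", 200, 225, 100.5, 110.5),
    Word("Item", 50, 75, 80.0, 90.0),
    Word("Year", 200, 225, 80.0, 90.0),
]
lines = cut_lines(words)
```

Here the four words end up in two text lines, "Item Year" and "Revenue 2017", even though the input order was scrambled and the tops differ slightly.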
In the above method, the line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
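As a rough sketch of what such features could look like, the code below computes the vertical gaps to the neighbouring lines, counts how many cells start at the same x position as a cell in a neighbouring line, and counts header-like words. The `TextLine` structure, the feature names, the `HEADER_WORDS` vocabulary, and the tolerance are all assumptions made for illustration; the patent does not specify them, and the "subject" feature is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    top: float
    bottom: float
    cells: list  # (text, x0, x1) tuples ordered left to right


# Hypothetical vocabulary of words that often appear in table headers.
HEADER_WORDS = {"item", "year", "amount", "total", "unit"}


def line_features(lines, index, align_tolerance=3.0):
    """Return spacing, alignment, and simple semantic features for one line."""
    line = lines[index]
    prev = lines[index - 1] if index > 0 else None
    nxt = lines[index + 1] if index + 1 < len(lines) else None

    def aligned_count(other):
        # Number of cells whose left edge matches a left edge in `other`.
        if other is None:
            return 0
        starts = [x0 for _, x0, _ in other.cells]
        return sum(
            any(abs(x0 - s) <= align_tolerance for s in starts)
            for _, x0, _ in line.cells
        )

    return {
        "gap_above": line.top - prev.bottom if prev is not None else 0.0,
        "gap_below": nxt.top - line.bottom if nxt is not None else 0.0,
        "aligned_with_prev": aligned_count(prev),
        "aligned_with_next": aligned_count(nxt),
        "n_cells": len(line.cells),
        "header_hits": sum(t.lower() in HEADER_WORDS for t, _, _ in line.cells),
    }


lines = [
    TextLine(80.0, 90.0, [("Item", 50, 75), ("Year", 200, 225)]),
    TextLine(100.0, 110.0, [("Revenue", 50, 95), ("2017", 201, 226)]),
    TextLine(140.0, 150.0, [("The figures above are unaudited.", 50, 300)]),
]
features = line_features(lines, 1)
header = line_features(lines, 0)
```

The second line is close to the line above it and shares both column starts with it, while the larger gap below and the single shared start hint that the table ends there.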
In the above method, in step 2, the line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
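The patent does not say what cleaning and preprocessing consist of. One plausible reading, sketched below under that assumption, drops records with missing or invalid feature values and then rescales each feature to the range 0 to 1. The function name, the record layout, and the feature names are hypothetical.

```python
def build_training_data(records, feature_names):
    """Turn (feature dict, label) records into a feature matrix and labels.

    Cleaning (assumed): drop records where a feature is missing,
    non-numeric, NaN, or negative.
    Preprocessing (assumed): min-max scale every feature to [0, 1].
    """
    clean = [
        (feats, label)
        for feats, label in records
        if all(
            isinstance(feats.get(name), (int, float)) and feats[name] >= 0
            for name in feature_names
        )
    ]
    if not clean:
        return [], []

    lows = {n: min(f[n] for f, _ in clean) for n in feature_names}
    highs = {n: max(f[n] for f, _ in clean) for n in feature_names}

    X = [
        [
            (f[n] - lows[n]) / (highs[n] - lows[n]) if highs[n] > lows[n] else 0.0
            for n in feature_names
        ]
        for f, _ in clean
    ]
    y = [label for _, label in clean]
    return X, y


records = [
    ({"gap_above": 10.0, "aligned_with_prev": 2}, 1),
    ({"gap_above": 30.0, "aligned_with_prev": 0}, 0),
    ({"gap_above": None, "aligned_with_prev": 1}, 0),  # dropped by cleaning
    ({"gap_above": 20.0, "aligned_with_prev": 1}, 1),
]
X, y = build_training_data(records, ["gap_above", "aligned_with_prev"])
```

Labels here use 1 for a line inside a table and 0 for any other line; that encoding is also an assumption of the sketch.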
In the above method, step 3 comprises the following sub-steps:
Step 31: cut the text into lines using the identified text coordinates;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to a classification rule to obtain the range of the table without table lines;
Step 34: cut out the table without table lines according to the range.
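Step 33 can be sketched as merging consecutive lines that the model classified as table lines into ranges. The classification rule is not spelled out in the patent; the rule used here (a run of at least `min_lines` consecutive table lines forms one table) and the label encoding (1 for table, 0 otherwise) are assumptions for illustration.

```python
def merge_table_ranges(labels, min_lines=2):
    """Merge consecutive table-labelled lines into (first, last) index ranges.

    Runs shorter than min_lines are discarded, on the assumption that a
    single isolated line is more likely a misclassification than a table.
    """
    ranges = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i                      # a run of table lines begins
        elif label != 1 and start is not None:
            if i - start >= min_lines:     # the run ended on the previous line
                ranges.append((start, i - 1))
            start = None
    # Close a run that extends to the last line.
    if start is not None and len(labels) - start >= min_lines:
        ranges.append((start, len(labels) - 1))
    return ranges


labels = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
ranges = merge_table_ranges(labels)
```

With these labels the sketch reports two tables, covering lines 1 to 3 and lines 8 to 9, and ignores the isolated table line at index 6.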
The present invention also provides a device for cutting tables without table lines from text, comprising a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module, and a table-cutting model;
the text coordinate acquisition module is used to obtain the text coordinates in the text;
the text line cutting module is used to cut the text into lines according to the text coordinates to form a plurality of text lines;
the text line parsing module is used to parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line;
the training data acquisition module is used to obtain the training data for the table-cutting model from the line feature information and the line-content semantic information;
the table-cutting model is used to cut the tables without table lines out of the text.
In the above device, the line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
In the above device, the training data acquisition module cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
Compared with the prior art, the present invention has the following advantages:
The invention can replace rule-based methods and perform the table-cutting task for tables without table lines more conveniently and accurately. Its results are not affected by changes in the style of such tables, so it is widely applicable and can greatly improve the accuracy, cost, and efficiency of the task.
Brief description of the drawings
Fig. 1 is a flow chart of the method part of the present invention;
Fig. 2 is a structural block diagram of the device part of the present invention.
The main reference numerals are as follows:
1: text coordinate acquisition module; 2: text line cutting module; 3: text line parsing module; 4: training data acquisition module; 5: table-cutting model
Detailed description of the embodiments
As shown in Fig. 1, the present invention provides a method for cutting tables without table lines from text, comprising the following steps:
Step 1: perform line cutting on the text, and obtain the line feature information of each text line and the line-content semantic information of the first text line.
Step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form a plurality of text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line.
PDF parsing is performed on the text to obtain the text coordinates in the text.
The line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
Step 2: obtain the training data for the table-cutting model from the line feature information and the line-content semantic information.
The line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
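The patent does not name the type of the table-cutting model. Purely as a stand-in, the sketch below trains a perceptron on scaled line features to show how training data of this kind could drive a per-line classifier; the feature order (scaled gap above, scaled alignment), the learning rate, and the function names are assumptions, and a real system would likely use a stronger model.

```python
def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a perceptron; returns (weights, bias). Stand-in model only."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = label - predict((weights, bias), features)
            if error != 0:
                # Move the decision boundary towards the misclassified sample.
                weights = [w + lr * error * f for w, f in zip(weights, features)]
                bias += lr * error
    return weights, bias


def predict(model, features):
    """Return 1 (table line) or 0 (other line) for one feature vector."""
    weights, bias = model
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Each row: [scaled gap above, scaled alignment with the neighbouring line].
X = [[0.1, 1.0], [0.2, 0.8], [0.9, 0.0], [0.8, 0.1]]
y = [1, 1, 0, 0]  # 1 = line belongs to a table, 0 = it does not

model = train_perceptron(X, y)
predictions = [predict(model, row) for row in X]
```

On this tiny, linearly separable example the perceptron learns that tight spacing and strong alignment indicate a table line; the data are invented for the illustration.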
Step 3: cut the tables without table lines out of the text with the table-cutting model.
Step 3 comprises the following sub-steps:
Step 31: cut the text into lines using the identified text coordinates;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to a classification rule to obtain the range of the table without table lines;
Step 34: cut out the table without table lines according to the range.
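Step 34 can be pictured as slicing the text lines that fall inside a table range and reporting the region they occupy. The sketch below assumes that each text line is a dictionary with its text and bounding box and that a range is an inclusive pair of line indices; both conventions, and the `cut_table` name, are hypothetical.

```python
def cut_table(lines, table_range):
    """Return the rows and bounding box of the table covering table_range.

    table_range is an inclusive (first, last) pair of line indices.
    """
    first, last = table_range
    selected = lines[first:last + 1]
    bbox = (
        min(line["x0"] for line in selected),      # left
        min(line["top"] for line in selected),     # top
        max(line["x1"] for line in selected),      # right
        max(line["bottom"] for line in selected),  # bottom
    )
    return {"bbox": bbox, "rows": [line["text"] for line in selected]}


lines = [
    {"text": "Annual report", "x0": 50, "x1": 150, "top": 40, "bottom": 50},
    {"text": "Item  Year", "x0": 50, "x1": 225, "top": 80, "bottom": 90},
    {"text": "Revenue  2017", "x0": 50, "x1": 226, "top": 100, "bottom": 110},
    {"text": "Notes follow.", "x0": 50, "x1": 140, "top": 140, "bottom": 150},
]
table = cut_table(lines, (1, 2))
```

The bounding box could then be used to crop the page region, while the rows carry the table text for downstream processing.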
As shown in Fig. 2, the present invention also provides a device for cutting tables without table lines from text, comprising a text coordinate acquisition module 1, a text line cutting module 2, a text line parsing module 3, a training data acquisition module 4, and a table-cutting model 5.
The text coordinate acquisition module 1 is used to obtain the text coordinates in the text.
The text coordinate acquisition module is a PDF parsing module, which performs PDF parsing on the text to obtain the text coordinates in the text.
The text line cutting module 2 is used to cut the text into lines according to the text coordinates to form a plurality of text lines.
The text line parsing module 3 is used to parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line.
The line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
The training data acquisition module 4 is used to obtain the training data for the table-cutting model from the line feature information and the line-content semantic information.
The training data acquisition module 4 cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
The table-cutting model 5 is used to cut the tables without table lines out of the text.
The table-cutting model 5 is applied as follows:
cut the text into lines using the identified text coordinates;
determine the class of each text line with the table-cutting model;
merge the text lines according to a classification rule to obtain the range of the table without table lines;
cut out the table without table lines according to the range.
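The division of the device into five modules can be sketched as a small composition of callables. Everything in the block is hypothetical: the class name, the field names, and especially the stand-in lambdas, which fake a document as a list of strings and treat a tab as a column separator only so that the wiring can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TableCuttingDevice:
    get_coordinates: Callable      # module 1: text coordinate acquisition
    cut_lines: Callable            # module 2: text line cutting
    parse_lines: Callable          # module 3: text line parsing (features)
    build_training_data: Callable  # module 4: training data acquisition
    model: Callable                # module 5: table-cutting model

    def _features(self, document):
        lines = self.cut_lines(self.get_coordinates(document))
        return lines, self.parse_lines(lines)

    def training_data(self, document, labels):
        """Training path: features plus labels go to module 4."""
        _, features = self._features(document)
        return self.build_training_data(features, labels)

    def cut(self, document):
        """Inference path: keep the lines the model classifies as table lines."""
        lines, features = self._features(document)
        return [line for line, f in zip(lines, features) if self.model(f) == 1]


# Trivial stand-ins so the sketch runs; real modules would parse a PDF.
device = TableCuttingDevice(
    get_coordinates=lambda doc: list(enumerate(doc)),
    cut_lines=lambda coords: [text for _, text in coords],
    parse_lines=lambda lines: [{"n_cells": len(l.split("\t"))} for l in lines],
    build_training_data=lambda feats, labels: list(zip(feats, labels)),
    model=lambda f: 1 if f["n_cells"] >= 2 else 0,
)

document = ["Intro text", "Item\tYear", "Revenue\t2017"]
table_lines = device.cut(document)
samples = device.training_data(document, [0, 1, 1])
```

Keeping each module behind a plain callable makes it possible to swap the parser or the model without touching the rest of the pipeline, which matches the modular structure shown in Fig. 2.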
The foregoing is merely a preferred embodiment of the invention and is illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents may be made within the spirit and scope defined by the claims, and that all of them fall within the protection scope of the present invention.

Claims (9)

1. A method for cutting tables without table lines from text, comprising the following steps:
Step 1: perform line cutting on the text, and obtain the line feature information of each text line and the line-content semantic information of the first text line;
Step 2: obtain the training data for a table-cutting model from the line feature information and the line-content semantic information;
Step 3: cut the tables without table lines out of the text with the table-cutting model.
2. The method for cutting tables without table lines from text according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form a plurality of text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line.
3. The method for cutting tables without table lines from text according to claim 2, characterized in that PDF parsing is performed on the text to obtain the text coordinates in the text.
4. The method for cutting tables without table lines from text according to claim 3, characterized in that the line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
5. The method for cutting tables without table lines from text according to claim 1, characterized in that, in step 2, the line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
6. The method for cutting tables without table lines from text according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 31: cut the text into lines using the identified text coordinates;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to a classification rule to obtain the range of the table without table lines;
Step 34: cut out the table without table lines according to the range.
7. A device implementing the method for cutting tables without table lines from text according to claim 1, characterized by comprising a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module, and a table-cutting model;
the text coordinate acquisition module is used to obtain the text coordinates in the text;
the text line cutting module is used to cut the text into lines according to the text coordinates to form a plurality of text lines;
the text line parsing module is used to parse each text line to obtain the line feature information of each text line and the line-content semantic information of the first text line;
the training data acquisition module is used to obtain the training data for the table-cutting model from the line feature information and the line-content semantic information;
the table-cutting model is used to cut the tables without table lines out of the text.
8. The device according to claim 7, characterized in that the line feature information includes the spacing between text lines and the alignment relationship between them;
the line-content semantic information includes semantic text such as the table header and the subject of a text line.
9. The device according to claim 7, characterized in that the training data acquisition module cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
CN201811304121.XA 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text Active CN109284495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811304121.XA CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811304121.XA CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Publications (2)

Publication Number Publication Date
CN109284495A true CN109284495A (en) 2019-01-29
CN109284495B CN109284495B (en) 2023-02-07

Family

ID=65175391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811304121.XA Active CN109284495B (en) 2018-11-03 2018-11-03 Method and device for performing table-free line table cutting on text

Country Status (1)

Country Link
CN (1) CN109284495B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032718A (en) * 2019-04-12 2019-07-19 广州广燃设计有限公司 A kind of table conversion method, system and storage medium
CN110210440A (en) * 2019-06-11 2019-09-06 中国农业银行股份有限公司 A kind of form image printed page analysis method and system

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08185475A (en) * 1994-12-28 1996-07-16 Matsushita Electric Ind Co Ltd Picture recognition device
US20040093355A1 (en) * 2000-03-22 2004-05-13 Stinger James R. Automatic table detection method and system
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN103632388A (en) * 2013-12-19 2014-03-12 百度在线网络技术(北京)有限公司 Semantic annotation method, device and client for image
CN104094282A (en) * 2012-01-23 2014-10-08 微软公司 Borderless table detection engine
CN104268545A (en) * 2014-09-15 2015-01-07 同方知网(北京)技术有限公司 Method for table area recognition and content rasterization in electronic document layout files
CN104517112A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Table recognition method and system
CN105045769A (en) * 2015-06-01 2015-11-11 中国人民解放军装备学院 Structure recognition based Web table information extraction method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
CN105512611A (en) * 2015-11-25 2016-04-20 成都数联铭品科技有限公司 Detection and identification method for form image
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device
US20170329749A1 (en) * 2016-05-16 2017-11-16 Linguamatics Ltd. Extracting information from tables embedded within documents
CN107679024A (en) * 2017-09-11 2018-02-09 畅捷通信息技术股份有限公司 The method of identification form, system, computer equipment, readable storage medium storing program for executing
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08185475A (en) * 1994-12-28 1996-07-16 Matsushita Electric Ind Co Ltd Picture recognition device
US20040093355A1 (en) * 2000-03-22 2004-05-13 Stinger James R. Automatic table detection method and system
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN104094282A (en) * 2012-01-23 2014-10-08 微软公司 Borderless table detection engine
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN104517112A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Table recognition method and system
CN103632388A (en) * 2013-12-19 2014-03-12 百度在线网络技术(北京)有限公司 Semantic annotation method, device and client for image
CN104268545A (en) * 2014-09-15 2015-01-07 同方知网(北京)技术有限公司 Method for table area recognition and content rasterization in electronic document layout files
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device
CN105045769A (en) * 2015-06-01 2015-11-11 中国人民解放军装备学院 Structure recognition based Web table information extraction method
CN105426834A (en) * 2015-11-17 2016-03-23 中国传媒大学 Projection feature and structure feature based form image detection method
CN105512611A (en) * 2015-11-25 2016-04-20 成都数联铭品科技有限公司 Detection and identification method for form image
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
US20170329749A1 (en) * 2016-05-16 2017-11-16 Linguamatics Ltd. Extracting information from tables embedded within documents
CN107679024A (en) * 2017-09-11 2018-02-09 畅捷通信息技术股份有限公司 The method of identification form, system, computer equipment, readable storage medium storing program for executing
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ERMELINDA ORO;MASSIMO RUFFOLO: "PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents", 《2009 10TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION》 *
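卜飞宇 et al.: "Distinguishing tables from graphics in layout analysis", Computer Engineering and Applications (《计算机工程与应用》) *
房婧; 高良才; 仇睿恒; 汤帜: "Automatic table detection and performance evaluation for fixed-layout electronic documents", Journal of Peking University (Natural Science Edition) (《北京大学学报(自然科学版)》) *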

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032718A (en) * 2019-04-12 2019-07-19 广州广燃设计有限公司 A kind of table conversion method, system and storage medium
CN110032718B (en) * 2019-04-12 2023-04-18 广州广燃设计有限公司 Table conversion method, system and storage medium
CN110210440A (en) * 2019-06-11 2019-09-06 中国农业银行股份有限公司 A kind of form image printed page analysis method and system
CN110210440B (en) * 2019-06-11 2021-04-27 中国农业银行股份有限公司 Table image layout analysis method and system

Also Published As

Publication number Publication date
CN109284495B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
JP6710483B2 (en) Character recognition method for damages claim document, device, server and storage medium
CN108470021A (en) The localization method and device of table in PDF document
CN105468468B (en) Data error-correcting method towards question answering system and device
CN107392143B (en) Resume accurate analysis method based on SVM text classification
CN105868171B (en) A kind of method of calibration and device of Excel file
CN111666761B (en) Fine-grained emotion analysis model training method and device
WO2017051425A8 (en) A computer-implemented method and system for analyzing and evaluating user reviews
EP2849103A3 (en) An arrangement and a method for creating a synthesis from numerical data and textual information
CN109284495A (en) Method and device for cutting tables without table lines from text
CN105096016B (en) Print order automatic generation method and device
CN107193948A (en) Human-computer dialogue data analysing method and device
CN104915420B (en) Knowledge base data processing method and system
CN104142912A (en) Accurate corpus category marking method and device
CN108363943A (en) Clearance robot based on Weigh sensor technology
CN103488627B8 (en) Full piece patent document interpretation method and translation system
CN108009297A (en) Text emotion analysis method and system based on natural language processing
CN104347071A (en) Method and system for generating oral test reference answer
CN106055633A (en) Chinese microblog subjective and objective sentence classification method
CN111427996B (en) Method and device for extracting date and time from man-machine interaction text
ATE409937T1 (en) METHOD AND APPARATUS FOR SENDING VOICE DATA TO A REMOTE DEVICE IN A DISTRIBUTED VOICE RECOGNITION SYSTEM
CN109214009A (en) A kind of service dispatch repeats the work order text semantic method of vector analysis of incoming call
CN109472020A (en) A kind of feature alignment Chinese word cutting method
CN105719261A (en) Point cloud data combination system and method
CN108920955A (en) A kind of webpage back door detection method, device, equipment and storage medium
CN112818693A (en) Automatic extraction method and system for electronic component model words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant