CN109284495A - Method and device for cutting borderless tables out of text - Google Patents
Method and device for cutting borderless tables out of text
- Publication number
- CN109284495A (application CN201811304121.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- line
- row
- cutting
- cut
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a method for cutting borderless tables (tables without ruling lines) out of text, comprising: cutting the text into lines, and obtaining the line feature information of each text line and the line content semantic information of the first text line; obtaining training data for a table-cutting model from the line feature information and the line content semantic information; and cutting the borderless tables out of the text with the table-cutting model. A device implementing the method comprises a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module and a table-cutting model. The invention can replace rule-based methods and perform the borderless table cutting task more conveniently and accurately. Its results are not affected by changes in the style of borderless tables, so it is widely applicable and can greatly improve the accuracy, cost and efficiency of the borderless table cutting task.
Description
Technical field
The present invention relates to a text processing method, and in particular to a method and device for cutting borderless tables out of text.
Background
At present, the range of a table that has ruling lines can easily be determined from its frame lines. For a table without ruling lines, however, whether a region belongs to a table must be judged by jointly modelling two aspects: the image (the form of the table) and the semantics (the text content). Such judgments are difficult to capture completely in a single set of hand-written rules.
Summary of the invention
In view of the above shortcomings, the present invention provides a method and device for cutting borderless tables out of text that can accurately obtain the range of a borderless table.
To achieve the above object, the present invention provides a method for cutting borderless tables out of text, comprising the following steps:
Step 1: cut the text into lines, and obtain the line feature information of each text line and the line content semantic information of the first text line;
Step 2: obtain training data for the table-cutting model from the line feature information and the line content semantic information;
Step 3: cut the borderless tables out of the text with the table-cutting model.
In the above method, step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form multiple text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line.
In the above method, PDF parsing is performed on the text to obtain the text coordinates in the text.
In the above method, the line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
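The two kinds of information could be computed roughly as follows. This is a sketch under stated assumptions: the `TextLine` structure, the feature names, the alignment tolerance and the `HEADER_KEYWORDS` vocabulary are all hypothetical, since the patent only names spacing, alignment and header or subject text as inputs.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    top: float
    bottom: float
    x_starts: list  # left edge of each word or cell in the line


# Hypothetical header vocabulary; a real system would derive this from data.
HEADER_KEYWORDS = {"item", "amount", "total"}


def line_features(lines, align_tol=1.5):
    """Return one feature dict per text line."""
    feats = []
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i > 0 else None
        # Spacing: vertical gap between this line and the previous one.
        gap = line.top - prev.bottom if prev else 0.0
        # Alignment: how many word starts coincide with a start in the previous line.
        aligned = 0
        if prev:
            aligned = sum(
                any(abs(x - px) <= align_tol for px in prev.x_starts)
                for x in line.x_starts
            )
        # Semantics: does the line contain header-like words?
        tokens = set(line.text.lower().split())
        feats.append({
            "gap_above": gap,
            "aligned_starts": aligned,
            "has_header_keyword": bool(tokens & HEADER_KEYWORDS),
        })
    return feats


lines = [
    TextLine("Item Amount", 100, 110, [50, 300]),
    TextLine("Revenue 1,200", 114, 124, [50, 300.5]),
    TextLine("This paragraph follows.", 140, 150, [72]),
]
feats = line_features(lines)
```

In this example the second line sits close below the first and shares both column starts with it, while the trailing paragraph has a larger gap and no aligned starts.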
In the above method, in step 2, the line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
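The patent does not say what cleaning and preprocessing consist of. One plausible reading, offered only as an assumption, is to drop incomplete samples and rescale each numeric feature to the range [0, 1]. The function name `build_training_data` and the sample layout below are hypothetical:

```python
def build_training_data(samples, feature_names):
    """Turn (feature dict, label) pairs into a feature matrix X and a label list y."""
    # Cleaning (assumed rule): drop samples with a missing feature or a missing label.
    clean = [
        (f, label)
        for f, label in samples
        if label is not None and all(f.get(n) is not None for n in feature_names)
    ]
    if not clean:
        return [], []
    # Preprocessing (assumed rule): min-max scale every feature to [0, 1].
    lo = {n: min(float(f[n]) for f, _ in clean) for n in feature_names}
    hi = {n: max(float(f[n]) for f, _ in clean) for n in feature_names}
    X = []
    for f, _ in clean:
        row = []
        for n in feature_names:
            span = hi[n] - lo[n]
            row.append((float(f[n]) - lo[n]) / span if span else 0.0)
        X.append(row)
    y = [label for _, label in clean]
    return X, y


samples = [
    ({"gap_above": 4.0, "aligned_starts": 2}, 1),    # table line
    ({"gap_above": 16.0, "aligned_starts": 0}, 0),   # ordinary text line
    ({"gap_above": None, "aligned_starts": 1}, 1),   # dropped: missing feature
    ({"gap_above": 10.0, "aligned_starts": 1}, None),  # dropped: missing label
]
X, y = build_training_data(samples, ["gap_above", "aligned_starts"])
```

Scaling keeps features with different units (points of spacing versus counts of aligned starts) on a comparable footing before training.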
In the above method, step 3 comprises the following sub-steps:
Step 31: cut the text identified by the text coordinates into lines;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to classification rules to obtain the range of each borderless table;
Step 34: cut out the borderless table according to that range.
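Steps 33 and 34 can be sketched as follows. The patent does not spell out its classification rules, so the rule used here (consecutive lines labelled `"table"` form one table, and runs shorter than `min_rows` are discarded) is an assumption, as are the function names.

```python
def merge_table_ranges(labels, min_rows=2):
    """Merge consecutive 'table' labels into inclusive (start, end) index ranges."""
    ranges, start = [], None
    for i, label in enumerate(labels):
        if label == "table" and start is None:
            start = i  # a new run of table lines begins
        elif label != "table" and start is not None:
            if i - start >= min_rows:  # keep only runs that are long enough
                ranges.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the document.
    if start is not None and len(labels) - start >= min_rows:
        ranges.append((start, len(labels) - 1))
    return ranges


def cut_tables(lines, ranges):
    """Cut the text lines of each table out of the document."""
    return [lines[start:end + 1] for start, end in ranges]


labels = ["text", "table", "table", "table", "text",
          "table", "text", "table", "table"]
ranges = merge_table_ranges(labels)
tables = cut_tables(list("abcdefghi"), ranges)
```

With these inputs the isolated table line at index 5 is dropped, and two tables covering lines 1 to 3 and 7 to 8 are cut out.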
The present invention also provides a device for cutting borderless tables out of text, comprising a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module and a table-cutting model;
The text coordinate acquisition module is used to obtain the text coordinates in the text;
The text line cutting module is used to cut the text into lines according to the text coordinates to form multiple text lines;
The text line parsing module is used to parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line;
The training data acquisition module is used to obtain training data for the table-cutting model from the line feature information and the line content semantic information;
The table-cutting model is used to cut the borderless tables out of the text.
In the above device, the line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
In the above device, the training data acquisition module cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
Compared with the prior art, the present invention has the following advantages:
The invention can replace rule-based methods and perform the borderless table cutting task more conveniently and accurately. Its results are not affected by changes in the style of borderless tables, so it is widely applicable and can greatly improve the accuracy, cost and efficiency of the borderless table cutting task.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a structural block diagram of the device of the present invention.
The main reference numerals are as follows:
1 - text coordinate acquisition module; 2 - text line cutting module; 3 - text line parsing module; 4 - training data acquisition module; 5 - table-cutting model
Detailed description of the embodiments
As shown in Fig. 1, the present invention provides a method for cutting borderless tables out of text, comprising the following steps:
Step 1: cut the text into lines, and obtain the line feature information of each text line and the line content semantic information of the first text line.
Step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form multiple text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line.
PDF parsing is performed on the text to obtain the text coordinates in the text.
The line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
Step 2: obtain training data for the table-cutting model from the line feature information and the line content semantic information.
The line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
Step 3: cut the borderless tables out of the text with the table-cutting model.
Step 3 comprises the following sub-steps:
Step 31: cut the text identified by the text coordinates into lines;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to classification rules to obtain the range of each borderless table;
Step 34: cut out the borderless table according to that range.
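The description leaves the form of the table-cutting model open. Purely for illustration, the sketch below swaps in a small logistic regression trained by gradient descent as the per-line classifier of step 32. The choice of model, the learning rate, the epoch count and the toy feature vectors (scaled spacing and alignment values) are all assumptions rather than the patented model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the mean gradient.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 for a table line and 0 for an ordinary text line."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy training data: [scaled gap above, scaled aligned starts] -> is table line.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

Any classifier that maps a line's features to a class could take this role; the per-line labels it produces are what step 33 merges into table ranges.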
As shown in Fig. 2, the present invention also provides a device for cutting borderless tables out of text, comprising a text coordinate acquisition module 1, a text line cutting module 2, a text line parsing module 3, a training data acquisition module 4 and a table-cutting model 5.
The text coordinate acquisition module 1 is used to obtain the text coordinates in the text.
The text coordinate acquisition module is a PDF parsing module, which performs PDF parsing on the text to obtain the text coordinates in the text.
The text line cutting module 2 is used to cut the text into lines according to the text coordinates to form multiple text lines.
The text line parsing module 3 is used to parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line.
The line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
The training data acquisition module 4 is used to obtain training data for the table-cutting model from the line feature information and the line content semantic information.
The training data acquisition module 4 cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
The table-cutting model 5 is used to cut the borderless tables out of the text.
The table-cutting model 5 is applied as follows:
Cut the text identified by the text coordinates into lines;
Determine the class of each text line with the table-cutting model;
Merge the text lines according to classification rules to obtain the range of each borderless table;
Cut out the borderless table according to that range.
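The way the modules hand data to one another at inference time might be wired as below. This is a structural sketch only: the class name, the callable-based interfaces and the stub modules in the usage example are assumptions, and the training data acquisition module 4 is left out because it is only needed when the model is trained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TableCuttingDevice:
    get_coordinates: Callable      # module 1: document -> positioned words
    cut_lines: Callable            # module 2: words -> text lines
    parse_line_features: Callable  # module 3: text lines -> one feature set per line
    classify: Callable             # model 5: feature set -> True for a table line

    def run(self, document):
        words = self.get_coordinates(document)
        lines = self.cut_lines(words)
        features = self.parse_line_features(lines)
        labels = [self.classify(f) for f in features]
        # Merge consecutive table lines into tables (assumed merging rule).
        tables, current = [], []
        for line, label in zip(lines, labels):
            if label:
                current.append(line)
            elif current:
                tables.append(current)
                current = []
        if current:
            tables.append(current)
        return tables


# Usage with stub modules standing in for the real ones.
device = TableCuttingDevice(
    get_coordinates=lambda doc: doc,
    cut_lines=lambda words: list(words),
    parse_line_features=lambda lines: [{"cols": len(l.split())} for l in lines],
    classify=lambda f: f["cols"] >= 2,
)
result = device.run(["Intro", "A  1", "B  2", "Outro"])
```

Passing the modules in as callables keeps each one replaceable, for example swapping the PDF parsing module for another coordinate source without touching the rest.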
The foregoing is merely a preferred embodiment of the present invention and is illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications and even equivalents may be made within the spirit and scope defined by the claims, all of which fall within the protection scope of the present invention.
Claims (9)
1. A method for cutting borderless tables out of text, comprising the following steps:
Step 1: cut the text into lines, and obtain the line feature information of each text line and the line content semantic information of the first text line;
Step 2: obtain training data for a table-cutting model from the line feature information and the line content semantic information;
Step 3: cut the borderless tables out of the text with the table-cutting model.
2. The method for cutting borderless tables out of text according to claim 1, characterized in that step 1 comprises the following sub-steps:
Step 11: obtain the text coordinates in the text, and cut the text into lines according to the text coordinates to form multiple text lines;
Step 12: parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line.
3. The method for cutting borderless tables out of text according to claim 2, characterized in that PDF parsing is performed on the text to obtain the text coordinates in the text.
4. The method for cutting borderless tables out of text according to claim 3, characterized in that the line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
5. The method for cutting borderless tables out of text according to claim 1, characterized in that, in step 2, the line feature information is cleaned and preprocessed to generate the training data for the table-cutting model.
6. The method for cutting borderless tables out of text according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 31: cut the text identified by the text coordinates into lines;
Step 32: determine the class of each text line with the table-cutting model;
Step 33: merge the text lines according to classification rules to obtain the range of each borderless table;
Step 34: cut out the borderless table according to that range.
7. A device for implementing the method for cutting borderless tables out of text according to claim 1, characterized by comprising a text coordinate acquisition module, a text line cutting module, a text line parsing module, a training data acquisition module and a table-cutting model;
The text coordinate acquisition module is used to obtain the text coordinates in the text;
The text line cutting module is used to cut the text into lines according to the text coordinates to form multiple text lines;
The text line parsing module is used to parse each text line to obtain the line feature information of each text line and the line content semantic information of the first text line;
The training data acquisition module is used to obtain training data for the table-cutting model from the line feature information and the line content semantic information;
The table-cutting model is used to cut the borderless tables out of the text.
8. The device according to claim 7, characterized in that the line feature information includes the spacing between text lines and the alignment relationship between them;
The line content semantic information includes semantic text such as the table header and subject of the text line.
9. The device according to claim 7, characterized in that the training data acquisition module cleans and preprocesses the line feature information to generate the training data for the table-cutting model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811304121.XA CN109284495B (en) | 2018-11-03 | 2018-11-03 | Method and device for performing table-free line table cutting on text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811304121.XA CN109284495B (en) | 2018-11-03 | 2018-11-03 | Method and device for performing table-free line table cutting on text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109284495A (en) | 2019-01-29 |
CN109284495B (en) | 2023-02-07 |
Family
ID=65175391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811304121.XA Active CN109284495B (en) | 2018-11-03 | 2018-11-03 | Method and device for performing table-free line table cutting on text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284495B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110032718A (en) * | 2019-04-12 | 2019-07-19 | 广州广燃设计有限公司 | A kind of table conversion method, system and storage medium |
CN110210440A (en) * | 2019-06-11 | 2019-09-06 | 中国农业银行股份有限公司 | A kind of form image printed page analysis method and system |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08185475A (en) * | 1994-12-28 | 1996-07-16 | Matsushita Electric Ind Co Ltd | Picture recognition device |
US20040093355A1 (en) * | 2000-03-22 | 2004-05-13 | Stinger James R. | Automatic table detection method and system |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN102722475A (en) * | 2012-05-09 | 2012-10-10 | 深圳市万兴软件有限公司 | Method for converting form in portable document format (PDF) document into Excel form |
CN103632388A (en) * | 2013-12-19 | 2014-03-12 | 百度在线网络技术(北京)有限公司 | Semantic annotation method, device and client for image |
CN104094282A (en) * | 2012-01-23 | 2014-10-08 | 微软公司 | Borderless table detection engine |
CN104268545A (en) * | 2014-09-15 | 2015-01-07 | 同方知网(北京)技术有限公司 | Method for table area recognition and content rasterization in electronic document layout files |
CN104517112A (en) * | 2013-09-29 | 2015-04-15 | 北大方正集团有限公司 | Table recognition method and system |
CN105045769A (en) * | 2015-06-01 | 2015-11-11 | 中国人民解放军装备学院 | Structure recognition based Web table information extraction method |
CN105426834A (en) * | 2015-11-17 | 2016-03-23 | 中国传媒大学 | Projection feature and structure feature based form image detection method |
CN105512611A (en) * | 2015-11-25 | 2016-04-20 | 成都数联铭品科技有限公司 | Detection and identification method for form image |
CN105589841A (en) * | 2016-01-15 | 2016-05-18 | 同方知网(北京)技术有限公司 | Portable document format (PDF) document form identification method |
CN105988979A (en) * | 2015-02-16 | 2016-10-05 | 北京邮电大学 | Form extraction method and device based on PDF (Portable Document Format) file |
CN106156239A (en) * | 2015-04-27 | 2016-11-23 | 中国移动通信集团公司 | A kind of form abstracting method and device |
US20170329749A1 (en) * | 2016-05-16 | 2017-11-16 | Linguamatics Ltd. | Extracting information from tables embedded within documents |
CN107679024A (en) * | 2017-09-11 | 2018-02-09 | 畅捷通信息技术股份有限公司 | The method of identification form, system, computer equipment, readable storage medium storing program for executing |
CN108416279A (en) * | 2018-02-26 | 2018-08-17 | 阿博茨德(北京)科技有限公司 | Form analysis method and device in file and picture |
CN108470021A (en) * | 2018-03-26 | 2018-08-31 | 阿博茨德(北京)科技有限公司 | The localization method and device of table in PDF document |
-
2018
- 2018-11-03 CN CN201811304121.XA patent/CN109284495B/en active Active
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08185475A (en) * | 1994-12-28 | 1996-07-16 | Matsushita Electric Ind Co Ltd | Picture recognition device |
US20040093355A1 (en) * | 2000-03-22 | 2004-05-13 | Stinger James R. | Automatic table detection method and system |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN104094282A (en) * | 2012-01-23 | 2014-10-08 | 微软公司 | Borderless table detection engine |
CN102722475A (en) * | 2012-05-09 | 2012-10-10 | 深圳市万兴软件有限公司 | Method for converting form in portable document format (PDF) document into Excel form |
CN104517112A (en) * | 2013-09-29 | 2015-04-15 | 北大方正集团有限公司 | Table recognition method and system |
CN103632388A (en) * | 2013-12-19 | 2014-03-12 | 百度在线网络技术(北京)有限公司 | Semantic annotation method, device and client for image |
CN104268545A (en) * | 2014-09-15 | 2015-01-07 | 同方知网(北京)技术有限公司 | Method for table area recognition and content rasterization in electronic document layout files |
CN105988979A (en) * | 2015-02-16 | 2016-10-05 | 北京邮电大学 | Form extraction method and device based on PDF (Portable Document Format) file |
CN106156239A (en) * | 2015-04-27 | 2016-11-23 | 中国移动通信集团公司 | A kind of form abstracting method and device |
CN105045769A (en) * | 2015-06-01 | 2015-11-11 | 中国人民解放军装备学院 | Structure recognition based Web table information extraction method |
CN105426834A (en) * | 2015-11-17 | 2016-03-23 | 中国传媒大学 | Projection feature and structure feature based form image detection method |
CN105512611A (en) * | 2015-11-25 | 2016-04-20 | 成都数联铭品科技有限公司 | Detection and identification method for form image |
CN105589841A (en) * | 2016-01-15 | 2016-05-18 | 同方知网(北京)技术有限公司 | Portable document format (PDF) document form identification method |
US20170329749A1 (en) * | 2016-05-16 | 2017-11-16 | Linguamatics Ltd. | Extracting information from tables embedded within documents |
CN107679024A (en) * | 2017-09-11 | 2018-02-09 | 畅捷通信息技术股份有限公司 | The method of identification form, system, computer equipment, readable storage medium storing program for executing |
CN108416279A (en) * | 2018-02-26 | 2018-08-17 | 阿博茨德(北京)科技有限公司 | Form analysis method and device in file and picture |
CN108470021A (en) * | 2018-03-26 | 2018-08-31 | 阿博茨德(北京)科技有限公司 | The localization method and device of table in PDF document |
Non-Patent Citations (3)
Title |
---|
ERMELINDA ORO;MASSIMO RUFFOLO: "PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents", 《2009 10TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION》 * |
卜飞宇等: "版面分析中表格与图形的鉴别", 《计算机工程与应用》 * |
房婧; 高良才; 仇睿恒; 汤帜: "版式电子文档表格自动检测与性能评估", 《北京大学学报(自然科学版) 》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110032718A (en) * | 2019-04-12 | 2019-07-19 | 广州广燃设计有限公司 | A kind of table conversion method, system and storage medium |
CN110032718B (en) * | 2019-04-12 | 2023-04-18 | 广州广燃设计有限公司 | Table conversion method, system and storage medium |
CN110210440A (en) * | 2019-06-11 | 2019-09-06 | 中国农业银行股份有限公司 | A kind of form image printed page analysis method and system |
CN110210440B (en) * | 2019-06-11 | 2021-04-27 | 中国农业银行股份有限公司 | Table image layout analysis method and system |
Also Published As
Publication number | Publication date |
---|---|
CN109284495B (en) | 2023-02-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6710483B2 (en) | Character recognition method for damages claim document, device, server and storage medium | |
CN107204184B (en) | Audio recognition method and system | |
CN107392143A (en) | A kind of resume accurate Analysis method based on SVM text classifications | |
CN105868171B (en) | A kind of method of calibration and device of Excel file | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN106095834A (en) | Intelligent dialogue method and system based on topic | |
WO2017051425A8 (en) | A computer-implemented method and system for analyzing and evaluating user reviews | |
CN104636742B (en) | A kind of method by imaging automatic lock onto target topic and transmitting | |
CN106649239A (en) | Method and device for generating report in cloud monitoring system based on visualization | |
CN103377239A (en) | Method and device for calculating inter-textual similarity | |
CN109284495A (en) | Method and device for cutting borderless tables out of text |
CN105096016B (en) | Print order automatic generation method and device | |
CN107193948A (en) | Human-computer dialogue data analysing method and device | |
CN104915420B (en) | Knowledge base data processing method and system | |
CN104142912A (en) | Accurate corpus category marking method and device | |
CN109508373A (en) | Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index | |
CN110442853A (en) | Text positioning method, device, terminal and storage medium | |
CN103488627B8 (en) | Full piece patent document interpretation method and translation system | |
CN109902157A (en) | A kind of training sample validation checking method and device | |
CN108009297A (en) | Text emotion analysis method and system based on natural language processing | |
CN106055633A (en) | Chinese microblog subjective and objective sentence classification method | |
CN109214009A (en) | A kind of service dispatch repeats the work order text semantic method of vector analysis of incoming call | |
CN105719261A (en) | Point cloud data combination system and method | |
CN103440197B (en) | A kind of method automatically generating difference test report based on contrast test | |
CN111427996B (en) | Method and device for extracting date and time from man-machine interaction text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||