CN109284495A

CN109284495A - A method and device for cutting tables without table lines from text

Info

Publication number: CN109284495A
Application number: CN201811304121.XA
Authority: CN
Inventors: 李鹏辉; 竺晨曦; 邱锡鹏
Original assignee: Shanghai Rhinoceros Technology Co Ltd
Current assignee: Shanghai Rhinoceros Technology Co Ltd
Priority date: 2018-11-03
Filing date: 2018-11-03
Publication date: 2019-01-29
Anticipated expiration: 2038-11-03
Also published as: CN109284495B

Abstract

The present invention provides a method for cutting tables without table lines from text, comprising: cutting the text into rows, and obtaining the row feature information of each text line and the row content semantic information of the first text line; obtaining training data for a table-cutting model from the row feature information and the row content semantic information; and cutting the tables without table lines out of the text by means of the table-cutting model. A device implementing the method comprises a text coordinate obtaining module, a text line row-cutting module, a text line parsing module, a training data obtaining module and a table-cutting model. The invention can replace rule-based methods and perform the task of cutting tables without table lines more conveniently and accurately. Its effect is not impaired by changes in the style of such tables, so its applicability is high, and it can greatly improve the accuracy, cost and efficiency of the task.

Description

A method and device for cutting tables without table lines from text

Technical field

The present invention relates to a text processing method, and in particular to a method and device for cutting tables without table lines from text.

Background art

At present, the extent of a table that has table lines can easily be determined from its frame-line information. For a table without table lines, however, whether a region belongs to a table must be judged by jointly modelling two aspects: image (the form of the table) and semantics (the text content). These ways of identifying a table are difficult to write out completely as a set of rules.

Summary of the invention

In view of the shortcomings described above, the present invention provides a method and device for cutting tables without table lines from text that can accurately obtain the extent of such tables.

To achieve the above object, the present invention provides a method for cutting tables without table lines from text, comprising the following steps:

Step 1: cut the text into rows, and obtain the row feature information of each text line and the row content semantic information of the first text line;

Step 2: obtain training data for the table-cutting model from the row feature information and the row content semantic information;

Step 3: cut the tables without table lines out of the text by means of the table-cutting model.

In the above method, step 1 comprises the following sub-steps:

Step 11: obtain the text coordinates in the text, and cut the text into rows according to the text coordinates to form multiple text lines;

Step 12: parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line.
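By way of illustration only, the row cutting of step 11 might be sketched as grouping word boxes by vertical position. The `Word` type, the `cut_rows` name and the `y_tolerance` threshold are hypothetical; the patent does not specify how rows are formed from the text coordinates.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word with its bounding box (hypothetical layout; origin at top-left)."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    top: float     # distance of the upper edge from the top of the page
    bottom: float  # distance of the lower edge from the top of the page


def cut_rows(words, y_tolerance=2.0):
    """Group words into text lines by vertical centre.

    A word joins the current row when its vertical centre lies within
    y_tolerance of the centre of the word that opened that row.
    The tolerance value is an assumption, not taken from the patent.
    """
    rows = []
    ordered = sorted(words, key=lambda w: ((w.top + w.bottom) / 2, w.x0))
    for w in ordered:
        centre = (w.top + w.bottom) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= y_tolerance:
            rows[-1]["words"].append(w)
        else:
            rows.append({"centre": centre, "words": [w]})
    # Within each row, order the words from left to right.
    return [sorted(r["words"], key=lambda w: w.x0) for r in rows]
```

A fixed tolerance is the simplest choice; a real document with mixed font sizes would probably need a tolerance derived from the word heights.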

In the above method, PDF parsing is performed on the text to obtain the text coordinates in the text.

In the above method, the row feature information includes the spacing between text lines and the alignment relationships between them;

the row content semantic information includes semantic text such as the table header and the subject of a text line.
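As a rough sketch of the row feature information, the spacing and alignment of each text line relative to the line above it could be computed as follows. The feature names, the box layout `(x0, x1, top, bottom)` and the `align_tol` threshold are assumptions made for illustration; the semantic information (header, subject) is not covered here.

```python
def row_features(rows, align_tol=3.0):
    """Compute simple spacing/alignment features for each text line.

    rows: text lines ordered top to bottom; each line is a list of word
          boxes (x0, x1, top, bottom) with the origin at the top-left.
    Returns one dict per line with hypothetical feature names.
    """
    feats = []
    prev = None
    for row in rows:
        top = min(box[2] for box in row)
        bottom = max(box[3] for box in row)
        lefts = [box[0] for box in row]
        if prev is None:
            gap, aligned = 0.0, 0
        else:
            # Vertical distance between this line and the previous one.
            gap = top - prev["bottom"]
            # Number of words whose left edge matches a left edge above.
            aligned = sum(
                1 for x in lefts
                if any(abs(x - p) <= align_tol for p in prev["lefts"])
            )
        feats.append({
            "gap_above": gap,
            "n_cells": len(row),
            "aligned_with_prev": aligned,
        })
        prev = {"bottom": bottom, "lefts": lefts}
    return feats
```

Lines that belong to the same table would be expected to show a steady gap and several matching left edges, whereas a paragraph line typically has a single left edge.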

In the above method, in step 2, the row feature information is cleaned and pre-processed to generate the training data for the table-cutting model.
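The patent does not say what the cleaning and pre-processing consist of. One possible reading, sketched below under that caveat, drops samples with missing values or missing labels and rescales each feature to the range 0 to 1. The function name, the 0/1 label convention and the min-max scaling are all assumptions.

```python
import math


def build_training_data(samples):
    """Clean and pre-process (features, label) pairs.

    samples: list of (features, label); features is a list of numbers,
             label is 1 for a table row and 0 otherwise (assumed convention).
    Cleaning: discard samples with a missing/non-finite feature or a
              label other than 0 or 1.
    Pre-processing: min-max scale every feature column to [0, 1].
    """
    clean = [
        (list(f), y) for f, y in samples
        if y in (0, 1)
        and all(v is not None and math.isfinite(v) for v in f)
    ]
    if not clean:
        return [], []
    n = len(clean[0][0])
    lo = [min(f[i] for f, _ in clean) for i in range(n)]
    hi = [max(f[i] for f, _ in clean) for i in range(n)]
    X = [
        [(f[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
         for i in range(n)]
        for f, _ in clean
    ]
    y = [label for _, label in clean]
    return X, y
```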

In the above method, step 3 comprises the following sub-steps:

Step 31: cut into rows the text identified by the text coordinates;

Step 32: determine the class of each text line by means of the table-cutting model;

Step 33: merge the text lines according to a classification rule to obtain the extent of the table without table lines;

Step 34: cut out the table without table lines according to that extent.
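Steps 32 to 34 can be illustrated with a minimal sketch in which consecutive text lines classified as table rows are merged into one extent. The merging rule shown (runs of label 1, with a `min_rows` minimum length) is an assumption; the patent only refers to "a classification rule". The `classify` callable stands in for the table-cutting model.

```python
def merge_table_rows(labels, min_rows=2):
    """Merge consecutive rows labelled 1 into inclusive (start, end) ranges.

    Runs shorter than min_rows are discarded (an assumed rule).
    """
    ranges, start = [], None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i                      # a run of table rows begins
        elif label != 1 and start is not None:
            if i - start >= min_rows:
                ranges.append((start, i - 1))
            start = None                   # the run has ended
    if start is not None and len(labels) - start >= min_rows:
        ranges.append((start, len(labels) - 1))
    return ranges


def cut_tables(rows, classify, min_rows=2):
    """Classify each row, merge the table rows, and cut out each table."""
    labels = [classify(row) for row in rows]
    return [rows[s:e + 1] for s, e in merge_table_rows(labels, min_rows)]
```

For example, with rows given as plain strings and a toy classifier that treats a double space as a column separator, `cut_tables(["Intro", "Name  Price", "Apple  3.50", "Outro"], lambda r: 1 if "  " in r else 0)` returns the two middle rows as one table.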

The present invention also provides a device for cutting tables without table lines from text, comprising a text coordinate obtaining module, a text line row-cutting module, a text line parsing module, a training data obtaining module and a table-cutting model;

the text coordinate obtaining module is used to obtain the text coordinates in the text;

the text line row-cutting module is used to cut the text into rows according to the text coordinates to form multiple text lines;

the text line parsing module is used to parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line;

the training data obtaining module is used to obtain the training data for the table-cutting model from the row feature information and the row content semantic information;

the table-cutting model is used to cut the tables without table lines out of the text.

In the above device, the row feature information includes the spacing between text lines and the alignment relationships between them.

In the above device, the training data obtaining module cleans and pre-processes the row feature information to generate the training data for the table-cutting model.
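Purely as an illustration of how the five parts of the device might be wired together, the sketch below treats each module as a plain callable. The class name, field names and call signatures are hypothetical, and the sketch assumes the pre-processing step keeps one feature vector per text line.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TableCuttingDevice:
    """Hypothetical composition of the five parts of the device."""
    get_coordinates: Callable[[Any], List[Any]]   # text coordinate obtaining module
    cut_rows: Callable[[List[Any]], List[Any]]    # text line row-cutting module
    parse_rows: Callable[[List[Any]], List[Any]]  # text line parsing module
    prepare: Callable[[List[Any]], List[Any]]     # training data obtaining module
    model: Callable[[Any], int]                   # table-cutting model (1 = table row)

    def features(self, document):
        """Run the first four modules and return (rows, feature vectors)."""
        coords = self.get_coordinates(document)
        rows = self.cut_rows(coords)
        # Assumption: prepare() returns exactly one vector per row.
        return rows, self.prepare(self.parse_rows(rows))

    def label_rows(self, document):
        """Pair every text line with the class the model assigns to it."""
        rows, feats = self.features(document)
        return [(row, self.model(f)) for row, f in zip(rows, feats)]
```

Keeping the modules as interchangeable callables means, for instance, that a different PDF parser could be substituted for the coordinate module without touching the rest.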

Compared with the prior art, the present invention has the following advantages:

The invention can replace rule-based methods and perform the task of cutting tables without table lines more conveniently and accurately. Its effect is not impaired by changes in the style of such tables, so its applicability is high, and it can greatly improve the accuracy, cost and efficiency of the task.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a structural block diagram of the device of the present invention.

The main reference numerals are as follows:

1 - text coordinate obtaining module; 2 - text line row-cutting module; 3 - text line parsing module; 4 - training data obtaining module; 5 - table-cutting model

Detailed description of the embodiments

As shown in Fig. 1, the present invention provides a method for cutting tables without table lines from text, comprising the following steps:

Step 1: cut the text into rows, and obtain the row feature information of each text line and the row content semantic information of the first text line.

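Step 1 comprises the following sub-steps:

Step 11: obtain the text coordinates in the text, and cut the text into rows according to the text coordinates to form multiple text lines;

Step 12: parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line.

Here, PDF parsing is performed on the text to obtain the text coordinates in the text.

The row feature information includes the spacing between text lines and the alignment relationships between them; the row content semantic information includes semantic text such as the table header and the subject of a text line.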

Step 2: obtain training data for the table-cutting model from the row feature information and the row content semantic information.

Here, the row feature information is cleaned and pre-processed to generate the training data for the table-cutting model.

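Step 3: cut the tables without table lines out of the text by means of the table-cutting model.

Step 3 comprises the following sub-steps:

Step 31: cut into rows the text identified by the text coordinates;

Step 32: determine the class of each text line by means of the table-cutting model;

Step 33: merge the text lines according to a classification rule to obtain the extent of the table without table lines;

Step 34: cut out the table without table lines according to that extent.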

As shown in Fig. 2, the present invention also provides a device for cutting tables without table lines from text, comprising a text coordinate obtaining module 1, a text line row-cutting module 2, a text line parsing module 3, a training data obtaining module 4 and a table-cutting model 5.

The text coordinate obtaining module 1 is used to obtain the text coordinates in the text.

The text coordinate obtaining module 1 is a PDF parsing module, which performs PDF parsing on the text to obtain the text coordinates in the text.

The text line row-cutting module 2 is used to cut the text into rows according to the text coordinates to form multiple text lines.

The text line parsing module 3 is used to parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line.

The training data obtaining module 4 is used to obtain the training data for the table-cutting model from the row feature information and the row content semantic information.

The training data obtaining module 4 cleans and pre-processes the row feature information to generate the training data for the table-cutting model.

The table-cutting model 5 is used to cut the tables without table lines out of the text.

The table-cutting model 5 is applied in the following steps:

The text identified by the text coordinates is cut into rows;

The class of each text line is determined by the table-cutting model;

The text lines are merged according to a classification rule to obtain the extent of the table without table lines;

The table without table lines is cut out according to that extent.
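The patent does not name the type of model used to classify text lines. As a stand-in only, the sketch below trains a simple perceptron on numeric row features; any other classifier could take its place. The function names, the 0/1 label convention and the hyperparameters are assumptions.

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Train a perceptron as a stand-in table-row classifier.

    X: list of feature vectors (lists of floats), one per text line.
    y: list of labels, 1 for a table row and 0 otherwise.
    Returns (weights, bias).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            pred = 1 if score > 0 else 0
            err = yi - pred
            if err:
                # Standard perceptron update on a misclassified sample.
                w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
                b += lr * err
    return w, b


def predict(model, x):
    """Return 1 (table row) or 0 (not a table row) for one feature vector."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
```

A perceptron only separates classes that are linearly separable in the chosen features, which is one reason a practical system would likely use a richer model that can also take the semantic information into account.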

The foregoing is merely a preferred embodiment of the invention and is illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications and even equivalents may be made within the spirit and scope defined by the claims, all of which fall within the scope of protection of the present invention.

Claims

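1. A method for cutting tables without table lines from text, comprising the following steps:

Step 1: cut the text into rows, and obtain the row feature information of each text line and the row content semantic information of the first text line;

Step 2: obtain training data for the table-cutting model from the row feature information and the row content semantic information;

Step 3: cut the tables without table lines out of the text by means of the table-cutting model.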

2. The method for cutting tables without table lines from text according to claim 1, characterized in that step 1 comprises the following sub-steps:

Step 11: obtain the text coordinates in the text, and cut the text into rows according to the text coordinates to form multiple text lines;

Step 12: parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line.

3. The method for cutting tables without table lines from text according to claim 2, characterized in that PDF parsing is performed on the text to obtain the text coordinates in the text.

4. The method for cutting tables without table lines from text according to claim 3, characterized in that the row feature information includes the spacing between text lines and the alignment relationships between them.

5. The method for cutting tables without table lines from text according to claim 1, characterized in that, in step 2, the row feature information is cleaned and pre-processed to generate the training data for the table-cutting model.

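6. The method for cutting tables without table lines from text according to claim 1, characterized in that step 3 comprises the following sub-steps:

Step 31: cut into rows the text identified by the text coordinates;

Step 32: determine the class of each text line by means of the table-cutting model;

Step 33: merge the text lines according to a classification rule to obtain the extent of the table without table lines;

Step 34: cut out the table without table lines according to that extent.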

7. A device for implementing the method for cutting tables without table lines from text according to claim 1, characterized in that it comprises a text coordinate obtaining module, a text line row-cutting module, a text line parsing module, a training data obtaining module and a table-cutting model;

the text line parsing module is used to parse each text line to obtain the row feature information of each text line and the row content semantic information of the first text line;

the training data obtaining module is used to obtain the training data for the table-cutting model from the row feature information and the row content semantic information.

8. The device according to claim 7, characterized in that the row feature information includes the spacing between text lines and the alignment relationships between them.

9. The device according to claim 7, characterized in that the training data obtaining module cleans and pre-processes the row feature information to generate the training data for the table-cutting model.