CN109284495B

CN109284495B - Method and device for performing table-free line table cutting on text

Info

Publication number: CN109284495B
Application number: CN201811304121.XA
Authority: CN
Inventors: 李鹏辉; 竺晨曦; 邱锡鹏
Original assignee: Shanghai Alphainsight Technology Co ltd
Current assignee: Shanghai Alphainsight Technology Co ltd
Priority date: 2018-11-03
Filing date: 2018-11-03
Publication date: 2023-02-07
Anticipated expiration: 2038-11-03
Also published as: CN109284495A

Abstract

The invention provides a method for performing table-free line table cutting on a text, comprising the following steps: cutting the text into lines, and obtaining line characteristic information of each text line and line content semantic information of a first text line; obtaining training data for a table cutting model according to the line characteristic information and the line content semantic information; and cutting out tables without table lines from the text by means of the table cutting model. The device for implementing the method comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model. The invention can replace rule-based methods and perform the task of cutting out tables without table lines more conveniently and accurately; its effect is not degraded by variations in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency, and reduce the cost, of the table cutting task.

Description

Method and device for performing table-free line table cutting on text

Technical Field

The invention relates to a text processing method, in particular to a method and a device for performing table-free line table cutting on a text.

Background

At present, for a table with table lines, the range of the table can easily be determined from the wire-frame information. For a table without table lines, however, whether a region of text belongs to a table must be determined by modeling both the image (the table's shape) and the semantics (the text content). Such judgments are difficult to capture completely in a set of hand-written rules.

Disclosure of Invention

In view of the above shortcomings, the invention provides a method and a device for performing table-free line table cutting on a text, which can obtain an accurate range for a table without table lines.

In order to achieve the above object, the present invention provides a method for performing table-free line table cutting on a text, comprising the following steps:

step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line;

step 2, obtaining training data of a table cutting model according to the line characteristic information and the line content semantic information;

and step 3, cutting out a table without table lines from the text through the table cutting model.

In the above method for performing table-free line table cutting on a text, step 1 includes the following sub-steps:

step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines;

and step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of the first text line.
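Step 11 can be illustrated with a minimal sketch. The source does not specify how characters are grouped into lines, so the baseline-tolerance heuristic, the `Char` record, and the `y_tolerance` parameter below are assumptions made only for illustration; coordinates are assumed to follow the PDF convention in which y grows upward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """One character with its bounding box (hypothetical record type)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # bottom edge
    y1: float  # top edge


def cut_lines(chars: List[Char], y_tolerance: float = 2.0) -> List[List[Char]]:
    """Group characters into text lines by their bottom edge (assumed heuristic)."""
    lines: List[List[Char]] = []
    # Visit characters from the top of the page downward, then left to right.
    for ch in sorted(chars, key=lambda c: (-c.y0, c.x0)):
        # A character joins the current line if its bottom edge is close
        # to the bottom edge of the first character of that line.
        if lines and abs(lines[-1][0].y0 - ch.y0) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Restore reading order inside each line.
    return [sorted(line, key=lambda c: c.x0) for line in lines]
```

Each returned inner list is one text line, ordered from the top of the page to the bottom, which is the input that step 12 would then analyze.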

In the above method for performing table-free line table cutting on a text, PDF analysis is performed on the text to obtain the character coordinates in the text.

In the above method for performing table-free line table cutting on a text, the line characteristic information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;

the line content semantic information comprises the header of a text line and subject-related semantic text.
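The line characteristic information described above can be sketched as follows. This is an illustration under stated assumptions, not the patented feature set: the `TextLine` record, the feature names (`gap_above`, `align_above`, and so on), and the tolerance `tol` are hypothetical, and alignment is approximated as the fraction of segment left edges that coincide with a left edge in the neighbouring line. Coordinates are assumed to grow upward, with lines listed from top to bottom.

```python
from typing import Dict, List, NamedTuple, Optional


class TextLine(NamedTuple):
    top: float
    bottom: float
    lefts: List[float]  # left x-coordinate of each text segment in the line


def aligned_fraction(a: TextLine, b: TextLine, tol: float = 1.5) -> float:
    """Fraction of a's segment left edges that match some left edge in b."""
    if not a.lefts:
        return 0.0
    hits = sum(any(abs(x - y) <= tol for y in b.lefts) for x in a.lefts)
    return hits / len(a.lefts)


def line_features(lines: List[TextLine], tol: float = 1.5) -> List[Dict[str, Optional[float]]]:
    """Compute spacing and vertical-alignment features for every line."""
    feats = []
    for i, line in enumerate(lines):
        prev = lines[i - 1] if i > 0 else None
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        feats.append({
            # Vertical distance to the neighbouring lines (None at page edges).
            "gap_above": prev.bottom - line.top if prev is not None else None,
            "gap_below": line.bottom - nxt.top if nxt is not None else None,
            # Column alignment with the neighbouring lines.
            "align_above": aligned_fraction(line, prev, tol) if prev is not None else None,
            "align_below": aligned_fraction(line, nxt, tol) if nxt is not None else None,
        })
    return feats
```

Rows of a table without table lines would typically show small, regular gaps and high alignment values, which is the kind of signal a line classifier could learn from.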

In the above method for performing table-free line table cutting on a text, in step 2, the line feature information is cleaned and preprocessed to generate training data of a table cutting model.
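The source does not say what the cleaning and preprocessing consist of. The sketch below assumes, purely for illustration, that cleaning drops lines flagged as blank and that preprocessing fills missing values and scales gaps to the range 0 to 1; the feature keys, the `is_blank` flag, and `max_gap` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple


def build_training_data(
    feature_rows: Sequence[Dict],
    labels: Sequence[int],
    max_gap: float = 50.0,
) -> Tuple[List[List[float]], List[int]]:
    """Turn per-line features and labels (1 = table line) into (X, y)."""
    if len(feature_rows) != len(labels):
        raise ValueError("one label is required per text line")
    X: List[List[float]] = []
    y: List[int] = []
    for row, label in zip(feature_rows, labels):
        # Cleaning (assumed): skip lines flagged as blank.
        if row.get("is_blank", False):
            continue
        vec: List[float] = []
        for key in ("gap_above", "gap_below"):
            gap = row.get(key)
            # Preprocessing (assumed): a missing neighbour counts as far away;
            # gaps are clipped to [0, max_gap] and scaled to [0, 1].
            gap = max_gap if gap is None else min(max(gap, 0.0), max_gap)
            vec.append(gap / max_gap)
        for key in ("align_above", "align_below"):
            # Missing alignment values are treated as "not aligned".
            vec.append(float(row.get(key) or 0.0))
        X.append(vec)
        y.append(int(label))
    return X, y
```

The resulting `X` and `y` lists have the shape that most supervised line classifiers accept; which model the table cutting model actually uses is not stated in the source.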

In the above method for performing table-free line table cutting on a text, step 3 includes the following sub-steps:

step 31, cutting lines of the text identified by the character coordinates;

step 32, judging the category of each text line through a cutting table model;

step 33, merging the text lines through a classification rule to obtain the range in which the table without table lines is located;

and step 34, cutting out the table without the table line according to the range.
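Steps 33 and 34 can be sketched as follows, given per-line categories already produced in step 32. The classification rule itself is not disclosed in the source; the rule assumed here (merge runs of lines labelled as table lines, optionally bridging up to `max_gap` non-table lines, and discard runs shorter than `min_rows`) and both parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def merge_table_ranges(
    labels: Sequence[int], max_gap: int = 0, min_rows: int = 2
) -> List[Tuple[int, int]]:
    """Merge line labels (1 = table line) into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    start = last = None
    for i, is_table in enumerate(labels):
        if not is_table:
            continue
        if start is not None and i - last - 1 <= max_gap:
            last = i  # extend the current range
        else:
            # Close the previous range if it is long enough, then open a new one.
            if start is not None and last - start + 1 >= min_rows:
                ranges.append((start, last))
            start = last = i
    if start is not None and last - start + 1 >= min_rows:
        ranges.append((start, last))
    return ranges


def cut_tables(lines: Sequence, ranges: Sequence[Tuple[int, int]]) -> List[list]:
    """Cut out the lines of each table according to its range."""
    return [list(lines[s:e + 1]) for s, e in ranges]
```

Allowing a small `max_gap` is one way to tolerate an occasional misclassified line inside a table; whether the original method does this is not stated.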

The invention also provides a device for table-free line table cutting of the text, which comprises a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;

the character coordinate acquisition module is used for acquiring character coordinates in the text;

the text line cutting module is used for cutting the text line according to the character coordinates to form a plurality of text lines;

the text line analysis module is used for analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line;

the training data acquisition module is used for acquiring training data of the table cutting model according to the line characteristic information and the line content semantic information;

and the table cutting model is used for cutting out a table without table lines in the text.
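The way the five components listed above fit together can be sketched as a simple composition. The class name, the field names, and the callable signatures below are assumptions chosen for illustration; the source only names the modules and their purposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class TableCuttingDevice:
    """Hypothetical wiring of the modules; each field is a pluggable callable."""
    get_char_coords: Callable[[Any], List[Any]]       # character coordinate acquisition module
    cut_text_lines: Callable[[List[Any]], List[Any]]  # text line cutting module
    parse_lines: Callable[[List[Any]], List[Any]]     # text line analysis module
    make_training_data: Callable[[List[Any], Sequence[int]], Any]  # training data acquisition module
    table_model: Callable[[List[Any]], List[Tuple[int, int]]]      # table cutting model

    def _features(self, document: Any) -> Tuple[List[Any], List[Any]]:
        chars = self.get_char_coords(document)
        lines = self.cut_text_lines(chars)
        return lines, self.parse_lines(lines)

    def training_data(self, document: Any, labels: Sequence[int]) -> Any:
        """Training path: features plus per-line labels."""
        _, feats = self._features(document)
        return self.make_training_data(feats, labels)

    def cut_tables(self, document: Any) -> List[List[Any]]:
        """Inference path: the model returns inclusive line ranges of tables."""
        lines, feats = self._features(document)
        return [lines[s:e + 1] for s, e in self.table_model(feats)]
```

Keeping each module behind a callable makes it possible to swap, for example, the PDF parser or the classifier without touching the rest of the pipeline.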

In the above apparatus, the line characteristic information includes the spacing between text lines and the alignment relationship between vertically adjacent text lines;

the line content semantic information comprises the header of a text line and subject-related semantic text.

In the above apparatus, the training data acquisition module cleans and preprocesses the line characteristic information to generate the training data of the table cutting model.

Compared with the prior art, the invention has the following advantages:

the invention can replace rule-based methods and perform the task of cutting out tables without table lines more conveniently and accurately; its effect is not degraded by variations in the style of such tables, it is widely applicable, and it can greatly improve the accuracy and efficiency, and reduce the cost, of the table cutting task.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram showing the structure of the device of the present invention.

The main reference numerals are explained below:

1 - character coordinate acquisition module; 2 - text line cutting module; 3 - text line analysis module; 4 - training data acquisition module; 5 - table cutting model

Detailed Description

As shown in fig. 1, the present invention provides a method for performing table-free line table cutting on a text, which comprises the following steps:

step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line.

Wherein, in step 1, the following substeps are included:

PDF analysis is performed on the text to obtain the character coordinates in the text.

The line characteristic information comprises the spacing between text lines and the alignment relationship between vertically adjacent text lines.

Step 2, obtaining training data of the table cutting model according to the line characteristic information and the line content semantic information.

And cleaning and preprocessing the line characteristic information to generate training data of the table cutting model.

Wherein, in step 3, the following substeps are included:

step 31, cutting lines of the text identified by the character coordinates;

step 32, judging the category of each text line through a table cutting model;

As shown in FIG. 2, the present invention further provides a device for performing table-free line table cutting on a text, which includes a character coordinate acquisition module 1, a text line cutting module 2, a text line analysis module 3, a training data acquisition module 4, and a table cutting model 5.

The character coordinate acquiring module 1 is used for acquiring character coordinates in a text.

The character coordinate acquisition module is a PDF analysis module and is used for performing PDF analysis on the text to acquire character coordinates in the text.

The text line cutting module 2 is used for cutting the text line according to the character coordinates to form a plurality of text lines.

The text line parsing module 3 is configured to parse each text line to obtain line feature information of each text line and line content semantic information of a first text line.

The line characteristic information comprises the spacing between text lines and the alignment relationship between vertically adjacent text lines.

The training data obtaining module 4 is configured to obtain training data of the table-cutting model according to the line feature information and the line content semantic information.

The training data acquisition module 4 cleans and preprocesses the line characteristic information to generate training data of the table cutting model.

The table cutting model 5 is used to cut out a table without table lines from the text.

The table cutting model 5 is implemented as follows:

cutting lines of the text identified by the character coordinates;

judging the category of each text line through the table cutting model;

merging the text lines through a classification rule to obtain the range in which the table without table lines is located;

cutting out the table without table lines according to the range.

The foregoing is merely a preferred embodiment of this invention, which is intended to be illustrative, and not limiting. It will be understood by those skilled in the art that various changes, modifications and equivalents may be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for performing table-free line table cutting on a text comprises the following steps:

step 1, cutting lines of a text, and obtaining line characteristic information of each text line and line content semantic information of a first text line; the step 1 comprises the following substeps:

step 11, obtaining character coordinates in the text, and cutting lines of the text according to the character coordinates to form a plurality of text lines; performing PDF analysis on the text to obtain character coordinates in the text;

step 12, analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;

step 3, cutting out a table without table lines from the text through a table cutting model; in the step 3, the following substeps are included:

step 31, cutting lines of the text identified by the character coordinates;

step 32, judging the category of each text line through a table cutting model;

2. The method of claim 1, wherein in step 2, the line feature information is cleaned and preprocessed to generate training data of the table-cutting model.

3. The device for implementing the method for performing the table-free line table cutting on the text according to claim 1 is characterized by comprising a character coordinate acquisition module, a text line cutting module, a text line analysis module, a training data acquisition module and a table cutting model;

the text line analysis module is used for analyzing each text line to obtain line characteristic information of each text line and line content semantic information of a first text line; the line characteristic information comprises the distance interval between each text line and the alignment relation between the upper part and the lower part; the line content semantic information comprises a header of a text line and a semantic text in the aspect of subject;

4. The apparatus of claim 3, wherein the training data acquisition module cleans and preprocesses the line characteristic information to generate training data for the table cutting model.