CN115546809A - Table structure identification method based on cell constraint and application thereof - Google Patents


Info

Publication number
CN115546809A
CN115546809A (application CN202211506901.9A / CN202211506901A)
Authority
CN
China
Prior art keywords
cell
information
row
cells
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211506901.9A
Other languages
Chinese (zh)
Inventor
毛云青
雷卓
彭大蒙
张香伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202211506901.9A priority Critical patent/CN115546809A/en
Publication of CN115546809A publication Critical patent/CN115546809A/en
Withdrawn legal-status Critical Current

Classifications

    • G06V30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables (under G06V30/41 Analysis of document content; G06V30/40 Document-oriented image-based pattern recognition)
    • G06N3/08: Learning methods (under G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models)
    • G06V30/1444: Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields (under G06V30/14 Image acquisition; G06V30/10 Character recognition)
    • G06V30/19107: Clustering techniques (under G06V30/191 Design or setup of recognition systems or techniques; G06V30/19 Recognition using electronic means)
    • G06V30/19147: Obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V30/19173: Classification techniques
    • G06V2201/07: Target detection (under G06V2201/00 Indexing scheme relating to image or video recognition or understanding)


Abstract

The application provides a table structure identification method based on cell constraints, and an application thereof, comprising the following steps: inputting table images of different types into corresponding target detection models according to table type, so as to output the position information of each table; acquiring whole-row and whole-column information of the table from its position information, the information comprising the coordinates of any two non-adjacent vertices of each row and each column; identifying all text and text borders in the table; obtaining the table's structure information through a machine translation method; correcting cells in the table according to the table's position information, its whole-row and whole-column information, and its structure information; and clustering the coordinates of the corrected cells to generate the table's cell structure. The method can be combined with any optical character recognition method, and the table structure can be parsed end to end from an electronic document through a complete table recognition pipeline.

Description

Table structure identification method based on cell constraint and application thereof
Technical Field
The application relates to the technical field of table identification, in particular to a table structure identification method based on cell constraint and application thereof.
Background
In practical applications, the key information of an electronic document is usually summarized in tables. Since widely used electronic document formats do not retain complete table structure information, an accurate table structure recognition technique is required so that a system can extract the table contents for subsequent analysis.
Current table recognition techniques comprise table detection and cell structure recognition. Table detection locates the boundaries of a table; cell structure recognition detects the cells within a table and resolves their logical relationships (e.g., identifying all cells in a given row or column). It is difficult for an automated system to accurately reconstruct table boundaries and cell structure from an electronic document, because the style, layout, and content of tables vary widely. Deep learning methods, which process the image directly, can be applied to any electronic document that can be rendered as an image, and achieve better results than conventional methods.
Although prior art such as CN114529773A discloses a table identification method, system, terminal and medium based on structural units, and likewise applies deep learning to images to identify tables, it still depends excessively on existing general-purpose target detection methods and cannot fully adapt to the unique properties of tables. Recent deep-learning-based studies fall largely into two directions:
(1) Outputting the table structure as text by a machine translation method. This end-to-end approach is still not accurate enough in its overall structural representation;
(2) Generating the table structure from objects detected in the table. This object-detection-based approach can detect whole rows and columns separately and represent each intersection of a detected row and column as a cell, but it has limitations for complex table structures: it cannot handle cells that span multiple rows or columns, nor align cells across the whole table.
Therefore, a table structure identification method based on cell constraint and an application thereof, which can analyze a table structure more accurately so as to obtain table content logic information, are urgently needed to solve the problems in the prior art.
Disclosure of Invention
The embodiments of the application provide a table structure identification method based on cell constraints, and an application thereof, aimed at the problems of the prior art: strong limitations, inaccuracy, and the inability to handle spanning cells or to align the whole table.
The core technology of the invention is a table information extraction framework composed of several neural networks, in which each main target detection network takes the outputs of the other networks as input. The framework is applicable to any type of object detection network. Based on the objective fact that a table must contain cells, the table boundary network uses a cell detection network to detect table boundaries, and the cell structure recognition network recognizes cell structures from the table boundaries and table style.
In a first aspect, the present application provides a table structure identification method based on cell constraints, the method comprising the steps of:
S00, inputting table images of different types into corresponding target detection models according to table type, so as to output the position information of each table; the table position information comprises the coordinates of any two non-adjacent vertices of the table, and table types are divided into framed tables and frameless tables;
S10, acquiring whole-row and whole-column information of the table according to the table's position information, the information comprising the coordinates of any two non-adjacent vertices of each row and each column;
S20, identifying all text and text borders in the table;
S30, obtaining the table's structure information through a machine translation method, the structure information comprising the numbers of rows and columns and the layout information;
S40, correcting the cells in the table according to the table's position information, its whole-row and whole-column information, and its structure information, where correction comprises completing cells, merging cells, and removing cells;
and S50, clustering the coordinates of the cells in the corrected table to generate the table's cell structure.
Further, in step S00, the boundary of the table is obtained by calculating the difference between the number of cells outside a candidate boundary and the number of cells inside it, and that boundary is used as the table's frame, thereby outputting the table's position information.
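The cell-count criterion above can be sketched as follows (a simplified illustration; the box format, the center-point membership test, and the scoring function are assumptions, not the patented formulas):

```python
def inside(cell, box):
    """True if the cell's center lies within the candidate table box.
    Boxes are (x1, y1, x2, y2); membership-by-center is an assumption."""
    cx = (cell[0] + cell[2]) / 2
    cy = (cell[1] + cell[3]) / 2
    return box[0] <= cx <= box[2] and box[1] <= cy <= box[3]

def boundary_score(box, cells):
    """Cells outside minus cells inside: lower is better for a table candidate."""
    n_in = sum(inside(c, box) for c in cells)
    return (len(cells) - n_in) - n_in

def best_table_box(candidates, cells):
    """Pick the candidate bounding box with the lowest score."""
    return min(candidates, key=lambda b: boundary_score(b, cells))
```

A candidate that encloses all detected cells scores lowest, matching the intuition that cells cannot exist outside a table.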
Further, in step S30, the table image is aligned with the text positions according to its cell positions, and is sampled in the vertical and horizontal directions respectively to determine the numbers of rows and columns of the table.
Further, in step S30, before sampling in the vertical direction, the left and right edges of each cell are extended, without overlapping existing cells, to find rows that lack cells; after the numbers of rows and columns of the table are determined, the vertical and horizontal alignment of the table is deduced, as layout information, by finding the cell borders closest to complete alignment with the known cells.
Further, in step S40, cell borders are clustered to determine the position of each row and each column, and row and column positions are assigned according to cell border positions so as to merge cells.
Further, in step S40, the logical relationships of the cells are determined from their text content so as to merge cells again; cell positions are assigned to any remaining text boxes that do not overlap a detected cell, and cells with nearby gaps are split.
Further, in step S40, when the border of a text region intersects an adjacent empty row or column, the row or column span of the cell is increased.
In a second aspect, the present application provides a table structure recognition apparatus based on cell constraint, including:
the input module, used for inputting table images of different types into the corresponding target detection models according to table type;
the target detection module, used for outputting the position information of each table from the input table image through the corresponding target detection model; the table position information comprises the coordinates of any two non-adjacent vertices of the table, and table types are divided into framed tables and frameless tables;
the row and column information acquisition module, used for acquiring whole-row and whole-column information of the table according to the table's position information, the information comprising the coordinates of any two non-adjacent vertices of each row and each column;
the text recognition module, used for recognizing all text and text borders in the table;
the machine translation module, used for acquiring the table's structure information through a machine translation method, the structure information comprising the numbers of rows and columns and the layout information;
the correction module, used for correcting the cells in the table according to the table's position information, its whole-row and whole-column information, and its structure information, where correction comprises completing cells, merging cells, and removing cells;
and the clustering output module, used for clustering the coordinates of the cells in the corrected table to generate and output the table's cell structure.
In a third aspect, the present application provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the above table structure identification method based on cell constraint.
In a fourth aspect, the present application provides a readable storage medium in which a computer program is stored, the computer program comprising program code for controlling a process to execute the above table structure identification method based on cell constraints.
The main contributions and innovations of the invention are as follows: 1. Compared with the prior art, the method fully exploits the characteristics of tables (a table generally contains a certain number of cells, and cells cannot exist outside a table) and whole-table information; it provides a layered network system and, based on that system, a method that parses the table structure more accurately so as to obtain the logical information of the table contents;
2. Compared with the prior art, the method can be combined with any optical character recognition method; through a complete table recognition pipeline, the table structure (table position, cell positions, and the logical relations among cells) is parsed end to end from the electronic document, making subsequent content analysis easier and more effective.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a table structure identification method based on cell constraints according to an embodiment of the present application;
FIG. 2 is a flow chart of a preferred embodiment of the present application;
fig. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
Conventional table recognition techniques are based on rules or statistics; they are often not universally applicable to different types of electronic documents and rely heavily on manually designed features (e.g., ruling lines or bold fonts). Deep-learning-based table identification methods, on the other hand, require large-scale data sets, depend excessively on existing general-purpose target detection methods, and cannot fully adapt to the unique properties of tables.
Based on this, the present application solves the problems of the prior art using image recognition. In practice, some tables have visible vertical and horizontal lines delimiting rows and columns, so the cell borders are easy to determine. Other tables have only dashed lines or no ruling lines at all, so observing only the local neighborhood cannot determine whether a line marks a new cell. In the present application, the vertical lines in the table are first classified by an attribute network; the output of the attribute network then determines which cell detection network is used. A frameless table has almost no horizontal or vertical lines, while a framed table forms cells by intersecting horizontal and vertical lines. The framed-table network model is trained on original framed tables, and the frameless-table network model adds augmented data on top of the framed data (all or part of the frame is removed, turning framed tables into frameless ones and thereby adding frameless training samples).
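The frame-removal augmentation described above (turning framed tables into frameless training samples) might look like this minimal sketch, assuming a grayscale image and treating any fully dark pixel row or column as a ruling line:

```python
def remove_ruling_lines(img, dark=0.5, white=1.0):
    """Synthesize a 'frameless' training sample from a framed table image.
    img: grayscale image as a list of rows of floats in [0, 1].
    Any full row/column of dark pixels is treated as a ruling line and
    erased.  The full-row/full-column heuristic is a simplification for
    illustration; a real augmentation pipeline would be more robust."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        if all(img[y][x] <= dark for x in range(w)):   # horizontal line
            out[y] = [white] * w
    for x in range(w):
        if all(img[y][x] <= dark for y in range(h)):   # vertical line
            for y in range(h):
                out[y][x] = white
    return out
```

Removing only part of the detected lines (rather than all of them) would yield the partially frameless samples the text mentions.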
Example one
Specifically, an embodiment of the present application provides a table structure identification method based on a cell constraint, and specifically, with reference to fig. 1 and fig. 2, the method includes:
s00, inputting different form images into corresponding target detection models according to form types to output position information of corresponding forms; the table position information comprises coordinates of any two nonadjacent vertexes of the table, and the types of the table are divided into a framed table and a frameless table;
in this step, the boundary of the table is obtained by calculating the difference between the number of cells outside the table and the number of cells inside the table, and the boundary is used as the frame of the table, thereby outputting the position information of the corresponding table.
In this embodiment, before input, the table types in the overall document need to be distinguished. Tables can be classified into framed tables and frameless tables; in practical applications they can also be classified by business type (e.g., invoice tables, order tables). The present application is therefore compatible with almost any table classification approach. As in the present embodiment, the coordinates of the table's upper-left and lower-right vertices are acquired.
Preferably, the table information extraction framework in the target detection model is composed of a plurality of neural networks, and each main target detection network takes the output of other networks as input. The framework is applicable to any type of object detection network. Based on the objective fact that a table must contain cells, a table boundary network uses a cell detection network to detect table boundaries, and a cell structure recognition network recognizes cell structures by table boundaries and table styles. The flow chart of the method of the invention is shown in figure 2.
In FIG. 2, during the training phase of the target detection model, the loss function comprises a regression loss, a classification loss, and a constraint loss. The regression and classification losses are the same as in general target detection models, while the constraint loss penalizes unrealistic table detection probabilities. Such a cell-constraint-based loss function can be applied to any detection network.
For this purpose, the application uses a cell network to generate a set of cell bounding boxes; the training data are the original electronic-document pictures, with no table position information provided. For a given table bounding-box candidate, the boundaries of the masked input region are defined by two Boolean operators, in_table() and out_table(). (The defining formulas were embedded as images in the source text and are not reproduced here.)
As far as can be recovered from the surrounding text: one operator involves the binary mask area of a bounding box and the area of the box, and is true if the cell coverage area is at most a coefficient multiplied by the relevant area difference; the other is true if any cells lie in the region between the two boundaries. Before evaluation, the bounding box is enlarged in length and width by a certain number of pixels and extended toward the bottom by a certain number of pixels.
The penalty function is true when any of the following conditions holds:
1) the proportion of the full table occupied by cells is less than a threshold;
2) the proportion of the area within the table that contains cells is less than a threshold;
3) a cell region exists outside the table;
4) the proportion of the area at the bottom of the table that contains cells is less than a threshold.
The constraint loss function is then defined in terms of these conditions and the detection probability of the table (formula omitted in the source).
In the inference stage, both the detection probability and whether cells exist inside the table are considered when determining a bounding box. For each cell bounding-box candidate, a constraint coefficient is defined (the defining formula was embedded as an image in the source text and is not reproduced here). The application therefore scores each table boundary by subtracting the number of cells inside the table from the number of cells outside it. For any pair of boxes whose overlap ratio is greater than the proportional threshold and whose detection probability is less than the threshold, the bounding box with the higher constraint coefficient is discarded.
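The suppression rule just described can be sketched as follows (a hedged interpretation of the overlap/probability test; field names such as `prob` and `coef` are assumptions, not the patent's notation):

```python
def suppress(boxes, iou_thresh=0.5, prob_thresh=0.5):
    """boxes: dicts with 'xyxy' (x1, y1, x2, y2), 'prob' (detection
    probability) and 'coef' (constraint coefficient).  For any overlapping
    pair whose probability falls below the threshold, drop the box with
    the HIGHER constraint coefficient, as described above."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0
    dropped = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if i in dropped or j in dropped:
                continue
            a, b = boxes[i], boxes[j]
            if iou(a['xyxy'], b['xyxy']) > iou_thresh and \
               min(a['prob'], b['prob']) < prob_thresh:
                dropped.add(i if a['coef'] > b['coef'] else j)
    return [b for k, b in enumerate(boxes) if k not in dropped]
```

This mirrors non-maximum suppression, except the tie-break uses the cell-constraint coefficient rather than the raw detection score.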
Most electronic document input contains more than just tables, so the tables are located first and then structurally analyzed. The table candidates in FIG. 2 are thus obtained: the positions of the tables in the electronic document are found, a mask matrix is generated, and a masked picture is produced. The table candidates are the outputs of the table detection model. Masks take many forms; the mask in this application is binary, i.e., a matrix of ones and zeros. In brief, with table regions indexed 1 and non-table regions indexed 0, element-wise multiplication invalidates the non-table regions. This amounts to obtaining the position information of the table.
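A minimal sketch of the binary mask and element-wise multiplication described above (pure Python, assuming a grayscale image as a nested list; a real implementation would use array operations):

```python
def apply_table_mask(image, table_box):
    """Zero out everything outside the detected table region.
    image: 2-D list (grayscale); table_box: (x1, y1, x2, y2) pixel coords.
    The binary mask is 1 inside the table and 0 elsewhere; element-wise
    multiplication then invalidates the non-table region."""
    h, w = len(image), len(image[0])
    mask = [[1 if table_box[0] <= x < table_box[2] and
                  table_box[1] <= y < table_box[3] else 0
             for x in range(w)] for y in range(h)]
    return [[image[y][x] * mask[y][x] for x in range(w)] for y in range(h)]
```

Downstream cell detection then operates only on the surviving (table) pixels, reducing interference from surrounding content.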
S10, acquiring whole-row and whole-column information of the table according to the table's position information, the information comprising the coordinates of any two non-adjacent vertices of each row and each column;
Likewise, in the present embodiment, the coordinates of the upper-left and lower-right vertices of each row and each column are acquired. The overlap of a row and a column is thus a cell; the cell is the minimum unit of a table and can be split or merged. Individual data are entered and modified in cells.
These steps detect the position of the table and thereby facilitate the subsequent table recognition operations; the aim is to analyze the positional relations of the cells in the table and so generate a complete table cell structure.
S20, identifying all texts and frames thereof in the table;
in this step, any character recognition method can be used to detect all the texts and their borders in the table.
S30, obtaining form structure information of the form through a machine translation method, wherein the form structure information comprises row and column quantity and layout information;
the machine translation means in the art that a text description is output for an input picture, and the actual meaning here is that a table picture is input, and a grammar (such as html language) describing a table structure is output, so that the general overall structure of the table can be known through the information.
In this step, the table image is aligned with the text position according to the cell position of the table image (according to the cell position in the complete electronic document picture with the mask), and is sampled in the vertical and horizontal directions, respectively, to determine the number of rows and columns of the table.
"Mask" is used here in its usual image-processing sense: the original image is the electronic document picture, and the mask invalidates the non-table parts of it to reduce interference. Ordinary electronic document pictures are not necessarily tables, and the main purpose of the present application is to analyze table structure. The text positions come from the text boxes obtained by the earlier character recognition. Cell determination is the hard part of table structure recognition, which is why cell positions are determined by several methods based on different detection principles.
Preferably, before sampling in the vertical direction, the left and right edges of each cell are extended, without overlapping existing cells, to find rows that lack cells; after the numbers of rows and columns of the table are determined, the vertical and horizontal alignment of the table is deduced, as layout information, by finding the cell borders closest to complete alignment with the known cells.
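The sampling step for determining the numbers of rows and columns can be sketched in Python (a simplified illustration; `count_rows_cols` and the run-counting heuristic are assumptions, not the patented procedure):

```python
def count_rows_cols(cell_boxes, height, width):
    """Estimate the numbers of table rows and columns by sampling:
    project the detected cell boxes onto each axis and count maximal
    covered runs.  cell_boxes are (x1, y1, x2, y2) in integer pixel
    coordinates; cells of the same row are assumed to share a
    y-interval (and likewise for columns)."""
    def runs(intervals, size):
        covered = [False] * size
        for lo, hi in intervals:
            for i in range(max(0, lo), min(size, hi)):
                covered[i] = True
        count, prev = 0, False
        for c in covered:            # count False -> True transitions
            if c and not prev:
                count += 1
            prev = c
        return count
    n_rows = runs([(y1, y2) for _, y1, _, y2 in cell_boxes], height)
    n_cols = runs([(x1, x2) for x1, _, x2, _ in cell_boxes], width)
    return n_rows, n_cols
```

Extending cell edges before the vertical pass, as the text describes, would close small gaps so that sparsely populated rows are still counted.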
S40, correcting cells in the table according to the position information of the table, the information of the whole row and the whole column of the table and the table structure information of the table, wherein the correction means comprises completing the cells, combining the cells and removing the cells;
in this step, cell borders are clustered (K-means clustering method) to determine the position of each row and each column, and the positions of the rows and columns are assigned according to the box position of each cell to merge the cells.
Preferably, the logical relationships of the cells are determined from their text content (cell content is typically ordered in a regular manner, e.g. starting with capital letters) so as to merge cells again; cell positions are assigned to the remaining text boxes that do not overlap any detected cell, while cells with nearby gaps are split. When the border of a text region intersects an adjacent empty row or column, the row or column span of the cell (which may be a hierarchical cell spanning multiple rows or columns) is increased.
And S50, clustering the coordinates of the cells in the corrected table to generate a cell structure of the table.
The machine translation method is relatively abstract (a grammatical text description), while the border-based method is a purely visual algorithm whose output is the coordinates of border lines. Machine translation can only obtain the rough structure of the table (notably the exact numbers of rows and columns), while the vision algorithm obtains exact coordinates but may miss some cells, columns, or rows. The aim of the application is to combine the two to complete the table's overall information.
Machine translation yields only the numbers of rows and columns of a table, without any coordinate information; the actual input of the clustering method is coordinate information, and coordinate points that are close together are clustered into one class, thereby merging redundant cells. Most current vision algorithms judge cells by their borders; however, once one border is missed, the whole structure falls apart. The relatively accurate row and column counts output by machine translation act as a constraint on the borders, making border coordinate detection more accurate.
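The use of the machine-translation row/column count as a constraint on the visually detected grid can be illustrated as follows (a hypothetical reconciliation step, not the patented procedure; merging the closest pair is one plausible strategy):

```python
def enforce_count(positions, expected):
    """Reconcile visually clustered grid-line positions with the row or
    column count obtained by machine translation: while there are too
    many positions, merge the two closest ones into their midpoint."""
    pos = sorted(positions)
    while len(pos) > expected and len(pos) >= 2:
        i = min(range(len(pos) - 1), key=lambda k: pos[k + 1] - pos[k])
        pos[i:i + 2] = [(pos[i] + pos[i + 1]) / 2]
    return pos
```

The symmetric case (fewer detected positions than expected) would trigger the cell-completion branch of step S40 instead.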
Thus, the application provides a vision-based table recognition framework comprising table detection and cell structure recognition, compatible with any target detection model and any optical character recognition model. Combined with any optical character recognition method, the complete table recognition pipeline parses the table structure (table position, cell positions, and the logical relations among cells) end to end from the electronic document, making subsequent content analysis easier and more effective.
Preferably, the overall logical structure of the table is obtained through the machine translation method, and the table structure information is supplemented with a vision-based row and column detection method.
Example two
Based on the same concept, the application also provides a table structure recognition device based on cell constraint, which comprises:
the input module is used for inputting different form images into corresponding target detection models according to the form types;
the target detection module is used for outputting the position information of the corresponding table through the corresponding target detection model according to the input table image; the table position information comprises coordinates of any two nonadjacent vertexes of the table, and the types of the table are divided into a framed table and a frameless table;
a row and column information obtaining module, configured to obtain information of a whole row and a whole column of the table according to the position information of the table, where the information includes coordinates of any two non-adjacent vertices in each row and each column;
the text recognition module is used for recognizing all text in the table together with its bounding boxes;
the machine translation module is used for acquiring the table structure information of the table through a machine translation method, wherein the table structure information comprises row and column quantity and layout information;
the correction module is used for correcting the cells in the table according to the position information of the table, the information of the whole row and the whole column of the table and the table structure information of the table, wherein the correction means comprises cell completion, cell combination and cell removal;
and the clustering output module is used for clustering the coordinates of the cells in the corrected table to output the cell structure of the table.
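As a sketch of the correction module's cell-merging step (a hypothetical helper, under the assumption that row and column center lines are already known from the row and column information module), each detected box can be assigned to the row and column whose center is nearest, and boxes landing in the same slot are merged:

```python
def assign_and_merge(boxes, row_centers, col_centers):
    """Assign each box (x0, y0, x1, y1) to the nearest (row, col) slot and
    merge boxes that fall into the same slot into one bounding box."""
    def nearest(value, centers):
        return min(range(len(centers)), key=lambda i: abs(centers[i] - value))

    slots = {}
    for x0, y0, x1, y1 in boxes:
        key = (nearest((y0 + y1) / 2, row_centers),
               nearest((x0 + x1) / 2, col_centers))
        if key in slots:  # redundant detection: merge into the union box
            a0, b0, a1, b1 = slots[key]
            slots[key] = (min(a0, x0), min(b0, y0), max(a1, x1), max(b1, y1))
        else:
            slots[key] = (x0, y0, x1, y1)
    return slots

# Two boxes detected for the same top-left cell are merged into one.
boxes = [(0, 0, 48, 20), (2, 1, 50, 22), (50, 0, 100, 20)]
print(assign_and_merge(boxes, row_centers=[10], col_centers=[25, 75]))
# → {(0, 0): (0, 0, 50, 22), (0, 1): (50, 0, 100, 20)}
```

Cell removal and completion follow the same slot bookkeeping: slots with no box are completed from the row/column coordinates, and boxes whose slot assignment conflicts with the machine-translated layout are removed.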
Example three
The present embodiment also provides an electronic device, referring to fig. 3, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to execute the computer program to perform the steps of any of the above method embodiments.
Specifically, the processor 402 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 404 may include mass storage for data or instructions. By way of example and not limitation, the memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid-state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (or fixed) media, where appropriate. The memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is a non-volatile memory. In particular embodiments, the memory 404 includes read-only memory (ROM) and random-access memory (RAM). The ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a static random-access memory (SRAM) or a dynamic random-access memory (DRAM), where the DRAM may be a fast page mode DRAM (FPM DRAM), an extended data output DRAM (EDO DRAM), a synchronous DRAM (SDRAM), or the like.
Memory 404 may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by processor 402.
The processor 402 implements any of the cell constraint based table structure identification methods of the above embodiments by reading and executing computer program instructions stored in the memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network interface controller (NIC) that can connect to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 406 may be a radio frequency (RF) module configured to communicate with the internet wirelessly.
The input/output device 408 is used to input or output information. In this embodiment, the input information may be an electronic-document image of a table or the like, and the output information may be the cell structure of the table or the like.
Example four
The present embodiment also provides a readable storage medium having stored therein a computer program including program code for controlling a process to execute a process, the process including the cell constraint-based table structure identification method according to the first embodiment.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also called program products) including software routines, applets and/or macros can be stored in any device-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may comprise one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard it should be noted that any block of the logic flow as in the figures may represent a program step, or an interconnected logic circuit, block and function, or a combination of a program step and a logic circuit, block and function. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. The table structure identification method based on the cell constraint is characterized by comprising the following steps of:
s00, inputting different form images into corresponding target detection models according to form types so as to output position information of corresponding forms; the position information of the table comprises coordinates of any two non-adjacent vertexes of the table, and the types of the table are divided into a framed table and a frameless table;
s10, acquiring information of a whole row and a whole column of the table according to the position information of the table, wherein the information comprises coordinates of any two nonadjacent vertexes of each row and each column;
s20, identifying all texts and borders in the table;
s30, obtaining table structure information of the table through a machine translation method, wherein the table structure information comprises row and column quantity and layout information;
s40, correcting the cells in the table according to the position information of the table, the information of the whole row and the whole column of the table and the table structure information of the table, wherein the correction means comprises cell completion, cell merging and cell removal;
s50, clustering the coordinates of the cells in the corrected table to generate a cell structure of the table.
2. The method for identifying a table structure based on cell constraints as claimed in claim 1, wherein in step S00, the boundary of the table is obtained by calculating the difference between the number of cells outside the table and the number of cells inside the table, and the boundary is used as the border of the table, so as to output the position information of the corresponding table.
3. The cell-constraint-based table structure recognition method of claim 1, wherein in step S30, the table image is aligned with a text position according to a cell position of the table image, and is sampled in vertical and horizontal directions, respectively, to determine a row number and a column number of the table.
4. The cell-constraint-based table structure recognition method of claim 3, wherein, in step S30, before sampling in the vertical direction, the left and right edges of the cell are spread without overlapping with the existing cell to determine the row lacking the cell, and after determining the row number and the column number of the table, the vertical and horizontal alignment of the table is deduced by finding the cell border closest to complete alignment with the known cell, which serves as layout information.
5. The method of claim 4, wherein in step S40, cell borders are clustered to determine the position of each row and each column, and the positions of the rows and columns are allocated according to the box position of each cell to merge the cells.
6. The cell constraint-based table structure identification method of claim 5, wherein in step S40, the logical relationship of the cells is determined according to the text content of the cells, so as to merge the cells again, and the cell positions are allocated to the remaining text boxes which do not overlap with any detected cells, while the nearby gapped cells are split.
7. The cell-constraint-based table structure recognition method of claim 6, wherein in step S40, when the border of the text intersects with an adjacent empty row or column, the span of the row or column of the cell is increased.
8. A table structure recognition apparatus based on cell constraint, comprising:
the input module is used for inputting different form images into corresponding target detection models according to the form types;
the target detection module is used for outputting the position information of the corresponding table through the corresponding target detection model according to the input table image; the table position information comprises coordinates of any two nonadjacent vertexes of the table, and the types of the table are divided into a framed table and a frameless table;
a row and column information acquisition module, configured to acquire information of a whole row and a whole column of the table according to the position information of the table, where the information includes coordinates of any two non-adjacent vertices in each row and each column;
the text recognition module is used for recognizing all texts and frames thereof in the table;
the machine translation module is used for acquiring the table structure information of the table through a machine translation method, wherein the table structure information comprises row and column quantity and layout information;
the correction module is used for correcting the cells in the table according to the position information of the table, the information of the whole row and the whole column of the table and the table structure information of the table, wherein the correction means comprises cell completion, cell combination and cell removal;
and the clustering output module is used for clustering the coordinates of the cells in the corrected table to generate a cell structure of the output table.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for identifying a table structure based on cell constraints according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that a computer program is stored therein, the computer program comprising program code for controlling a process to execute a process, the process comprising the cell constraint-based table structure identification method according to any one of claims 1 to 7.
CN202211506901.9A 2022-11-29 2022-11-29 Table structure identification method based on cell constraint and application thereof Withdrawn CN115546809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211506901.9A CN115546809A (en) 2022-11-29 2022-11-29 Table structure identification method based on cell constraint and application thereof


Publications (1)

Publication Number Publication Date
CN115546809A true CN115546809A (en) 2022-12-30

Family

ID=84722032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211506901.9A Withdrawn CN115546809A (en) 2022-11-29 2022-11-29 Table structure identification method based on cell constraint and application thereof

Country Status (1)

Country Link
CN (1) CN115546809A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071770A (en) * 2023-03-06 2023-05-05 深圳前海环融联易信息科技服务有限公司 Method, device, equipment and medium for general identification of form
CN116071771A (en) * 2023-03-24 2023-05-05 南京燧坤智能科技有限公司 Table reconstruction method and device, nonvolatile storage medium and electronic equipment
CN117523591A (en) * 2023-11-20 2024-02-06 深圳市六六六国际旅行社有限公司 Table structure identification method, equipment and storage medium based on frame clustering

Similar Documents

Publication Publication Date Title
US11450125B2 (en) Methods and systems for automated table detection within documents
CN110516208B (en) System and method for extracting PDF document form
CN110032998B (en) Method, system, device and storage medium for detecting characters of natural scene picture
CN115546809A (en) Table structure identification method based on cell constraint and application thereof
CN110502985B (en) Form identification method and device and form identification equipment
CN112597773B (en) Document structuring method, system, terminal and medium
RU2697649C1 (en) Methods and systems of document segmentation
CN112308946B (en) Question generation method and device, electronic equipment and readable storage medium
CN112949476B (en) Text relation detection method, device and storage medium based on graph convolution neural network
CN113239818B (en) Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN115240213A (en) Form image recognition method and device, electronic equipment and storage medium
CN110889437A (en) Image processing method and device, electronic equipment and storage medium
CN115797955A (en) Table structure identification method based on cell constraint and application thereof
Suleyman et al. An adaptive threshold algorithm for offline Uyghur handwritten text line segmentation
CN114926849A (en) Text detection method, device, equipment and storage medium
CN110688995B (en) Map query processing method, computer-readable storage medium and mobile terminal
CN112418220A (en) Single word detection method, device, equipment and medium
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN115775386A (en) User interface component identification method and device, computer equipment and storage medium
CN113269153B (en) Form identification method and device
CN109409362A (en) The detection of picture sensitive word and localization method and device based on tesseract engine
CN114495144A (en) Method and device for extracting form key-value information in text image
CN116311301B (en) Wireless form identification method and system
CN104112135A (en) Text image extraction device and method
EP4125066B1 (en) Method and system for table structure recognition via deep spatial association of words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20221230)