CN116631000A - Table reconstruction method based on semantic segmentation and text recognition - Google Patents

Table reconstruction method based on semantic segmentation and text recognition

Info

Publication number
CN116631000A
Authority
CN
China
Prior art keywords
text
map
region
semantic segmentation
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310222396.3A
Other languages
Chinese (zh)
Inventor
胡彦鹏
沙曼
宋海川
汤昌林
张伟建
郑智鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Huaxin Co ltd
Original Assignee
Shanghai Huaxin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Huaxin Co ltd filed Critical Shanghai Huaxin Co ltd
Priority to CN202310222396.3A
Publication of CN116631000A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/247 Aligning, centring, orientation detection or correction of the image by affine transforms, e.g. correction due to perspective effects; Quadrilaterals, e.g. trapezoids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127 Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a table reconstruction method, device, terminal, and computer-readable storage medium based on semantic segmentation and text recognition. The method uses a convolutional neural network to automatically extract image features and performs semantic segmentation on the image multiple times, so that when a table in an image captured under complex conditions is reconstructed, the deviation from the real table is reduced.

Description

Table reconstruction method based on semantic segmentation and text recognition
Technical Field
The invention relates to the fields of computer data processing and image processing, and in particular to a method for image semantic recognition and table reconstruction based on machine learning algorithms.
Background
With the rapid growth of document applications, particularly documents containing tables, automatically extracting tables from documents and representing them in a structured manner, i.e., table reconstruction, is an important and challenging task. Table reconstruction comprises table detection and table structure recognition: table detection refers to detecting the region where a table is located on a page, and table structure recognition refers to identifying the content and structure of the table based on the detected table region.
In the prior art, lines in a picture are usually detected with traditional image processing methods, such as morphological erosion and dilation operations, and the table is reconstructed from the intersection coordinates between the lines. However, due to factors such as shooting conditions and document quality, traditional image processing methods cannot handle distorted, deformed, or broken lines well, so the reconstructed table deviates from the actual table.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a table reconstruction method based on semantic segmentation and text recognition, which uses a convolutional neural network to automatically extract image features and performs semantic segmentation on the image multiple times, so that when a table in an image captured under complex conditions is reconstructed, the deviation between the reconstructed table and the real table can be reduced.
According to a first aspect of the present invention, a table reconstruction method based on semantic segmentation and text recognition is disclosed, the method comprising:
step S1, text region detection: performing a first semantic segmentation on the table text picture to obtain text regions, and then obtaining text region coordinates using a minimum-area rectangular box algorithm;
step S2, text region recognition: extracting features of the text regions, and predicting the extracted feature sequences as text content;
step S3, table coordinate prediction: performing a second semantic segmentation on the table text picture to obtain table grid lines, and post-processing to obtain table coordinates and cell coordinates;
step S4, text-to-table matching: matching the text content with the table cells according to the text region coordinates and the cell coordinates to obtain the target table.
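For orientation only, the data flow of steps S1-S4 can be sketched in a few lines of Python; every name below is a hypothetical placeholder injected by the caller, not part of the disclosed method:

```python
# Illustrative glue for steps S1-S4. The four callables are hypothetical
# placeholders for the detection, recognition, table-coordinate-prediction,
# and matching stages described above.
def reconstruct_table(image, detect, recognize, predict_cells, match):
    text_boxes = detect(image)                             # S1: text region coordinates
    texts = [recognize(image, box) for box in text_boxes]  # S2: text content per region
    table_box, cell_boxes = predict_cells(image)           # S3: table and cell coordinates
    return match(text_boxes, texts, cell_boxes)            # S4: text-to-cell matching
```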
Preferably, the step S1 specifically includes:
extracting a first feature map of the table text picture using a first convolutional neural network, upsampling it, and eliminating the aliasing effect caused by upsampling; performing a channel-reducing operation on the first feature map without changing its spatial size, and restoring it to a first probability map of the original image size using transposed convolution; and converting the first probability map into a first segmentation binary map using a first segmentation probability threshold, where regions with value 1 are text regions, and then obtaining the minimum circumscribed rectangle of each text region as the text region coordinates.
Preferably, the first convolutional neural network is a convolutional neural network based on an FPN (Feature Pyramid Networks) structure.
Preferably, the minimum circumscribed rectangle is obtained using the Graham algorithm.
Preferably, the step S2 specifically includes: constructing an affine transformation matrix according to the minimum circumscribed rectangle, cutting out the corresponding region of the table text picture, and rectifying the cut picture into a rectangle through the affine transformation to obtain a plain-text picture; calculating a scaling factor for the plain-text picture according to a preset fixed height, and adjusting the height of the plain-text picture to the fixed size using bilinear interpolation applied multiple times, without changing the aspect ratio; extracting a feature map of the plain-text picture through convolution and max-pooling operations; and dividing the feature map of the plain-text picture into a sequence of feature vectors along the width, performing sequence prediction with a long short-term memory network to obtain a text sequence, and merging identical adjacent characters in the text sequence to obtain the text recognition result.
Preferably, the step S3 specifically includes: extracting a second feature map of the table text picture using a second convolutional neural network, downsampling through convolution and max-pooling operations and upsampling through deformable convolution, and mixing second feature maps of different granularities to enlarge the receptive field of the second feature map; based on the second feature map, reducing the number of channels through convolution to obtain a second probability map with the same width and height as the original image, and predicting the horizontal and vertical lines in the table; processing the second probability map with a second segmentation probability threshold to obtain second segmentation binary maps of the horizontal and vertical lines; and intersecting the second segmentation binary maps to obtain a table binary map, setting the table line region and the table cell regions to different values, and obtaining the corresponding maximum circumscribed rectangles as the table coordinates and the cell coordinates, respectively.
Preferably, the second convolutional neural network is a convolutional neural network based on a U-Net structure.
Preferably, the first convolutional neural network and the second convolutional neural network may be different convolutional neural networks.
Preferably, the first segmentation probability threshold and the second segmentation probability threshold may be the same value.
Preferably, the maximum circumscribed rectangle is obtained using the Graham algorithm.
Preferably, the step S4 includes: calculating the center point of each text region from its coordinates and matching it against the table cells, wherein if the center point is inside a table cell, the text region is successfully matched with that table cell.
Preferably, the step S4 further includes: if text regions in the same table cell have an intersection on the ordinate, merging the text regions in ascending order of the abscissa, and then merging the merged text regions in order of the ordinate to obtain the table reconstruction result.
According to a second aspect of the present invention, there is disclosed a table reconstruction device based on semantic segmentation and text recognition, the device comprising:
text region detection unit: performing a first semantic segmentation on the table text picture to obtain text regions, and then obtaining text region coordinates using a minimum-area rectangular box algorithm;
text region recognition unit: extracting features of the text regions, and predicting the extracted feature sequences as text content;
table coordinate prediction unit: performing a second semantic segmentation on the table text picture to obtain table grid lines, and post-processing to obtain table coordinates and cell coordinates;
text-to-table matching unit: matching the text content with the table cells according to the text region coordinates and the cell coordinates to obtain the target table.
According to a third aspect of the present invention, a terminal is disclosed comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor, when executing the computer program, performs any of the methods described above.
According to a fourth aspect of the present invention, a computer-readable storage medium is disclosed, having stored thereon a computer program which, when executed by a processor, performs any of the methods described above.
The method has the beneficial effect that a neural network algorithm is applied to text region detection and table coordinate prediction, so regions can be detected and coordinates predicted accurately and quickly without manual identification, while the probability of self-correction is improved.
The text recognition has the beneficial effect that a long short-term memory (LSTM) network is used for sequence prediction to obtain a text sequence, and identical adjacent characters in the text sequence are merged to obtain the final text recognition result; semantic information in a text line can thus be extracted in both the forward and backward directions, which benefits the text line recognition task.
The method has the beneficial effect that the text region coordinates and the cell region coordinates are determined separately, and on this basis the text regions are matched with the table cells, so matching and table reconstruction can be carried out accurately and in order, reducing the probability of errors.
The method has the beneficial effect that, when processing table images with distorted, deformed, or broken lines, the deviation between the reconstructed table and the real table is smaller, giving the method good application prospects in the field of table reconstruction. To address the difficulty traditional methods have with structurally diverse tables, the invention also extracts image features through deep learning, improving the robustness of the algorithm and enhancing the table reconstruction results.
Drawings
FIG. 1 is a flow chart of a table reconstruction method based on semantic segmentation and text recognition provided by the invention;
FIG. 2 is a flowchart illustrating a text region detection step of the table reconstruction method according to the present invention;
FIG. 3 is a flowchart illustrating a text region identification step of the table reconstruction method according to the present invention;
FIG. 4 is a flowchart illustrating a table coordinate prediction step of the table reconstruction method according to the present invention;
FIG. 5 is a flowchart of the text-to-table matching step of the table reconstruction method provided by the present invention.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Example 1:
Referring to fig. 1, an image containing table text is taken as input; on one hand, text region coordinate prediction and text recognition are performed, and on the other hand, table region coordinate prediction is performed; finally, the text region coordinates carrying the recognition results are matched with the table region coordinates at the cell level to obtain the final table reconstruction result.
Referring to fig. 2, step S1 predicts the text region coordinates of the image containing the table text: after text features are extracted, a text probability map is predicted and converted into a segmentation binary map using a fixed threshold, and finally the minimum circumscribed rectangle of each text region, i.e., the text region coordinates, is obtained using the Graham algorithm. The specific steps are as follows:
S100: using a convolutional neural network based on an FPN (Feature Pyramid Networks) structure, extract feature maps at 1/2, 1/4, 1/8, 1/16, and 1/32 of the original image size; in order from the smallest feature map to the largest, perform 2x upsampling with deformable convolution and connect the result to the previously extracted feature map of the same size, then eliminate the aliasing effect caused by the upsampling with a 3x3 convolution; finally, upsample the de-aliased feature maps once more so that they are unified at 1/4 of the original image size.
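A minimal PyTorch sketch of the kind of top-down fusion S100 describes is given below; the channel counts are assumptions, and plain bilinear upsampling with element-wise addition stands in for the deformable-convolution upsampling and connection used in the patent:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion of multi-scale features, loosely following S100."""
    def __init__(self, in_channels=(64, 128, 256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convolutions suppress the aliasing introduced by upsampling
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats: maps at 1/2, 1/4, 1/8, 1/16, 1/32 of the input size
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):   # fuse coarse to fine
            up = F.interpolate(laterals[i], size=laterals[i - 1].shape[-2:],
                               mode="bilinear", align_corners=False)
            laterals[i - 1] = self.smooth[i - 1](laterals[i - 1] + up)
        # bring every level to the 1/4-scale resolution and merge
        target = laterals[1].shape[-2:]
        return sum(F.interpolate(x, size=target, mode="bilinear",
                                 align_corners=False) for x in laterals)
```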
S110: perform a channel-reducing operation on the feature map obtained in step S100 without changing its spatial size, simplifying the features while reducing the amount of computation; then upsample with two transposed convolutions to recover a probability map with the original image size and a single channel, where each value of the probability map represents the probability that the pixel belongs to a text region.
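The S110 head could look like the following sketch, where the intermediate channel count, the activations, and the sigmoid output are assumptions not stated in the patent:

```python
import torch.nn as nn

class TextProbabilityHead(nn.Module):
    """Sketch of S110: channel reduction, then two stride-2 transposed
    convolutions that take the 1/4-scale feature map back to full
    resolution as a single-channel text probability map."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),  # reduce channels
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2),   # 1/4 -> 1/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=2, stride=2),    # 1/2 -> full size
            nn.Sigmoid())  # value = probability the pixel belongs to text

    def forward(self, x):
        return self.head(x)
```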
S120: using the preset first segmentation probability threshold of 0.8, set the value of regions of the probability map whose probability is greater than 0.8 to 1 and the value of the remaining regions to 0, obtaining a segmentation binary map in which the regions with value 1 are text regions.
S130: obtain the minimum circumscribed rectangle of each text region, i.e., the text region coordinates, using the Graham algorithm.
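Steps S120-S130 can be pictured with OpenCV as below; this is a sketch only, and cv2.minAreaRect (convex hull plus rotating calipers) is used here in place of the Graham-algorithm-based minimum circumscribed rectangle:

```python
import cv2
import numpy as np

def text_boxes_from_probability(prob_map, threshold=0.8):
    """S120: binarize the probability map; S130: minimum-area rectangles."""
    binary = (prob_map > threshold).astype(np.uint8)   # 1 = text region
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        rect = cv2.minAreaRect(contour)    # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))  # four corner points of the rectangle
    return boxes
```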
Referring to fig. 3, step S2 performs text recognition on the text regions: an affine transformation matrix is computed from the text region coordinates, each text region is mapped into a rectangle by the affine transformation and its height adjusted to a fixed value, then image features are extracted and converted into a sequence for prediction, yielding the text recognition result. The specific steps are as follows:
S200: construct an affine transformation matrix from the minimum circumscribed rectangle obtained in step S1, decomposed into a 3x3 rotation matrix R and a 1x3 translation vector t.
S210: cut the corresponding region out of the original image containing the table text, and rectify the cut image into a rectangle through the affine transformation to obtain a plain-text image.
S220: according to the preset fixed height of 32, calculate the scaling factor of the plain-text image and adjust its height to 32 using bilinear interpolation applied multiple times, without changing the aspect ratio.
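Steps S200-S220 amount to cropping a rotated box upright and normalizing its height. A sketch follows, under the assumption that the four corner points arrive ordered top-left, top-right, bottom-right, bottom-left (OpenCV does not guarantee this ordering):

```python
import cv2
import numpy as np

def rectify_text_region(image, box, target_height=32):
    """Crop a rotated text box to an upright strip of fixed height."""
    tl, tr, br, bl = box.astype(np.float32)
    w = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    h = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    # three point correspondences fully determine the affine transform (S200)
    matrix = cv2.getAffineTransform(np.float32([tl, tr, bl]),
                                    np.float32([[0, 0], [w, 0], [0, h]]))
    patch = cv2.warpAffine(image, matrix, (w, h))   # S210: rectified crop
    scale = target_height / h                       # S220: fixed height of 32
    return cv2.resize(patch, (max(1, int(w * scale)), target_height),
                      interpolation=cv2.INTER_LINEAR)
```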
S230: extract the feature map of the plain-text image: through 3x3 convolution kernels and max-pooling operations, obtain a feature map whose width is 1/4 that of the plain-text image, whose height is 1, and which has 512 channels.
S240: divide the feature map along the width into a sequence of feature vectors of length 512, perform sequence prediction with a long short-term memory (LSTM) network to obtain a text sequence, and merge identical adjacent characters in the text sequence to obtain the final text recognition result.
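A PyTorch sketch of S240 follows; the hidden size, vocabulary size, blank index, and the bidirectional configuration are assumptions, and the duplicate-merging corresponds to greedy CTC-style decoding:

```python
import torch.nn as nn

class SequenceRecognizer(nn.Module):
    """Width-wise feature slices -> bidirectional LSTM -> per-step labels."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=6000, blank=0):
        super().__init__()
        self.blank = blank
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feature_map):
        # feature_map: (batch, 512, 1, W) -> W feature vectors of length 512
        seq = feature_map.squeeze(2).permute(0, 2, 1)
        out, _ = self.lstm(seq)
        return self.fc(out)   # (batch, W, num_classes)

    def decode(self, logits):
        """Merge identical adjacent predictions; returns label ids (the
        mapping from ids to characters via a charset is assumed)."""
        ids = logits.argmax(-1)[0].tolist()
        merged, prev = [], None
        for i in ids:
            if i != prev and i != self.blank:
                merged.append(i)
            prev = i
        return merged
```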
Referring to fig. 4, step S3 predicts the table region coordinates of the image containing the table text: features are extracted, a table probability map is predicted and converted into a segmentation binary map using a fixed threshold, and finally the maximum circumscribed rectangles of the table region and the table cells, i.e., the table coordinates and the cell coordinates, are obtained using the Graham algorithm. The specific steps are as follows:
S300: obtain feature maps at 1/2, 1/4, 1/8, and 1/16 of the size of the image containing the table text through 3x3 convolution kernels and 2x2 max-pooling operations; then upsample the feature maps in order from smallest to largest through 2D deformable convolutions with 2x2 kernels, connecting feature maps of the same size, to finally obtain a feature map with the same width and height as the original image and 128 channels.
S310: based on the feature map obtained in step S300, reduce the number of channels to 64 through two convolutions with 3x3 kernels, and finally obtain, through a convolution with a 1x1 kernel, a probability map with the same width and height as the original image and 2 channels, predicting the horizontal lines and the vertical lines of the table respectively.
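The S310 head is small enough to write out; the ReLU activations and sigmoid output are assumptions the patent does not specify:

```python
import torch.nn as nn

class LinePredictionHead(nn.Module):
    """Sketch of S310: two 3x3 convolutions down to 64 channels, then a
    1x1 convolution to a 2-channel map (horizontal lines / vertical lines)."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, kernel_size=1),
            nn.Sigmoid())   # per-pixel line probabilities

    def forward(self, x):
        return self.head(x)
```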
S320: using the preset second segmentation probability threshold of 0.8, set the value of regions of the probability map whose probability is greater than 0.8 to 1 and the value of the remaining regions to 0, obtaining segmentation binary maps of the horizontal lines and of the vertical lines respectively.
S330: intersect the segmentation binary maps of the horizontal lines and the vertical lines to obtain a table binary map, set the value of the region outside the table grid lines to 2, and use the Graham algorithm to obtain the maximum circumscribed rectangles of the regions with values 1 and 0, i.e., the table region coordinates and the cell region coordinates, respectively.
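One plausible reading of the S330 post-processing, offered as an assumption rather than the patent's exact procedure, is to combine the two line masks and read the cells off as connected components of the enclosed hole regions:

```python
import cv2
import numpy as np

def table_and_cell_boxes(horiz_mask, vert_mask):
    """Derive table and cell rectangles from 0/1 line masks (S330 sketch)."""
    grid = cv2.bitwise_or(horiz_mask, vert_mask)   # table grid lines
    ys, xs = np.nonzero(grid)
    table_box = (xs.min(), ys.min(), xs.max(), ys.max())
    # cells = connected components of the non-line area enclosed by the grid
    holes = (grid == 0).astype(np.uint8)
    num, _, stats, _ = cv2.connectedComponentsWithStats(holes)
    cell_boxes = []
    for i in range(1, num):                        # label 0 is the background
        x, y, w, h, _ = stats[i]
        if (x > table_box[0] and y > table_box[1] and
                x + w < table_box[2] and y + h < table_box[3]):
            cell_boxes.append((x, y, x + w, y + h))
    return table_box, cell_boxes
```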
Referring to fig. 5, step S4 matches the predicted table region coordinates against the text region coordinates carrying the recognition results: whether a text belongs to a table cell is determined from the center point of its text region, and the texts in the same cell are then merged. The specific steps are as follows:
S400: calculate the center point of each text region from its coordinates and match it against the table cells; if the center point lies inside a table cell, the text region is successfully matched with that cell.
S410: if text regions in the same table cell intersect on the ordinate, merge them in ascending order of the abscissa, and then merge the merged text regions in order of the ordinate of the region center point; this avoids the text misrecognition that line wrapping caused by the cell size might otherwise produce.
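Steps S400-S410 reduce to point-in-rectangle tests plus an ordered merge. A sketch assuming axis-aligned boxes in (x1, y1, x2, y2) form (the rotated boxes of step S1 would first be reduced to their bounding boxes):

```python
def match_text_to_cells(text_boxes, texts, cell_boxes):
    """S400: assign each text region to the cell containing its center
    point; S410 (simplified): merge a cell's fragments in reading order."""
    cells = {i: [] for i in range(len(cell_boxes))}
    for box, text in zip(text_boxes, texts):
        cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
        for i, (x1, y1, x2, y2) in enumerate(cell_boxes):
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                cells[i].append((box, text))   # center point lies in the cell
                break
    merged = {}
    for i, entries in cells.items():
        # order fragments by the ordinate of the center point, then by abscissa
        entries.sort(key=lambda e: ((e[0][1] + e[0][3]) / 2.0, e[0][0]))
        merged[i] = "".join(t for _, t in entries)
    return merged
```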
Example 2:
In order to implement the steps of Embodiment 1 and achieve the corresponding technical effects, the table reconstruction method may be implemented in a hardware device or in the form of software modules. When implemented in the form of software modules, a table reconstruction device based on semantic segmentation and text recognition is further provided, whose functional modules include: a text region detection unit, a text region recognition unit, a table coordinate prediction unit, and a text-to-table matching unit.
The text region detection unit performs the following steps: extracting a first feature map of the table text picture using a first convolutional neural network, upsampling it, and eliminating the aliasing effect caused by upsampling; performing a channel-reducing operation on the first feature map without changing its spatial size, and restoring it to a first probability map of the original image size using transposed convolution; and converting the first probability map into a first segmentation binary map using a first segmentation probability threshold, where regions with value 1 are text regions, and then obtaining the minimum circumscribed rectangle of each text region as the text region coordinates.
The text region recognition unit performs the following steps: constructing an affine transformation matrix according to the minimum circumscribed rectangle, cutting out the corresponding region of the table text picture, and rectifying the cut picture into a rectangle through the affine transformation to obtain a plain-text picture; calculating a scaling factor for the plain-text picture according to a preset fixed height, and adjusting the height of the plain-text picture to the fixed size using bilinear interpolation applied multiple times, without changing the aspect ratio; extracting a feature map of the plain-text picture through convolution and max-pooling operations; and dividing the feature map of the plain-text picture into a sequence of feature vectors along the width, performing sequence prediction with a long short-term memory network to obtain a text sequence, and merging identical adjacent characters in the text sequence to obtain the text recognition result.
The table coordinate prediction unit performs the following steps: extracting a second feature map of the table text picture using a second convolutional neural network, downsampling through convolution and max-pooling operations and upsampling through deformable convolution, and mixing second feature maps of different granularities to enlarge the receptive field of the second feature map; based on the second feature map, reducing the number of channels through convolution to obtain a second probability map with the same width and height as the original image, and predicting the horizontal and vertical lines in the table; processing the second probability map with a second segmentation probability threshold to obtain second segmentation binary maps of the horizontal and vertical lines; and intersecting the second segmentation binary maps to obtain a table binary map, setting the table line region and the table cell regions to different values, and obtaining the corresponding maximum circumscribed rectangles as the table coordinates and the cell coordinates, respectively.
The text-to-table matching unit performs the following steps: calculating the center point of each text region from its coordinates and matching it against the table cells, where if the center point is inside a table cell, the text region is successfully matched with that cell; and if text regions in the same table cell intersect on the ordinate, merging them in ascending order of the abscissa, and then merging the merged text regions in order of the ordinate to obtain the table reconstruction result.
When a table image of poor quality is processed, hardware equipment containing the table reconstruction device extracts image features through deep learning, improving the robustness of the algorithm, and performs semantic segmentation on the image multiple times using different convolutional neural networks, so the deviation between the reconstructed table and the real table can be reduced significantly.
In describing embodiments of the present invention, it should be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "center", "top", "bottom", "inner", "outer", "inside", "outside", etc. indicate orientations or positional relationships based on the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Wherein "inside" refers to an interior or enclosed area or space. "peripheral" refers to the area surrounding a particular component or region.
In the description of embodiments of the present invention, the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first", "second", "third", or "fourth" may explicitly or implicitly include one or more such features. In the description of the present invention, unless otherwise indicated, "a plurality" means two or more.
In describing embodiments of the present invention, it should be noted that the terms "mounted," "connected," and "assembled" are to be construed broadly unless otherwise specifically indicated and defined: a connection may be fixed, detachable, or integral; it may be direct or indirect through an intermediate medium; and it may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art on a case-by-case basis.
In the description of embodiments of the invention, a particular feature, structure, material, or characteristic may be combined in any suitable manner in one or more embodiments or examples.
In the description of the embodiments of the present invention, it is to be understood that the symbols "~" and "-" both denote a range between two values, inclusive of the endpoints. For example, "A~B" (or equivalently "A-B") means a range greater than or equal to A and less than or equal to B.
In the description of embodiments of the present invention, the term "and/or" is merely an association relationship describing an association object, meaning that three relationships may exist, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A table reconstruction method based on semantic segmentation and text recognition, the method comprising:
step S1, text region detection: performing a first semantic segmentation on the table text picture to obtain text regions, and then obtaining text region coordinates using a minimum-area rectangular box algorithm;
step S2, text region recognition: extracting features of the text regions, and predicting the extracted feature sequences as text content;
step S3, table coordinate prediction: performing a second semantic segmentation on the table text picture to obtain table grid lines, and post-processing to obtain table coordinates and cell coordinates;
step S4, text-to-table matching: matching the text content with the table cells according to the text region coordinates and the cell coordinates to obtain the target table.
2. The method for reconstructing a table based on semantic segmentation and text recognition according to claim 1, wherein said step S1 specifically comprises:
extracting a first feature map of the table text picture using a first convolutional neural network, upsampling it, and eliminating the aliasing effect caused by upsampling;
performing a channel-reducing operation on the first feature map without changing its spatial size, and restoring it to a first probability map of the original image size using transposed convolution;
and converting the first probability map into a first segmentation binary map using a first segmentation probability threshold, where regions with value 1 are text regions, and then obtaining the minimum circumscribed rectangle of each text region as the text region coordinates.
3. The method of claim 2, wherein the first convolutional neural network is a convolutional neural network based on an FPN structure.
4. The method for reconstructing a table based on semantic segmentation and text recognition according to claim 1, wherein said step S2 specifically comprises:
constructing an affine transformation matrix according to the minimum circumscribed rectangle, cutting out the corresponding region of the table text picture, and rectifying the cut picture into a rectangle through the affine transformation to obtain a plain-text picture;
calculating a scaling factor for the plain-text picture according to a preset fixed height, and adjusting the height of the plain-text picture to the fixed size using bilinear interpolation applied multiple times, without changing the aspect ratio;
extracting a feature map of the plain-text picture through convolution and max-pooling operations;
and dividing the feature map of the plain-text picture into a sequence of feature vectors along the width, performing sequence prediction with a long short-term memory network to obtain a text sequence, and merging identical adjacent characters in the text sequence to obtain the text recognition result.
5. The method for reconstructing a table based on semantic segmentation and text recognition according to claim 1, wherein said step S3 specifically comprises:
extracting a second feature map of the table text picture using a second convolutional neural network, downsampling through convolution and max-pooling operations and upsampling through deformable convolution, and mixing second feature maps of different granularities to enlarge the receptive field of the second feature map;
based on the second feature map, reducing the number of channels through convolution to obtain a second probability map with the same width and height as the original image, and predicting the horizontal and vertical lines in the table;
processing the second probability map with a second segmentation probability threshold to obtain second segmentation binary maps of the horizontal and vertical lines;
and intersecting the second segmentation binary maps to obtain a table binary map, setting the table line region and the table cell regions to different values, and obtaining the corresponding maximum circumscribed rectangles as the table coordinates and the cell coordinates, respectively.
6. The semantic segmentation and text recognition based table reconstruction method according to claim 5, wherein the second convolutional neural network is a U-Net structure based convolutional neural network.
7. The method for table reconstruction based on semantic segmentation and text recognition according to claim 1, wherein the step S4 comprises:
calculating the center point of each text region from its coordinates and matching it against the table cells, wherein if the center point is inside a table cell, the text region is successfully matched with that table cell.
8. The method for table reconstruction based on semantic segmentation and text recognition according to claim 1, wherein the step S4 further comprises:
if text regions in the same table cell have an intersection on the ordinate, the text regions are merged in ascending order of the abscissa, and the merged text regions are then merged in order of the ordinate to obtain the table reconstruction result.
9. A table reconstruction device based on semantic segmentation and text recognition, the device comprising:
text region detection unit: performing a first semantic segmentation on the table text picture to obtain text regions, and then obtaining text region coordinates using a minimum-area rectangular box algorithm;
text region recognition unit: extracting features of the text regions, and predicting the extracted feature sequences as text content;
table coordinate prediction unit: performing a second semantic segmentation on the table text picture to obtain table grid lines, and post-processing to obtain table coordinates and cell coordinates;
text-to-table matching unit: matching the text content with the table cells according to the text region coordinates and the cell coordinates to obtain the target table.
10. A terminal comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor, when executing the computer program, performs the method of any one of claims 1-8.
CN202310222396.3A (filed 2023-03-09, priority date 2023-03-09) - Table reconstruction method based on semantic segmentation and text recognition - Pending - published as CN116631000A

Priority Applications (1)

CN202310222396.3A (priority date 2023-03-09, filing date 2023-03-09) - Table reconstruction method based on semantic segmentation and text recognition

Applications Claiming Priority (1)

CN202310222396.3A (priority date 2023-03-09, filing date 2023-03-09) - Table reconstruction method based on semantic segmentation and text recognition

Publications (1)

CN116631000A - published 2023-08-22

Family

Family ID: 87608775

Family Applications (1)

CN202310222396.3A (pending) - Table reconstruction method based on semantic segmentation and text recognition

Country Status (1)

Country: CN (publication CN116631000A)

Similar Documents

Publication - Title
CN108520247B (en) Method, device, terminal and readable medium for identifying object node in image
CN110647795B (en) Form identification method
CN111275129A (en) Method and system for selecting image data augmentation strategy
CN110033009B (en) Method for processing image data in a connected network
CN112926564B (en) Picture analysis method, system, computer device and computer readable storage medium
CN111079739A (en) Multi-scale attention feature detection method
CN111598087B (en) Irregular character recognition method, device, computer equipment and storage medium
CN111241924B (en) Face detection and alignment method, device and storage medium based on scale estimation
CN112215856A (en) Image segmentation threshold determination method and device, terminal and storage medium
CN111680690A (en) Character recognition method and device
CN110399826B (en) End-to-end face detection and identification method
CN111461100A (en) Bill identification method and device, electronic equipment and storage medium
CN116152266A (en) Segmentation method, device and system for ultrasonic image of puncture needle
CN110992243A (en) Intervertebral disc section image construction method and device, computer equipment and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN116612280A (en) Vehicle segmentation method, device, computer equipment and computer readable storage medium
CN113963009B (en) Local self-attention image processing method and system based on deformable block
CN113657225B (en) Target detection method
CN113129298A (en) Definition recognition method of text image
CN111914846A (en) Layout data synthesis method, device and storage medium
CN114511862B (en) Form identification method and device and electronic equipment
CN115345895B (en) Image segmentation method and device for visual detection, computer equipment and medium
CN116631000A (en) Table reconstruction method based on semantic segmentation and text recognition
CN116704590A (en) Iris image correction model training method, iris image correction device and iris image correction medium
CN114283431B (en) Text detection method based on differentiable binarization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination