CN116912863A - Form identification method and device and related equipment - Google Patents


Info

Publication number
CN116912863A
CN116912863A
Authority
CN
China
Prior art keywords
image
target
target image
segmentation map
text
Prior art date
Legal status
Pending
Application number
CN202211562328.3A
Other languages
Chinese (zh)
Inventor
郑慧
贾千文
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202211562328.3A priority Critical patent/CN116912863A/en
Publication of CN116912863A publication Critical patent/CN116912863A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a form identification method, a form identification device and related equipment, relates to the technical field of artificial intelligence, and aims to solve the problem of low accuracy of form identification results. The method comprises the following steps: based on a pre-trained semantic segmentation model, acquiring a first target image, a second target image, a third target image and a fourth target image corresponding to a first form image; acquiring form structure information and text information corresponding to the first form image based on the first target image, the second target image, the third target image and the fourth target image. The first target image includes image information characterizing the horizontal frame lines of the first form image, the second target image includes image information characterizing the vertical frame lines of the first form image, the third target image includes image information characterizing the horizontal text edges of the first form image, and the fourth target image includes image information characterizing the vertical text edges of the first form image. The embodiment of the invention can improve the accuracy of the form identification result.

Description

Form identification method and device and related equipment
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for identifying a form, and related devices.
Background
A form is a special information expression structure, and is generally composed of a plurality of horizontal frame lines and a plurality of vertical frame lines, and text information can be contained in cells in the form. In sorting table files stored in unstructured digital files (e.g., in picture format), it is often necessary to identify the structure of the table as well as the text content.
In the prior art, all line segments in a table are identified directly, taking the line segment as the unit. Because both the frame lines and the characters in a table comprise various line segments, directly identifying all line segments in the table leads to significant interference between the character line segments and the table frame lines, so the accuracy of the table identification result is low.
Disclosure of Invention
The embodiment of the invention provides a form identification method, a form identification device and related equipment, which are used for solving the problem of low accuracy of a form identification result.
In a first aspect, an embodiment of the present invention provides a method for identifying a table, including:
based on a pre-trained semantic segmentation model, acquiring a first target image, a second target image, a third target image and a fourth target image corresponding to the first form image;
acquiring table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image and the fourth target image;
The first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
In a second aspect, an embodiment of the present invention further provides a form identifying apparatus, including:
the first acquisition module is used for acquiring a first target image, a second target image, a third target image and a fourth target image corresponding to the first form image based on a pre-trained semantic segmentation model;
the second acquisition module is used for acquiring table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image and the fourth target image;
the first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a program stored on the memory and executable on the processor;
the processor is configured to read a program in a memory to implement the steps in the method according to the first aspect.
In a fourth aspect, embodiments of the present application further provide a readable storage medium storing a program, which when executed by a processor implements the steps of the method according to the first aspect.
In the embodiment of the application, based on a pre-trained semantic segmentation model, a first target image, a second target image, a third target image and a fourth target image corresponding to a first table image are acquired, and based on the first target image, the second target image, the third target image and the fourth target image, table structure information and text information corresponding to the first table image are acquired. By the method, the character edges and the target frame lines are respectively subjected to feature extraction, so that the table structure information and the character information are respectively obtained, mutual interference between the frame lines and the characters in the table can be avoided, and the recognition accuracy of the table is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is one of the flowcharts of a form identification method provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a table block diagram provided by an embodiment of the present invention;
FIG. 4 is a flowchart of text information recognition provided by an embodiment of the present invention;
FIG. 5 is a second flowchart of a table identification method according to an embodiment of the present invention;
FIG. 6 is a flowchart of labeling a sample image according to an embodiment of the present invention;
FIG. 7a is a second schematic diagram of a semantic segmentation model according to an embodiment of the present invention;
FIG. 7b is a schematic diagram of the convolution stage provided in FIG. 7 a;
FIG. 7c is a schematic diagram of the upsampling phase provided in FIG. 7 a;
FIG. 8 is a flowchart of table structure information determination provided by an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a table identifying apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the application, fall within the scope of protection of the application.
The terms "first", "second" and the like in the description and the claims are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein; moreover, objects distinguished by "first" and "second" are generally of one type, and the number of objects is not limited, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Referring to fig. 1, fig. 1 is one of flowcharts of a table recognition method according to an embodiment of the present invention, as shown in fig. 1, the method specifically includes the following steps:
step 101, based on a pre-trained semantic segmentation model, a first target image, a second target image, a third target image and a fourth target image corresponding to the first table image are acquired.
Step 102, acquiring table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image and the fourth target image.
The first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
It should be understood that the first form image is an image including the first form. Illustratively, in some embodiments, the first form image is a photograph of the first form or a printed form file.
In this embodiment, the horizontal frame lines and the vertical frame lines are mainly used to distinguish between different frame lines in the table, and do not represent a limitation on the frame line direction. In some embodiments, the horizontal wire may also be referred to as a cross wire in a form or a form cross frame, and the vertical wire may also be referred to as a column wire in a form or a form column frame. Since the table is composed of the horizontal frame lines and the vertical frame lines interleaved, the table structure information can be determined based on the first target image and the second target image.
In some embodiments, the first target image includes only horizontal lines in the first form image, the second target image includes only vertical lines in the first form image, the third target image includes only horizontal edges of text in the first form image, and the fourth target image includes only vertical edges of text in the first form image. Typically, a form is composed of two parts, one part is a wire structure of the form, the wire structure of the form can divide the form into a plurality of different cells, and the other part is text information located in the cells. Because both the frame wire and the text comprise line segments, under the condition that the frame wire and the text are relatively close, interference exists between the line segments of the frame wire and the line segments of the text.
In the embodiment of the application, based on a pre-trained semantic segmentation model, a first target image, a second target image, a third target image and a fourth target image corresponding to a first table image are acquired, and based on the first target image, the second target image, the third target image and the fourth target image, table structure information and text information corresponding to the first table image are acquired. By the method, the character edges and the target frame lines are respectively subjected to feature extraction, so that the table structure information and the character information are respectively obtained, mutual interference between the frame lines and the characters in the table can be avoided, and the recognition accuracy of the table is improved.
In particular, based on the table structure information and the text information, the table files with different formats can be displayed according to actual requirements.
Optionally, in some embodiments, the step 101 includes:
rotating the first form image along a first direction by a target angle to obtain a second form image;
based on the semantic segmentation model, a first segmentation map and a second segmentation map corresponding to the first table image are obtained, and a third segmentation map and a fourth segmentation map corresponding to the second table image are obtained;
Determining a first target image, a second target image, a third target image and a fourth target image corresponding to the first table image based on the first segmentation map, the second segmentation map, the third segmentation map and the fourth segmentation map;
the first segmentation map comprises image information representing a target frame line of the first table image, the second segmentation map comprises image information representing a text edge of the first table image, the third segmentation map comprises image information representing a target frame line of the second table image, and the fourth segmentation map comprises image information representing a text edge of the second table image;
the target frame line comprises a horizontal frame line or a vertical frame line, the horizontal frame line rotates along the first direction for a target angle and then is parallel to the vertical frame line, when the target frame line is the horizontal frame line, the text edge is the text horizontal edge, and when the target frame line is the vertical frame line, the text edge is the text vertical edge.
It should be noted that in some embodiments, the horizontal frame line and the vertical frame line are perpendicular, and the target angle is 90°. Of course, depending on the actual situation, the horizontal frame line and the vertical frame line may not be strictly perpendicular, and the range of the target angle is not strictly limited herein.
Since a horizontal frame line becomes parallel to the vertical frame lines after being rotated by the target angle along the first direction, once the first table image is rotated along the first direction by the target angle to obtain the second table image, the horizontal frame lines of the first table image are parallel to the vertical frame lines in the second table image, and the vertical frame lines of the first table image are parallel to the horizontal frame lines in the second table image.
The semantic segmentation model can realize the function of edge detection. In some embodiments, the semantic segmentation model may be caused to extract features of the horizontal edges by pre-training the semantic segmentation model, while the horizontal edges are separated into horizontal frame lines and text horizontal edges. In other embodiments, the semantic segmentation model may be pre-trained to extract features of vertical edges, while the vertical edges are divided into vertical frame lines and text vertical edges.
For ease of understanding, the description will be given below taking as an example the feature that the semantic segmentation model extracts a horizontal edge.
Inputting the first table image into a semantic segmentation model, and extracting the characteristics of the horizontal edge in the first table image by the semantic segmentation model, so that the image information of the horizontal frame line of the first table image and the image information of the text horizontal edge of the first table image can be obtained.
And inputting the second table image into a semantic segmentation model, and extracting the characteristics of the horizontal edge in the second table image by the semantic segmentation model, so that the image information of the horizontal frame line of the second table image and the image information of the text horizontal edge of the second table image can be obtained.
Since the second form image is obtained by rotating the first form image by the target angle in the first direction, the image information of the horizontal frame line of the second form image is actually the image information of the vertical frame line of the first form image, and the image information of the text horizontal edge of the second form image is actually the image information of the text vertical edge of the first form image.
From this, it can be seen that, after the first form image is rotated by the target angle along the first direction to obtain the second form image, the first form image and the second form image are input into the semantic segmentation model, so that the image information of the horizontal frame line, the vertical frame line, the horizontal text edge and the vertical text edge in the first form image can be obtained.
The embodiment of extracting the features of the vertical edges by the semantic segmentation model can be referred to the description of the above embodiment, and in order to avoid repetition, the description is omitted here. In the embodiment of the semantic segmentation model extracting the features of the vertical edges, finally, the image information of the horizontal frame line, the vertical frame line, the text horizontal edge and the text vertical edge in the first table image can be obtained.
In the embodiment of the application, the semantic segmentation model only extracts the features of the horizontal edge or the features of the vertical edge. And rotating the first table image along the first direction by a target angle to obtain a second table image, namely acquiring image information of a horizontal frame line, a vertical frame line, a text horizontal edge and a text vertical edge in the first table image based on the semantic segmentation model. Compared with the condition that the semantic segmentation model is used for simultaneously extracting the features of the horizontal edge and the features of the vertical edge of the first table image, the method provided by the application can reduce the processing amount of filtering processing, improve the efficiency and accuracy of feature extraction, and further improve the efficiency of table identification.
Optionally, in some embodiments, where the target frame line is the horizontal frame line, the determining the first, second, third, and fourth target images based on the first, second, third, and fourth segmentation maps includes:
determining the first segmentation map as the first target image, determining the second segmentation map as the third target image, rotating the third segmentation map by the target angle along a second direction to obtain the second target image, and rotating the fourth segmentation map by the target angle along the second direction to obtain the fourth target image, wherein the second direction is opposite to the first direction.
Optionally, in some embodiments, where the target frame line is the vertical frame line, the determining the first, second, third, and fourth target images based on the first, second, third, and fourth segmentation maps includes:
determining the first segmentation map as the second target image, determining the second segmentation map as the fourth target image, rotating the third segmentation map by the target angle along a second direction to obtain the first target image, and rotating the fourth segmentation map by the target angle along the second direction to obtain the third target image, wherein the second direction is opposite to the first direction.
In the embodiment of the application, when the target frame line is a horizontal frame line or a vertical frame line, the first target image, the second target image, the third target image and the fourth target image can be obtained by only rotating the corresponding two of the first segmentation map, the second segmentation map, the third segmentation map and the fourth segmentation map. By rotating the image, the processing amount of filtering processing can be reduced, the flow of table processing is simplified, and the table identification efficiency is improved.
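For illustration, here is a minimal Python/OpenCV sketch of this rotate, segment, rotate-back flow for the horizontal-frame-line case; the model interface, the 90° target angle, and clockwise as the first direction are assumptions, not fixed by the patent:

```python
import cv2

def acquire_target_images(table_img, model):
    """Sketch of the rotation-based acquisition of the four target images.
    `model` is a hypothetical callable returning (frame-line map, text-edge map)
    for one input image; it extracts horizontal edges only."""
    # Second table image: first image rotated by the target angle (90 degrees).
    rotated = cv2.rotate(table_img, cv2.ROTATE_90_CLOCKWISE)

    first_seg, second_seg = model(table_img)   # horizontal lines / text horizontal edges
    third_seg, fourth_seg = model(rotated)     # same classes, but in the rotated frame

    # Rotate the rotated-frame maps back along the second (opposite) direction.
    second_target = cv2.rotate(third_seg, cv2.ROTATE_90_COUNTERCLOCKWISE)
    fourth_target = cv2.rotate(fourth_seg, cv2.ROTATE_90_COUNTERCLOCKWISE)

    # (first, second, third, fourth) target images
    return first_seg, second_target, second_seg, fourth_target
```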
Optionally, in some embodiments, the acquiring, based on the semantic segmentation model, a first segmentation map and a second segmentation map corresponding to the first table image, and acquiring a third segmentation map and a fourth segmentation map corresponding to the second table image includes:
inputting the first table image and the second table image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image;
acquiring a first segmentation map and a second segmentation map corresponding to the first table image based on the first semantic segmentation map, and acquiring a third segmentation map and a fourth segmentation map corresponding to the second table image based on the second semantic segmentation map.
The first table image is input into the semantic segmentation model, and the model outputs a first semantic segmentation map corresponding to the first table image. Each pixel (which may also be referred to as a coordinate position or pixel point) in the first semantic segmentation map corresponds to category label information, which characterizes the category of the pixel as background, target frame line or text edge.
Based on the class labeling information corresponding to each pixel, the pixels with the class as the target frame line can be extracted independently to obtain a first segmentation map corresponding to the first table image, and the pixels with the class as the text edges are extracted independently to obtain a second segmentation map corresponding to the first table image.
Illustratively, in some embodiments, the first table image is input into a semantic segmentation model resulting in a first semantic segmentation map. Each pixel in the first semantic segmentation graph corresponds to category labeling information, wherein the category labeling information is 0 and used for representing the category of the pixel is a background, the category labeling information is 1 and used for representing the category of the pixel is a target frame line, and the category labeling information is 2 and used for representing the category of the pixel is a text edge.
Pixels with category label information of 1 in the first semantic segmentation map are determined and their values are set to 255. Meanwhile, the values of the other pixels (those with category label information of 0 or 2) in the first semantic segmentation map are set to 0, so that the first segmentation map is obtained.
Pixels with category label information of 2 in the first semantic segmentation map are determined and their values are set to 255. Meanwhile, the values of the other pixels (those with category label information of 0 or 1) in the first semantic segmentation map are set to 0, so that the second segmentation map is obtained.
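A minimal sketch of this mask extraction, assuming the example labels above (0 for background, 1 for target frame line, 2 for text edge):

```python
import numpy as np

def split_segmentation(sem_map):
    """Turn a per-pixel class map into two 0/255 binary maps."""
    line_map = np.where(sem_map == 1, 255, 0).astype(np.uint8)  # first segmentation map
    text_map = np.where(sem_map == 2, 255, 0).astype(np.uint8)  # second segmentation map
    return line_map, text_map
```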
The specific flow of obtaining the third segmentation map and the fourth segmentation map based on the second semantic segmentation map may be referred to in the foregoing, and in order to avoid repetition, details are not described herein.
In the embodiment of the application, a first table image and a second table image are input into a semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image; the first segmentation map and the second segmentation map are acquired based on the first semantic segmentation map, and the third segmentation map and the fourth segmentation map are acquired based on the second semantic segmentation map. By the method, the first form image and the second form image can be subjected to semantic segmentation, the target frame line, the text edge and the background can be distinguished, the mutual interference between the text edge and the target frame line is reduced, and the accuracy of form identification is improved.
Optionally, as shown in fig. 2, in some embodiments, the semantic segmentation model includes a first downsampling network, a second downsampling network, and an upsampling network:
the first downsampling network is used for downsampling an input image in a third direction and a fourth direction to obtain a first feature image, the third direction is parallel to the target frame line, the fourth direction is perpendicular to the target frame line, the second downsampling network is used for downsampling the first feature image in the third direction to obtain a second feature image, the upsampling network is used for upsampling the second feature image to obtain an output image, and the size of the output image is identical to that of the input image;
wherein the output image comprises the first semantic segmentation map in case the input image comprises the first table image and the second semantic segmentation map in case the input image comprises the second table image.
The specific structure of the first downsampling network is not limited herein. Illustratively, in some embodiments, the first downsampling network includes a downsampling convolution layer. In other embodiments, the first downsampling network comprises at least two downsampling convolution layers connected in sequence.
The specific structure of the second downsampling network is not limited herein. Illustratively, in some embodiments, the second downsampling network includes a downsampling convolution layer. In other embodiments, the second downsampling network includes at least two downsampling convolution layers connected in sequence.
It should be noted that, the first downsampling network and the second downsampling network downsamples the input image, but the downsampling directions of the first downsampling network and the second downsampling network are different. The first downsampling network is for downsampling the input image in a third direction and a fourth direction, and the second downsampling network is for downsampling the first feature map in the third direction.
The upsampling network is used to upsample the second feature map so that the size of the output image is restored to be the same as the size of the input image. The specific structure of the upsampling network is not limited herein, and in actual use, the structure of the upsampling network may be set and adjusted according to the structures of the first downsampling network and the second downsampling network.
For ease of understanding, specific examples will be described below.
In the case where the target frame line includes a horizontal frame line, the semantic segmentation model is used to extract features of the horizontal edge. The third direction may be understood as a horizontal direction and the fourth direction as a vertical direction. The first downsampling network downsamples the input image in the horizontal direction and the vertical direction simultaneously, so that the size of the input image is reduced, and the calculation efficiency of the semantic segmentation model is improved. The second downsampling network downsamples the first feature map in the horizontal direction, meanwhile, the scale of the first feature map in the vertical direction is kept unchanged, the feature in the vertical direction is prevented from being excessively compressed, meanwhile, the receptive field in the horizontal direction is enlarged, the extraction effect of the feature in the horizontal direction is improved, and therefore the extraction performance of the semantic segmentation model on the horizontal edge is guaranteed.
In the case where the target wire comprises a vertical wire, the semantic segmentation model is used to extract features of the vertical edge. The third direction may be understood as a vertical direction and the fourth direction may be understood as a horizontal direction. The first downsampling network downsamples the input image in the horizontal direction and the vertical direction simultaneously, so that the size of the input image is reduced, and the calculation efficiency of the semantic segmentation model is improved. The second downsampling network downsamples the first feature map in the vertical direction, meanwhile, the dimension of the first feature map in the horizontal direction is kept unchanged, the feature in the horizontal direction is prevented from being excessively compressed, meanwhile, the receptive field in the vertical direction is enlarged, the extraction effect of the feature in the vertical direction is improved, and therefore the extraction performance of the semantic segmentation model on the vertical edge is guaranteed.
In an embodiment of the application, the semantic segmentation model comprises a first downsampling network, a second downsampling network and an upsampling network. Through the arrangement of the first downsampling network, the size of an input image can be reduced, the calculated amount of the semantic segmentation model is reduced, and the calculation efficiency of the semantic segmentation model is improved. Through the setting of the second downsampling network, the extraction performance of the semantic segmentation model on the characteristics of the horizontal edge or the vertical edge can be ensured. Through the first downsampling network and the second downsampling network, different sampling rates can be set for different directions, so that the extraction performance of the semantic segmentation model on the characteristics of the horizontal edge or the vertical edge is guaranteed.
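A minimal PyTorch sketch of this asymmetric-downsampling idea for the horizontal-edge case is given below; the channel widths, layer counts and bilinear upsampling head are illustrative assumptions, not the patent's exact architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricSegNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # First downsampling network: halve both axes twice.
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Second downsampling network: halve the horizontal axis only,
        # keeping the vertical scale (stride order is (height, width)).
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=(1, 2), padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=(1, 2), padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(128, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.head(self.stage2(self.stage1(x)))
        # Upsampling network: restore the output to the input size.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```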
In some embodiments, the first table image and the second table image may be input into the semantic segmentation model simultaneously. In this embodiment, the input image corresponding to the first downsampling network includes a first table image and a second table image; the output image corresponding to the up-sampling network comprises a first semantic segmentation map and a second semantic segmentation map.
In other embodiments, the first table image and the second table image may also be input into the semantic segmentation model, respectively. In this embodiment, the input image corresponding to the first downsampling network includes a first table image or a second table image; the output image corresponding to the up-sampling network comprises a first semantic segmentation map or a second semantic segmentation map.
Optionally, in some embodiments, the inputting the first table image and the second table image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image includes:
performing image amplification on the first table image and the second table image to obtain a first amplified image and a second amplified image, wherein the size of the first amplified image is the same as that of the second amplified image;
Inputting the first amplified image and the second amplified image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to a first table image and a second semantic segmentation map corresponding to a second table image.
It should be understood that the specific manner of performing image amplification on the first and second table images is not limited herein. For convenience of description, the size of the first table image is denoted as (w1, h1) and the size of the second table image as (w2, h2); the first amplified image and the second amplified image both have the size (wy, hy).
Illustratively, in some embodiments, wy is a preset value greater than w1 and w2, and hy is a preset value greater than h1 and h2. In other embodiments, wy is the maximum of w1 and w2, and hy is the maximum of h1 and h2.
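A sketch of the second variant (wy and hy taken as the maxima), assuming that amplification means padding onto a shared canvas; zero as the fill value is an assumption:

```python
import numpy as np

def amplify_to_common_size(img_a, img_b):
    hy = max(img_a.shape[0], img_b.shape[0])
    wy = max(img_a.shape[1], img_b.shape[1])

    def pad(img):
        # Place the image on an (hy, wy) canvas; the extra area stays zero.
        canvas = np.zeros((hy, wy) + img.shape[2:], dtype=img.dtype)
        canvas[:img.shape[0], :img.shape[1]] = img
        return canvas

    return pad(img_a), pad(img_b)
```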
In the embodiment of the application, the first table image and the second table image are subjected to image amplification to obtain a first amplified image and a second amplified image with the same size, and the first semantic segmentation map and the second semantic segmentation map are obtained by inputting the first amplified image and the second amplified image into the semantic segmentation model at the same time. By the method, the first semantic segmentation map and the second semantic segmentation map can be acquired at the same time, and the table recognition efficiency is improved.
Optionally, in some embodiments, the table block includes a plurality of the horizontal frame lines and a plurality of the vertical frame lines, a plurality of the horizontal frame lines and a plurality of the vertical frame lines are enclosed to form a plurality of cells, and the table structure information includes structure information of the plurality of cells;
the determining, based on the table block diagram, table structure information corresponding to the first table image includes:
determining a first serial number of each horizontal wire based on coordinate values of a plurality of horizontal wires in a pre-established rectangular coordinate system, and determining a second serial number of each vertical wire based on coordinate values of a plurality of vertical wires in the rectangular coordinate system;
the method comprises the steps of determining structural information of each cell in the cells, wherein the structural information comprises a first sub-sequence number, a second sub-sequence number, a third sub-sequence number and a fourth sub-sequence number, the first sub-sequence number and the second sub-sequence number are first sequence numbers surrounding to form a horizontal frame line of the cell, and the third sub-sequence number and the fourth sub-sequence number are second sequence numbers surrounding to form a vertical frame line of the cell.
The plurality of horizontal frame wires and the plurality of vertical frame wires are enclosed to form a plurality of cells, and for any one cell, the periphery of the cell is formed by two horizontal frame wires and two vertical frame wires, so that the position of the cell and the size of the cell can be determined by enclosing the two horizontal frame wires and the two vertical frame wires forming the cell.
For ease of understanding, a specific example will be described below.
A rectangular coordinate system is established in advance, and the first serial number of each horizontal frame line and the second serial number of each vertical frame line are determined according to their coordinate values in the rectangular coordinate system. Illustratively, the first serial number is denoted row and the second serial number is denoted col. The value of row of each horizontal frame line is set in sequence according to the coordinate value of the horizontal frame line along the longitudinal axis of the rectangular coordinate system, the values of row being 1, 2, …, n in sequence. Similarly, according to the coordinate values of the vertical frame lines along the transverse axis of the rectangular coordinate system, the value of col of each vertical frame line is set in sequence, the values of col being 1, 2, …, m in sequence, where n and m are both positive integers.
For convenience of description, the structure information of a cell is denoted as (span_cols, span_rows), where span_cols = [col_left, col_right] and span_rows = [row_top, row_bottom]. col_left represents the second serial number of the vertical frame line on the left side of the cell, col_right the second serial number of the vertical frame line on the right side, row_top the first serial number of the horizontal frame line on the upper side, and row_bottom the first serial number of the horizontal frame line on the lower side. In this embodiment, row_top and row_bottom can be understood as the first sub-sequence number and the second sub-sequence number, and col_left and col_right as the third sub-sequence number and the fourth sub-sequence number.
Referring to fig. 3, cell A and cell B are taken as examples. The structure information of cell A is: span_cols = [1, 2], span_rows = [3, 2]. The structure information of cell B is: span_cols = [2, 3], span_rows = [3, 1]. From the structure information of cell A and cell B, it can be seen that cell A is located on the left side of cell B, the size of cell A is smaller than that of cell B, and cell B may be a merged cell.
By the method, the relative positions and the sizes of the cells in the table can be determined through the structural information of the cells, and the cells with different sizes can be identified, so that the accuracy and the convenience of identifying the table structure are improved.
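As a data-structure illustration only, the (span_cols, span_rows) record could be held as follows; the is_merged helper is a hypothetical addition, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class CellInfo:
    span_cols: tuple  # (col_left, col_right)
    span_rows: tuple  # (row_top, row_bottom)

# The two example cells from fig. 3:
cell_a = CellInfo(span_cols=(1, 2), span_rows=(3, 2))
cell_b = CellInfo(span_cols=(2, 3), span_rows=(3, 1))

def is_merged(cell: CellInfo) -> bool:
    # A cell whose enclosing frame-line serial numbers differ by more than one
    # spans several basic rows or columns, i.e. it is a merged cell.
    return (abs(cell.span_cols[1] - cell.span_cols[0]) > 1
            or abs(cell.span_rows[1] - cell.span_rows[0]) > 1)

assert not is_merged(cell_a) and is_merged(cell_b)
```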
Optionally, in some embodiments, the step 102 includes:
combining the first target image and the second target image to obtain a table block diagram;
determining table structure information corresponding to the first table image based on the table block diagram;
and determining text information corresponding to the first table image based on the table structure information, the third target image and the fourth target image.
The table block diagram is obtained by combining the first target image and the second target image, so that the table block diagram comprises information in the first target image and information in the second target image. Specifically, the table block includes horizontal frame lines in the first table image and vertical frame lines in the first table image, and a plurality of cells can be divided by interleaving the horizontal frame lines and the vertical frame lines.
The table structure information corresponding to the first table image may be determined based on the table block diagram. The specific content of the table structure information is not limited herein. Illustratively, in some embodiments, the table structure information includes the number of horizontal wires, the number of vertical wires, the number of cells, the size of each cell, and the like.
In the embodiment of the application, the first target image and the second target image are combined to obtain a table block diagram; determining table structure information corresponding to the first table image based on the table block diagram; and determining the text information corresponding to the first table image based on the table structure information, the third target image and the fourth target image. Through the arrangement, after the table structure information is obtained, the text information is determined based on the table structure information, so that the convenience in positioning the text information can be improved, and the accuracy of the text information can be improved.
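Assuming the first and second target images are equal-size 0/255 binary maps, the merge is a simple union, e.g.:

```python
import cv2

def build_table_diagram(first_target, second_target):
    # Union of the horizontal-line map and the vertical-line map.
    return cv2.bitwise_or(first_target, second_target)
```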
Optionally, in some embodiments, the determining text information corresponding to the first table image based on the table structure information, the third target image and the fourth target image includes:
combining the third target image and the fourth target image to obtain a character feature map, wherein the character feature map comprises a character area;
acquiring a text field area based on the table structure information and coordinate values of the text area in the rectangular coordinate system, wherein the text field area comprises text areas which are positioned in the same cell and positioned in the same row;
acquiring a text field image corresponding to the text field region from the first table image based on the coordinate value of the text field region in the rectangular coordinate system;
and inputting the text field image into a pre-trained character recognition model to obtain a character recognition result, wherein the character information comprises the character recognition result and coordinate values of the text field region in the rectangular coordinate system.
It should be understood that the text region may be divided into a plurality of text sub-regions, each within a cell, based on the table structure information and the coordinate values of the text region. Further, based on the coordinate values of the text sub-regions, a plurality of lines of text located in the same cell may be divided into text field regions.
Based on the coordinate values of a text field region, the text field image corresponding to those coordinate values is determined in the first table image, and the text field image is input into a pre-trained character recognition model to obtain a character recognition result.
For ease of understanding, the following will exemplify. Please refer to fig. 3 and fig. 4.
The third target image and the fourth target image are combined to obtain a character feature map, which comprises a text region. By comparing the coordinate values of the text region in the rectangular coordinate system with the table structure information obtained according to fig. 3, it can be determined that cell B and cell C each include a partial text region.
As can be seen from the coordinate values of the text regions located in the cell B, the cell B contains two rows of text, so that the text regions located in the cell B can be further split to obtain the text field region 1 and the text field region 2. As can be seen from the coordinate values of the text field region located in the cell C, only one line of text is contained in the cell C, and thus the text field region 3 can be obtained.
Based on the coordinate values of text field region 1, text field region 2 and text field region 3, the regions with the same coordinate values are searched in the first table image, thereby obtaining text field image 1 corresponding to text field region 1, text field image 2 corresponding to text field region 2, and text field image 3 corresponding to text field region 3.
And inputting the text field image 1, the text field image 2 and the text field image 3 into a pre-trained character recognition model to obtain corresponding character recognition results respectively.
In the embodiment of the application, a text field area is obtained based on table structure information and coordinate values of the text area in a rectangular coordinate system, the text field area comprises text areas which are positioned in the same unit cell and positioned in the same row, and a text field image corresponding to the text field area is obtained from a first table image based on the coordinate values of the text field area in the rectangular coordinate system; and inputting the text field image into a pre-trained character recognition model to obtain a character recognition result. By the method, the text detection function is realized by combining the table structure information, the processing efficiency can be improved, the mutual interference between the table frame lines and the text information is reduced, and the accuracy of table identification is improved.
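A sketch of the crop-and-recognize step; `recognizer` stands in for the pre-trained character recognition model (e.g. a CRNN), and its call signature is an assumption:

```python
def recognize_text_fields(table_img, field_boxes, recognizer):
    results = []
    for (x0, y0, x1, y1) in field_boxes:   # rectangular coordinates of one text field region
        crop = table_img[y0:y1, x0:x1]     # text field image taken from the first table image
        text = recognizer(crop)            # character recognition result
        results.append({"box": (x0, y0, x1, y1), "text": text})
    return results
```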
The following describes a specific flow of the form identification method provided in the present application by taking a specific embodiment as an example. Please refer to fig. 5-8. FIG. 5 is a second flowchart of a table identifying method according to an embodiment of the present application.
The semantic segmentation model is trained in advance. In some embodiments, as shown in fig. 6, an edge detection operator in the related art (for example, the Canny algorithm) is used to detect the horizontal edges of a sample image, data pre-labeling is achieved through horizontal filtering, and the pre-labeling result is finally checked manually to obtain the final labeled sample image. Sample images are labeled in this way to construct a training data set, and the training data set is used to train the semantic segmentation model to obtain the final semantic segmentation model. This approach improves labeling efficiency and enlarges the data set. In addition, in a specific implementation, to improve model generalization, various table styles can be supplemented through data generation; the specific manner is not described herein.
In this embodiment, the structure of the semantic segmentation model is shown in figs. 7a-7c. Fig. 7b is a schematic diagram of the convolution stage (conv stage), and fig. 7c is a schematic diagram of the upsampling stage (upconv stage). The input image size of the semantic segmentation model is 768×768, and an edge segmentation result line_mask is extracted from a depth feature map restored to 1/1 of the original size. line_mask is derived from a 3-channel output of size 768×768, and the value line_mask[i,j] of each position (also called a coordinate position or pixel point) in the corresponding semantic segmentation map is obtained by taking the argmax of the class probability distribution of that position. A value line_mask[i,j] = 0 represents that the position is background, 1 represents a horizontal frame line, and 2 represents a text horizontal edge.
In order to ensure the extraction performance of the model on horizontal edges, the size of the convolution kernel is modified so that, after being downsampled twice in the y direction in the first downsampling network, the input image is downsampled only in the x direction in the second downsampling network, while the scale of the feature map in the y direction remains unchanged.
Referring to fig. 8, the first table image is rotated by 90° to obtain a second table image, and the first table image and the second table image are subjected to scale expansion to obtain a first amplified image and a second amplified image. The size of the first table image is denoted as (w1, h1) and the size of the second table image as (w2, h2); the first amplified image and the second amplified image both have the size (wy, hy), where wy equals hy and is the maximum of w1, w2, h1 and h2.
The first amplified image and the second amplified image are input into the semantic segmentation model shown in fig. 7a to obtain a first semantic segmentation map and a second semantic segmentation map. For the first semantic segmentation map, the positions where line_mask[i,j] = 1 are extracted, the values of those positions are set to 255, and the values of the other positions are set to 0, to obtain the first segmentation map; for the second semantic segmentation map, the positions where line_mask[i,j] = 1 are extracted, their values are set to 255, and the values of the other positions are set to 0, to obtain the third segmentation map.
The first segmentation map is a binary map in which each connected domain represents a horizontal frame line in the first table image. The average ordinate of each connected domain in a pre-established rectangular coordinate system is obtained, and the connected domains are ordered by ordinate; the resulting sequence number is the first serial number of the horizontal frame line, recorded as row. The third segmentation map is likewise a binary map in which each connected domain represents a vertical frame line in the first table image. The average abscissa of each connected domain in the rectangular coordinate system is obtained, and the connected domains are ordered by abscissa; the resulting sequence number is the second serial number of the vertical frame line, recorded as col.
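This numbering step can be sketched with OpenCV connected components; sorting by mean ordinate (or abscissa) reproduces the row (or col) serial numbers:

```python
import cv2
import numpy as np

def number_frame_lines(binary_map, axis):
    """axis=0: sort by mean y (row numbers for horizontal lines);
    axis=1: sort by mean x (col numbers for vertical lines)."""
    n, labels = cv2.connectedComponents((binary_map > 0).astype(np.uint8))
    centers = []
    for lab in range(1, n):                # label 0 is the background
        ys, xs = np.nonzero(labels == lab)
        centers.append((lab, ys.mean() if axis == 0 else xs.mean()))
    centers.sort(key=lambda t: t[1])
    return {lab: i + 1 for i, (lab, _) in enumerate(centers)}  # serial numbers 1..n-1
```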
The third segmentation map is rotated by -90°, and the union of the first segmentation map and the rotated third segmentation map is taken to obtain the table block diagram. The inner contours of the table block diagram are looked up to obtain the position of each cell and the range of serial numbers of the horizontal and vertical frame lines spanned by the cell, yielding the table structure information (also called cell information). The cells are ordered based on the cell information, and html structured information is output.
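The union and inner-contour lookup might look as follows; using RETR_CCOMP so that hole contours (the cells) carry a parent index is one possible realization, not necessarily the patent's:

```python
import cv2

def extract_cells(table_diagram):
    """table_diagram: 0/255 binary union of the line maps."""
    contours, hierarchy = cv2.findContours(
        table_diagram, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    cells = []
    for cnt, h in zip(contours, hierarchy[0]):
        if h[3] != -1:                            # has a parent -> inner (hole) contour
            cells.append(cv2.boundingRect(cnt))   # (x, y, w, h) of one cell
    return cells
```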
For the first semantic segmentation map, the positions where line_mask[i,j] = 2 are extracted, the values of those positions are set to 255, and the values of the other positions are set to 0, to obtain the second segmentation map; for the second semantic segmentation map, the positions where line_mask[i,j] = 2 are extracted, their values are set to 255, and the values of the other positions are set to 0, to obtain the fourth segmentation map.
The second segmentation map and the fourth segmentation map are binary maps, and they are combined to obtain a character feature map. A horizontal projection of the binary map is performed on each cell region to obtain the text field regions; a circumscribed rectangle is then obtained for each text field region, together with the coordinate values of the circumscribed rectangle. The corresponding text field image is acquired from the first table image based on the coordinate values of the circumscribed rectangle, and the text field image is recognized based on a convolutional recurrent neural network (Convolutional Recurrent Neural Network, CRNN) to obtain a character recognition result. The text information comprises the character recognition result and the coordinate values of the text field region in the rectangular coordinate system. After the character recognition result is obtained, the characters can be output in a structured way, that is, table-form text information is output by combining the table structure information and the character recognition result.
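The horizontal projection that splits a cell's text into text field regions can be sketched as:

```python
import numpy as np

def split_text_lines(cell_mask):
    """cell_mask: binary text-feature map cropped to one cell.
    Returns (first_row, last_row) of each run of non-empty rows."""
    proj = (cell_mask > 0).sum(axis=1)             # foreground count per row
    rows = np.nonzero(proj)[0]
    if rows.size == 0:
        return []
    breaks = np.nonzero(np.diff(rows) > 1)[0] + 1  # split where there is a row gap
    return [(run[0], run[-1]) for run in np.split(rows, breaks)]
```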
In some embodiments, on the basis of public data sets, more fonts and richer corpus information are obtained through a generation algorithm to supplement the data, so that a printed-text character recognition model with better performance is trained.
In this embodiment, for a form image with frame lines, the semantic segmentation model can extract horizontal edge features while distinguishing the two categories of horizontal frame lines and text horizontal edges. Meanwhile, considering that the semantic segmentation model is used to detect horizontal edge features, different sampling rates are designed for the horizontal and vertical directions during downsampling, which ensures the accuracy of horizontal edge detection.
The embodiment of the invention also provides a table identification device. Referring to fig. 9, fig. 9 is a block diagram of a table identifying apparatus according to an embodiment of the present invention. Since the principle of the table recognition device for solving the problem is similar to that of the table recognition method in the embodiment of the present invention, the implementation of the table recognition device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 9, the table identifying apparatus 900 includes:
the first obtaining module 901 is configured to obtain a first target image, a second target image, a third target image, and a fourth target image corresponding to the first table image based on a pre-trained semantic segmentation model;
A second obtaining module 902, configured to obtain table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image, and the fourth target image;
the first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
Optionally, the second obtaining module 902 includes:
the merging processing unit is used for merging the first target image and the second target image to obtain a table block diagram;
a first determining unit, configured to determine, based on the table block diagram, table structure information corresponding to the first table image;
a second determining unit, configured to determine text information corresponding to the first table image based on the table structure information, the third target image and the fourth target image.
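A minimal sketch of the merging performed by the merging processing unit, assuming the two target images are equal-sized binary masks; the function and variable names are illustrative, and the closing step is an optional robustness tweak rather than part of the embodiment.

    import cv2

    def merge_line_masks(h_line_mask, v_line_mask):
        """Union of the horizontal and vertical frame-line masks yields
        the table block diagram; a small closing bridges pixel gaps at
        line intersections."""
        grid = cv2.bitwise_or(h_line_mask, v_line_mask)
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        return cv2.morphologyEx(grid, cv2.MORPH_CLOSE, kernel)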
Optionally, the first obtaining module 901 includes:
a rotating unit, configured to rotate the first table image by a target angle along a first direction to obtain a second table image;
an obtaining unit, configured to obtain, based on the semantic segmentation model, a first segmentation map and a second segmentation map corresponding to the first table image, and a third segmentation map and a fourth segmentation map corresponding to the second table image;
a third determining unit, configured to determine a first target image, a second target image, a third target image, and a fourth target image corresponding to the first table image based on the first segmentation map, the second segmentation map, the third segmentation map, and the fourth segmentation map;
the first segmentation map comprises image information representing a target frame line of the first table image, the second segmentation map comprises image information representing a text edge of the first table image, the third segmentation map comprises image information representing a target frame line of the second table image, and the fourth segmentation map comprises image information representing a text edge of the second table image;
the target frame line comprises a horizontal frame line or a vertical frame line, and the horizontal frame line, after being rotated by the target angle along the first direction, is parallel to the vertical frame line; when the target frame line is the horizontal frame line, the text edge is the text horizontal edge, and when the target frame line is the vertical frame line, the text edge is the text vertical edge.
Optionally, in the case that the target frame line is the horizontal frame line, the third determining unit is specifically configured to:
determining the first segmentation map as the first target image, determining the second segmentation map as the third target image, rotating the third segmentation map by the target angle along a second direction to obtain the second target image, and rotating the fourth segmentation map by the target angle along the second direction to obtain the fourth target image, wherein the second direction is opposite to the first direction.
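The whole rotation scheme can be sketched as follows, assuming the target angle is 90 degrees and that the model returns a (frame-line map, text-edge map) pair for horizontal structure; all names are illustrative.

    import numpy as np

    def segment_both_orientations(table_image, model):
        """Run a horizontal-structure segmentation model on the original
        and the rotated image, then rotate the latter's results back."""
        # First direction: rotate so vertical frame lines become horizontal.
        rotated = np.rot90(table_image, k=1)

        h_lines, h_text_edges = model(table_image)  # first / second segmentation maps
        r_lines, r_text_edges = model(rotated)      # third / fourth segmentation maps

        # Second direction (the opposite rotation) maps the results back.
        v_lines = np.rot90(r_lines, k=-1)            # second target image
        v_text_edges = np.rot90(r_text_edges, k=-1)  # fourth target image
        return h_lines, v_lines, h_text_edges, v_text_edges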
Optionally, the acquiring unit includes:
a first input subunit, configured to input the first table image and the second table image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image;
a first acquisition subunit, configured to acquire a first segmentation map and a second segmentation map corresponding to the first table image based on the first semantic segmentation map, and acquire a third segmentation map and a fourth segmentation map corresponding to the second table image based on the second semantic segmentation map.
Optionally, the semantic segmentation model includes a first downsampling network, a second downsampling network, and an upsampling network:
the first downsampling network is used for downsampling an input image in a third direction and a fourth direction to obtain a first feature image, where the third direction is parallel to the target frame line and the fourth direction is perpendicular to the target frame line; the second downsampling network is used for downsampling the first feature image in the third direction to obtain a second feature image; the upsampling network is used for upsampling the second feature image to obtain an output image, and the size of the output image is identical to that of the input image;
wherein the output image comprises the first semantic segmentation map in case the input image comprises the first table image and the second semantic segmentation map in case the input image comprises the second table image.
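A minimal PyTorch sketch of such a model, assuming the target frame line is horizontal so that the third direction is the width axis and the fourth direction is the height axis; the channel widths, strides and class count are illustrative, and the input height and width are assumed divisible by 4.

    import torch.nn as nn

    class AnisotropicSegNet(nn.Module):
        def __init__(self, classes=3):  # e.g. background / frame line / text edge
            super().__init__()
            # First downsampling network: stride 2 in both directions.
            self.down_both = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
            # Second downsampling network: stride (1, 2) downsamples only
            # along the width, preserving vertical resolution for
            # accurate horizontal-edge detection.
            self.down_width = nn.Sequential(
                nn.Conv2d(32, 64, 3, stride=(1, 2), padding=1), nn.ReLU())
            # Upsampling network: restore the input size.
            self.up = nn.Sequential(
                nn.ConvTranspose2d(64, 32, kernel_size=(1, 2), stride=(1, 2)),
                nn.ReLU(),
                nn.ConvTranspose2d(32, classes, kernel_size=2, stride=2))

        def forward(self, x):  # x: (N, 3, H, W) -> logits: (N, classes, H, W)
            return self.up(self.down_width(self.down_both(x)))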
Optionally, the first input subunit is specifically configured to:
performing image amplification on the first table image and the second table image to obtain a first amplified image and a second amplified image, wherein the size of the first amplified image is the same as that of the second amplified image;
inputting the first amplified image and the second amplified image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image.
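One plausible reading of the image amplification is padding both images onto a shared canvas so they can be batched together; the sketch below follows that assumption and is not necessarily the embodiment's exact operation.

    import numpy as np

    def pad_to_common_size(img_a, img_b, fill=255):
        """Pad both images with white to the maximum height and width so
        the model can process them as one batch without distortion."""
        H = max(img_a.shape[0], img_b.shape[0])
        W = max(img_a.shape[1], img_b.shape[1])

        def pad(img):
            out = np.full((H, W) + img.shape[2:], fill, dtype=img.dtype)
            out[:img.shape[0], :img.shape[1]] = img
            return out

        return pad(img_a), pad(img_b)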
Optionally, the table block diagram includes a plurality of horizontal frame lines and a plurality of vertical frame lines, the plurality of horizontal frame lines and the plurality of vertical frame lines enclose a plurality of cells, and the table structure information includes structure information of the plurality of cells;
the first determination unit includes:
a first determining subunit, configured to determine a first serial number of each horizontal frame line based on coordinate values of the plurality of horizontal frame lines in a pre-established rectangular coordinate system, and determine a second serial number of each vertical frame line based on coordinate values of the plurality of vertical frame lines in the rectangular coordinate system;
a second determining subunit, configured to determine structure information of each of the plurality of cells, where the structure information includes a first sub-serial number, a second sub-serial number, a third sub-serial number and a fourth sub-serial number; the first sub-serial number and the second sub-serial number are the first serial numbers of the horizontal frame lines enclosing the cell, and the third sub-serial number and the fourth sub-serial number are the second serial numbers of the vertical frame lines enclosing the cell.
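The serial-number bookkeeping can be sketched as follows; the (x, y, w, h) box format is an assumed input representation.

    def index_frame_lines(h_lines, v_lines):
        """Assign serial numbers by sorting each family of frame lines
        along the axis perpendicular to it."""
        first_serials = {n: line for n, line in
                         enumerate(sorted(h_lines, key=lambda b: b[1]))}  # top to bottom
        second_serials = {n: line for n, line in
                          enumerate(sorted(v_lines, key=lambda b: b[0]))}  # left to right
        return first_serials, second_serials

    # A cell bounded by horizontal lines with first serial numbers i and
    # i + 1 and vertical lines with second serial numbers j and j + 1 then
    # carries the structure information (i, i + 1, j, j + 1), i.e. its
    # four sub-serial numbers.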
Optionally, the second determining unit includes:
a merging subunit, configured to merge the third target image and the fourth target image to obtain a character feature map, where the character feature map includes a character area;
a second obtaining subunit, configured to obtain a text field area based on the table structure information and coordinate values of the character area in the rectangular coordinate system, where the text field area includes character areas located in the same cell and in the same row;
a third obtaining subunit, configured to obtain, from the first table image, a text field image corresponding to the text field area based on coordinate values of the text field area in the rectangular coordinate system;
a second input subunit, configured to input the text field image into a pre-trained character recognition model to obtain a character recognition result, where the text information comprises the character recognition result and the coordinate values of the text field area in the rectangular coordinate system.
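A sketch of the grouping performed by the second obtaining subunit, assuming character areas arrive as (x, y, w, h) boxes and a helper cell_of() that maps a box to its cell identifier (both are assumptions):

    from collections import defaultdict

    def group_text_fields(char_boxes, cell_of, row_tol=5):
        """Merge character areas that share a cell and lie on the same
        row (within row_tol pixels) into one text field box."""
        buckets = defaultdict(list)
        for box in char_boxes:              # box = (x, y, w, h)
            buckets[cell_of(box)].append(box)

        fields = []
        for boxes in buckets.values():
            boxes.sort(key=lambda b: b[1])  # top to bottom within the cell
            row = [boxes[0]]
            for b in boxes[1:]:
                if abs(b[1] - row[-1][1]) <= row_tol:
                    row.append(b)           # same row of the same cell
                else:
                    fields.append(_bounding(row))
                    row = [b]
            fields.append(_bounding(row))
        return fields

    def _bounding(row):
        """Circumscribed rectangle of a row of boxes."""
        x0 = min(b[0] for b in row)
        y0 = min(b[1] for b in row)
        x1 = max(b[0] + b[2] for b in row)
        y1 = max(b[1] + b[3] for b in row)
        return (x0, y0, x1 - x0, y1 - y0)

Each resulting field box is then used to crop the text field image from the first table image and passed to the character recognition model.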
The table identifying apparatus 900 provided in the embodiment of the present invention can perform the above method embodiment; its implementation principle and technical effects are similar, and are not repeated here.
As shown in fig. 10, the embodiment of the present invention further provides an electronic device 1000, including a processor 1001, a memory 1002, and a program or instruction stored in the memory 1002 and executable on the processor 1001, where the program or instruction, when executed by the processor 1001, implements each process of the method embodiment shown in fig. 1 and achieves the same technical effects; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a readable storage medium storing a program or instruction which, when executed by a processor, implements each process of the method embodiment shown in fig. 1 and achieves the same technical effects; to avoid repetition, details are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices or units, and may be electrical, mechanical or in other forms.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist physically separately, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated units implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.

Claims (12)

1. A form identification method, comprising:
based on a pre-trained semantic segmentation model, acquiring a first target image, a second target image, a third target image and a fourth target image corresponding to the first form image;
acquiring table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image and the fourth target image;
the first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
2. The method of claim 1, wherein the obtaining the table structure information and the text information corresponding to the first table image based on the first target image, the second target image, the third target image, and the fourth target image includes:
combining the first target image and the second target image to obtain a table block diagram;
determining table structure information corresponding to the first table image based on the table block diagram;
and determining text information corresponding to the first table image based on the table structure information, the third target image and the fourth target image.
3. The method of claim 1, wherein the acquiring the first, second, third, and fourth target images corresponding to the first table image based on the pre-trained semantic segmentation model comprises:
rotating the first form image along a first direction by a target angle to obtain a second form image;
acquiring, based on the semantic segmentation model, a first segmentation map and a second segmentation map corresponding to the first table image, and a third segmentation map and a fourth segmentation map corresponding to the second table image;
determining a first target image, a second target image, a third target image and a fourth target image corresponding to the first table image based on the first segmentation map, the second segmentation map, the third segmentation map and the fourth segmentation map;
the first segmentation map comprises image information representing a target frame line of the first table image, the second segmentation map comprises image information representing a text edge of the first table image, the third segmentation map comprises image information representing a target frame line of the second table image, and the fourth segmentation map comprises image information representing a text edge of the second table image;
the target frame line comprises a horizontal frame line or a vertical frame line, and the horizontal frame line, after being rotated by the target angle along the first direction, is parallel to the vertical frame line; when the target frame line is the horizontal frame line, the text edge is the text horizontal edge, and when the target frame line is the vertical frame line, the text edge is the text vertical edge.
4. The method of claim 3, wherein, in the case where the target frame line is the horizontal frame line, the determining the first, second, third, and fourth target images based on the first, second, third, and fourth segmentation maps comprises:
determining the first segmentation map as the first target image, determining the second segmentation map as the third target image, rotating the third segmentation map by the target angle along a second direction to obtain the second target image, and rotating the fourth segmentation map by the target angle along the second direction to obtain the fourth target image, wherein the second direction is opposite to the first direction.
5. The method of claim 3, wherein the obtaining a first segmentation map and a second segmentation map corresponding to the first table image and a third segmentation map and a fourth segmentation map corresponding to the second table image based on the semantic segmentation model comprises:
inputting the first table image and the second table image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image;
acquiring a first segmentation map and a second segmentation map corresponding to the first table image based on the first semantic segmentation map, and acquiring a third segmentation map and a fourth segmentation map corresponding to the second table image based on the second semantic segmentation map.
6. The method of claim 5, wherein the semantic segmentation model comprises a first downsampling network, a second downsampling network, and an upsampling network:
the first downsampling network is used for downsampling an input image in a third direction and a fourth direction to obtain a first feature image, where the third direction is parallel to the target frame line and the fourth direction is perpendicular to the target frame line; the second downsampling network is used for downsampling the first feature image in the third direction to obtain a second feature image; the upsampling network is used for upsampling the second feature image to obtain an output image, and the size of the output image is identical to that of the input image;
wherein the output image comprises the first semantic segmentation map in case the input image comprises the first table image, and the second semantic segmentation map in case the input image comprises the second table image.
7. The method of claim 5, wherein the inputting the first table image and the second table image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image comprises:
performing image amplification on the first table image and the second table image to obtain a first amplified image and a second amplified image, wherein the size of the first amplified image is the same as that of the second amplified image;
inputting the first amplified image and the second amplified image into the semantic segmentation model to obtain a first semantic segmentation map corresponding to the first table image and a second semantic segmentation map corresponding to the second table image.
8. The method of claim 2, wherein the table block diagram comprises a plurality of the horizontal frame lines and a plurality of the vertical frame lines, the plurality of horizontal frame lines and the plurality of vertical frame lines enclose a plurality of cells, and the table structure information comprises structure information of the plurality of cells;
The determining, based on the table block diagram, table structure information corresponding to the first table image includes:
determining a first serial number of each of the horizontal frame lines based on coordinate values of the plurality of horizontal frame lines in a pre-established rectangular coordinate system, and determining a second serial number of each of the vertical frame lines based on coordinate values of the plurality of vertical frame lines in the rectangular coordinate system;
determining structure information of each of the plurality of cells, wherein the structure information comprises a first sub-serial number, a second sub-serial number, a third sub-serial number and a fourth sub-serial number, the first sub-serial number and the second sub-serial number being the first serial numbers of the horizontal frame lines enclosing the cell, and the third sub-serial number and the fourth sub-serial number being the second serial numbers of the vertical frame lines enclosing the cell.
9. The method of claim 8, wherein the determining text information corresponding to the first table image based on the table structure information, the third target image, and the fourth target image comprises:
combining the third target image and the fourth target image to obtain a character feature map, wherein the character feature map comprises a character area;
acquiring a text field area based on the table structure information and coordinate values of the character area in the rectangular coordinate system, wherein the text field area comprises character areas located in the same cell and in the same row;
acquiring a text field image corresponding to the text field region from the first table image based on the coordinate value of the text field region in the rectangular coordinate system;
and inputting the text field image into a pre-trained character recognition model to obtain a character recognition result, wherein the character information comprises the character recognition result and coordinate values of the text field region in the rectangular coordinate system.
10. A form identification device, comprising:
the first acquisition module is used for acquiring a first target image, a second target image, a third target image and a fourth target image corresponding to the first form image based on a pre-trained semantic segmentation model;
the second acquisition module is used for acquiring table structure information and text information corresponding to the first table image based on the first target image, the second target image, the third target image and the fourth target image;
The first target image comprises image information representing a horizontal frame line of the first table image, the second target image comprises image information representing a vertical frame line of the first table image, the third target image comprises image information representing a text horizontal edge of the first table image, and the fourth target image comprises image information representing a text vertical edge of the first table image.
11. An electronic device, comprising: a memory, a processor, and a program stored on the memory and executable on the processor; wherein
the processor is configured to read the program in the memory to implement the steps in the method according to any one of claims 1 to 9.
12. A readable storage medium storing a program, wherein the program when executed by a processor implements the steps of the method according to any one of claims 1 to 9.
CN202211562328.3A 2022-12-07 2022-12-07 Form identification method and device and related equipment Pending CN116912863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211562328.3A CN116912863A (en) 2022-12-07 2022-12-07 Form identification method and device and related equipment

Publications (1)

Publication Number Publication Date
CN116912863A true CN116912863A (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination