CN115116080A - Table analysis method and device, electronic equipment and storage medium - Google Patents


Publication number: CN115116080A
Authority: CN (China)
Prior art keywords: information, detection box, text detection, sub-feature
Legal status: Pending
Application number: CN202210781115.3A
Other languages: Chinese (zh)
Inventors: 于海鹏, 李煜林, 杨夏浛, 钦夏孟, 姚锟
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority application: CN202210781115.3A
Publication of CN115116080A


Classifications

    • G06V 30/414: Analysis of document content; extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text
    • G06F 40/18: Handling natural language data; editing of tables in spreadsheets, e.g. inserting or deleting, using ruled lines
    • G06T 9/002: Image coding using neural networks
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 30/147: Character recognition; determination of region of interest during image acquisition
    • G06V 30/18: Character recognition; extraction of features or characteristics of the image

Abstract

The application discloses a table parsing method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and applicable to scenes such as optical character recognition (OCR). The implementation scheme is as follows: acquiring a form image to be processed; encoding the form image to obtain a feature map corresponding to the form image; decoding the feature map to obtain the first position information of each text detection box in the form image and the text content in each text detection box; determining the first row information and column information corresponding to each text detection box according to the feature map and the first position information; and generating an analysis result corresponding to the form image according to the text content, the first row information, and the column information. The method realizes parsing of both the table content and the table's row and column information, thereby realizing structured parsing of the form image.

Description

Table analysis method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision; it can be applied to scenes such as optical character recognition (OCR), and specifically relates to a table parsing method and apparatus, an electronic device, and a storage medium.
Background
Documents are an important means of storing information, and tables are the most common and intuitive form of information organization within documents, containing a great deal of a user's structured information. Acquiring this structured information helps build large databases for storing and managing data.
In the related art, documents are often obtained in image form, and how to parse the required structured information from a table in such a document image is a technical problem that urgently needs to be solved.
Disclosure of Invention
The application provides a table parsing method and device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a table parsing method, including:
acquiring a form image to be processed;
encoding the form image to obtain a feature map corresponding to the form image;
decoding the feature map to acquire first position information of each text detection box in the form image and text content in each text detection box;
determining first row information and column information corresponding to the text detection box according to the feature map and the first position information;
and generating an analysis result corresponding to the table image according to the text content, the first row information and the column information.
According to another aspect of the present application, there is provided a table parsing apparatus including:
the first acquisition module is used for acquiring a form image to be processed;
the second acquisition module is used for encoding the form image to acquire a feature map corresponding to the form image;
a third obtaining module, configured to decode the feature map to obtain first position information of each text detection box in the form image and text content in each text detection box;
the determining module is used for determining first row information and column information corresponding to the text detection box according to the feature map and the first position information;
and the generating module is used for generating an analysis result corresponding to the form image according to the text content, the first row information and the column information.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the above-described embodiments.
According to another aspect of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of the above embodiments.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flow chart of a table parsing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a table parsing method according to another embodiment of the present application;
fig. 3 is a schematic flowchart of a table parsing method according to another embodiment of the present application;
fig. 4 is a schematic flowchart of a table parsing method according to another embodiment of the present application;
FIG. 5 is a table parsing process diagram according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a table parsing apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device for implementing a table parsing method according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Artificial intelligence is the discipline that studies the use of computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, and knowledge graph technologies.
Deep learning is a new research direction in the field of machine learning. It learns the intrinsic laws and levels of representation of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images, and sound. Its ultimate goal is to give machines a human-like ability to analyze and learn, recognizing data such as text, images, and sound.
Computer vision is the science of studying how to make machines "see": cameras and computers are used in place of human eyes to identify, track, and measure targets, and the resulting images are further processed so that they become more suitable for human observation or for transmission to instruments for detection.
Table parsing methods, apparatuses, electronic devices, and storage media according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a table parsing method according to an embodiment of the present application.
The table parsing method of the embodiments of the present application may be executed by the table parsing apparatus of the embodiments of the present application. The apparatus may be configured in an electronic device so as to obtain the first position information of each text detection box and the text content in each text detection box from a feature map, and to determine the first row information and column information corresponding to each text detection box according to the first position information and the feature map of the form image, thereby realizing structured parsing of the form image.
The electronic device may be any device with computing capability, for example, a personal computer, a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and the like.
As shown in fig. 1, the table parsing method includes:
Step 101, obtaining a form image to be processed.
A form image is an image containing a table. The form image may be obtained directly by photographing a paper table, a table displayed on a terminal, or the like, or it may be extracted from a document image; this is not limited in the present application.
Step 102, encoding the form image to acquire the feature map corresponding to the form image.
In the present application, a neural network model may be used to encode the form image to obtain the feature map corresponding to the form image.
In implementation, text detection and text recognition of the form image may be completed by two models: a text detection model first performs text detection on the form image, and a model dedicated to text recognition then performs recognition based on the detection result. In this case, the form image may be input into the text detection model, which encodes the form image to obtain the feature map.
Alternatively, text detection and text recognition of the form image may be completed by a single model. The form image may then be input into that text recognition model, which encodes the form image to obtain the corresponding feature map.
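As an illustrative sketch only (the patent does not prescribe a network architecture), the encoding step can be pictured as a small convolutional network whose output feature map keeps the spatial size of the input image, matching the example used later in this description; every layer size below is an assumption:

```python
import torch
import torch.nn as nn

# Hypothetical encoder for step 102: maps a form image to a feature map.
# 3x3 kernels with padding=1 and stride 1 preserve the spatial size, so the
# feature map and the form image have the same height and width.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

image = torch.randn(1, 3, 512, 512)  # one to-be-processed form image
feature_map = encoder(image)         # shape (1, 64, 512, 512)
```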
Step 103, decoding the feature map to acquire the first position information of each text detection box in the form image and the text content in each text detection box.
In the present application, the first position information of each text detection box in the form image and the text content in each text detection box can be determined from the feature map.
The text detection box may be a quadrilateral surrounding text content in the form image; for example, it may be a quadrilateral surrounding the text content in a cell of the table. The first position information of the text detection box may include the coordinates, in the form image, of the quadrilateral's four vertices.
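For illustration, the first position information and text content of a box could be held in a structure like the following; this is a minimal sketch of an assumed data layout, not a structure defined by the patent:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextDetectionBox:
    # Four (x, y) vertex coordinates in the form image: the
    # "first position information" of this box.
    vertices: List[Tuple[float, float]] = field(default_factory=list)
    text: str = ""  # text content, filled in later by text recognition
```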
If the feature map is obtained by encoding with the text detection model, the feature map is decoded to obtain the first position information of each text detection box in the form image, and text recognition is then performed according to the first position information of each text detection box and the form image to obtain the text content in each text detection box.
If the feature map is obtained by encoding with a model that completes both text detection and text recognition, then decoding the feature map directly yields the first position information of each text detection box and the text content in each text detection box.
Step 104, determining the first row information and column information corresponding to the text detection box according to the feature map and the first position information.
In the present application, the features corresponding to the text detection box in the feature map can be determined according to the box's first position information, and the first row information and column information of the text detection box can be determined based on those features.
The first row information and column information of a text detection box indicate in which row and which column of the table the text detection box is located.
Step 105, generating an analysis result corresponding to the form image according to the text content, the first row information, and the column information.
In the present application, the row and column in which the text content of each text detection box sits can be determined from the text content in each box together with the first row information and column information corresponding to that box, so that the analysis result corresponding to the form image can be generated according to the row and column information corresponding to the text content of each text detection box.
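A minimal sketch of this assembly step follows; the patent does not prescribe an output format, so the grid layout and function name are assumptions:

```python
# Assemble an analysis result from (row, column, text) triples, one triple per
# text detection box; rows and columns are 1-indexed.
def build_table(cells):
    n_rows = max(r for r, _, _ in cells)
    n_cols = max(c for _, c, _ in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, text in cells:
        grid[r - 1][c - 1] = text
    return grid

result = build_table([(1, 1, "name"), (1, 2, "age"),
                      (2, 1, "Alice"), (2, 2, "30")])
# [["name", "age"], ["Alice", "30"]]
```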
In the embodiments of the present application, the first position information of each text detection box in the form image and the text content in each text detection box are obtained by decoding the feature map corresponding to the to-be-processed form image, and the first row information and column information corresponding to each text detection box are determined according to the feature map and the first position information, so that the table analysis result of the form image is obtained by combining the text content in the text detection boxes, realizing structured parsing of the form image. In addition, because the first row information and column information of each text detection box are determined based on that box's first position information, the accuracy of locating the boundaries between rows and between columns can be improved, which in turn improves the accuracy of the analysis result.
Fig. 2 is a flowchart illustrating a table parsing method according to another embodiment of the present application.
As shown in fig. 2, the table parsing method includes:
step 201, obtaining a form image to be processed.
Step 202, encoding the table image to obtain a feature map corresponding to the table image.
Step 203, decoding the feature map to obtain first position information of each text detection box in the form image and text content in each text detection box.
In the present application, steps 201 to 203 are similar to those described in the above embodiments, and therefore are not described herein again.
Step 204, segmenting the feature map according to the first position information to obtain the first sub-feature corresponding to the text detection box.
In the present application, the feature region corresponding to each text detection box's first position information in the feature map can be determined according to the size of the form image and the size of the feature map, and the first sub-feature corresponding to each text detection box can then be cropped out of the feature map according to that feature region. The first sub-feature corresponding to a text detection box is thus obtained based on the box's first position information, which improves the accuracy of the first sub-feature.
For example, if the size of the form image is the same as the size of the feature map, the feature region corresponding to the first position information of a text detection box in the feature map may be used directly as the region in which the box's first sub-feature is located.
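A minimal sketch of this cropping step, assuming (as in the example above) that the feature map keeps the spatial size of the form image; otherwise the vertex coordinates would first be rescaled by the ratio of feature-map size to image size:

```python
import numpy as np

# Crop the first sub-feature of one text detection box out of the feature map.
# feature_map: (C, H, W) array; vertices: four (x, y) points in image coordinates.
def crop_sub_feature(feature_map, vertices):
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    x0, x1 = int(min(xs)), int(max(xs))
    y0, y1 = int(min(ys)), int(max(ys))
    return feature_map[:, y0:y1, x0:x1]
```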
Step 205, determining the first row information and the column information according to the first sub-feature.
In the present application, the column information corresponding to a text detection box and the first row information corresponding to that box can be determined from its first sub-feature in different ways, as described below.
If the attributes contained in a certain type of table are located in the first row of the table, the attribute on each column can be regarded as a category. When parsing a form image of this type, the category corresponding to each text detection box can be determined from the box's first sub-feature, and the column information corresponding to each box can then be determined from the preset column information corresponding to each category. When determining the first row information corresponding to each text detection box, the first sub-features belonging to the same row can be identified from the distances between the first sub-features of the boxes, and the row information corresponding to the first sub-features belonging to the same row can be determined from the first position information of the boxes, thereby determining the first row information corresponding to each text detection box.
For example, suppose a certain type of table contains the attributes "name", "gender", and "age" in its first row, occupying columns 1, 2, and 3 of the table respectively; the attribute on each column can be regarded as a category. When parsing a table of this type, the category corresponding to each text detection box can be determined from the box's first sub-feature, so that the column information corresponding to each box can be determined from the column information of the three categories "name", "gender", and "age".
Conversely, if the attributes contained in a certain type of table are located in the first column of the table, the attribute on each row can be regarded as a category. When parsing a form image of this type, the category corresponding to each text detection box can be determined from the box's first sub-feature, and the row information corresponding to each box can then be determined from the preset row information corresponding to each category. When determining the column information corresponding to each text detection box, the first sub-features belonging to the same column can be identified from the distances between the first sub-features of the boxes, and the column information corresponding to the first sub-features belonging to the same column can be determined from the first position information of the boxes, thereby determining the column information corresponding to each text detection box.
Step 206, generating an analysis result corresponding to the form image according to the text content, the first row information, and the column information.
In the present application, the content of step 206 is similar to that described in the above embodiments, and therefore, the description thereof is omitted.
In the embodiments of the present application, when the first row information and column information corresponding to a text detection box are determined according to the feature map and the first position information, the feature map can be segmented according to the box's first position information to obtain the box's first sub-feature, and the first row information and column information can then be determined from that first sub-feature. Parsing the row and column information on the basis of each box's own first sub-feature improves the accuracy of the table analysis result.
Fig. 3 is a flowchart illustrating a table parsing method according to another embodiment of the present application.
As shown in fig. 3, the table parsing method includes:
step 301, obtaining a form image to be processed.
Step 302, encoding the form image to obtain a feature map corresponding to the form image.
Step 303, decoding the feature map to obtain first position information of each text detection box in the form image and text content in each text detection box.
Step 304, segmenting the feature map according to the first position information to obtain the first sub-feature corresponding to the text detection box.
In the present application, steps 301 to 304 are similar to those described in the above embodiments, and therefore are not described herein again.
Step 305, classifying the first sub-feature by column attribute to determine the column information.
In the present application, if the table in the to-be-processed form image is of a type whose attributes are located in the first row of the table, the first sub-feature corresponding to each text detection box can be classified by column attribute: the probability that the text content in the box belongs to each category is determined, the category to which the text content belongs is selected from the categories according to those probabilities, and the column information corresponding to the box is then determined from the column in which that category is located.
In practical applications, since the lengths of the text contents in different cells in the table may be different, the sizes of the text detection boxes may be different, and correspondingly, the sizes of the first sub-features corresponding to the text detection boxes may also be different.
Based on this, in the present application, when determining the probability that the text content in a text detection box belongs to each category, the first sub-feature corresponding to each box may be processed into a feature of uniform size and flattened into a one-dimensional vector; the feature matrix obtained after this vectorization is then multiplied by a preset matrix, and the resulting matrix contains the probability that the text content in each box belongs to each category. The preset matrix may be obtained by model training.
For example, suppose a certain type of table has 3 column attributes, i.e. 3 categories, and an image of such a table is being parsed. The first sub-features corresponding to its n text detection boxes may be unified to a size of 10 x 10 and flattened, giving an n x 100 matrix; multiplying this n x 100 matrix by a 100 x 3 matrix yields an n x 3 matrix, i.e. each text detection box corresponds to a 3-dimensional vector, and each element of that vector represents the probability that the box's text content belongs to the corresponding category.
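A worked sketch of this n x 100 by 100 x 3 example; the random values stand in for real features and learned parameters, and the resize step is omitted (each first sub-feature is assumed to be already pooled to 10 x 10):

```python
import numpy as np

n = 9                                     # number of text detection boxes
sub_features = np.random.rand(n, 10, 10)  # first sub-features, unified to 10 x 10
projection = np.random.rand(100, 3)       # preset matrix from model training, 3 categories

logits = sub_features.reshape(n, 100) @ projection             # (n, 3)
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)  # per-category probabilities
categories = probs.argmax(axis=1)  # category of each box; a preset
                                   # category-to-column mapping then gives column info
```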
Step 306, determining the first row information according to the first sub-feature by clustering.
In the present application, the model can be trained so that the distance between features of text content in the same row is smaller and the distance between features of text content in different rows is larger. At the model application stage, the first row information corresponding to each text detection box can therefore be determined by clustering according to the first sub-features of the text detection boxes in the form image.
As one possible implementation, the first sub-features corresponding to the text detection boxes may be clustered to obtain at least two first-class clusters, where the text detection boxes corresponding to the first sub-features in each first-class cluster belong to the same row. The second row information corresponding to each first-class cluster may then be determined according to the second position information of the text detection boxes corresponding to the first sub-features in that cluster, and the first row information of those text detection boxes may be determined from the cluster's second row information; in this way the first row information of every text detection box in the form image can be determined. Here, the second position information refers to the position information, in the form image, of the text detection boxes corresponding to the first sub-features in a first-class cluster, and the second row information is the same as the first row information.
For example, suppose a table contains 9 text detection boxes and clustering their first sub-features yields 3 clusters; this indicates that the table has 3 rows, each cluster corresponding to one row. According to the second position information of the text detection boxes corresponding to the first sub-features in the 3 clusters, the second row information corresponding to each cluster can be determined; for example, if a certain cluster corresponds to row 1 of the table, the text detection boxes corresponding to the first sub-features in that cluster are located in row 1.
Thus, the first row information corresponding to each text detection box can be determined by clustering the first sub-features of the boxes, realizing the parsing of table row information. Moreover, because the first sub-features of text detection boxes belonging to the same row are clustered together before the first row information is determined, the accuracy of the row-information parsing result is improved.
As another possible implementation, the first sub-feature corresponding to each text detection box may be mapped to a multi-dimensional space to obtain a second sub-feature corresponding to the box; the second sub-features are then clustered to obtain at least two second-class clusters, the third row information corresponding to each second-class cluster is determined according to the second position information of the text detection boxes corresponding to the second sub-features in that cluster, and the first row information of those boxes is determined from the cluster's third row information, so that the first row information of every text detection box in the form image can be determined. Here, the second position information refers to the position information, in the form image, of the text detection boxes corresponding to the second sub-features in a second-class cluster, and the third row information is the same as the first row information.
For example, suppose the first sub-features of a table's text detection boxes form a three-dimensional k x 5 x 2 matrix, where k is the number of text detection boxes. The last two dimensions may be flattened into one, giving a k x 10 matrix, which may then be multiplied by a 10 x 20 mapping matrix obtained by model learning to produce a k x 20 matrix; that is, the second sub-feature of each text detection box can be represented by a 20-dimensional vector. The second sub-features of the k text detection boxes may then be clustered so that the second sub-features of boxes belonging to the same row are grouped together, yielding at least two second-class clusters, each corresponding to one row of the table.
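A sketch of the k x 5 x 2 example above; the random mapping matrix stands in for the one obtained by model learning:

```python
import numpy as np

k = 9                                # number of text detection boxes
first_sub = np.random.rand(k, 5, 2)  # first sub-features, k x 5 x 2
mapping = np.random.rand(10, 20)     # learned 10 x 20 mapping matrix

second_sub = first_sub.reshape(k, 10) @ mapping  # (k, 20) second sub-features
```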
Thus, the first sub-feature of each text detection box is mapped to a multi-dimensional space to obtain a second sub-feature, and the second sub-features are clustered to determine the first row information of each box, realizing the parsing of table row information; since the second sub-features of boxes belonging to the same row are clustered together, the accuracy of the row-information parsing result is improved. In addition, mapping the first sub-features to a multi-dimensional space makes the second sub-features easier to separate during clustering than the first sub-features, further improving the accuracy of the row-information parsing result.
In the present application, when clustering the first sub-features or second sub-features of the text detection boxes, a variety of clustering algorithms may be used, such as K-means or DBSCAN (Density-Based Spatial Clustering of Applications with Noise); this is not limited in the present application.
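A minimal sketch of the row-clustering step using DBSCAN, one of the algorithms named above: cluster the second sub-features, then order the clusters top to bottom by the mean vertical position of their boxes to assign row numbers. The function name and the eps value are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# second_sub: (k, 20) second sub-features; box_centers_y: (k,) array holding each
# box's vertical center in the form image (its "second position information").
def assign_rows(second_sub, box_centers_y, eps=0.5):
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(second_sub)
    # Order clusters by the mean y of their boxes: the topmost cluster is row 1.
    order = sorted(set(labels), key=lambda c: box_centers_y[labels == c].mean())
    row_of_cluster = {c: i + 1 for i, c in enumerate(order)}
    return np.array([row_of_cluster[c] for c in labels])  # first row information
```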
Step 307, generating an analysis result corresponding to the table image according to the text content, the first row information and the column information.
In the present application, step 307 is similar to the content described in the above embodiment, and therefore is not described herein again.
In the embodiments of the present application, when the first row information and column information corresponding to a text detection box are determined from the box's first sub-feature, the column information is determined by classifying the first sub-feature by column attribute, and the first row information is determined from the first sub-feature by clustering, realizing the parsing of table row and column information. Moreover, determining the first row information by clustering improves the accuracy of the row-information parsing result.
It should be noted that, for a type of table whose attributes are located in the first column, the first sub-feature of each text detection box in the form image may be classified by row attribute to determine the row information of each box; this is similar to the column-attribute classification above and is not repeated here. For parsing the column information, the column information corresponding to each text detection box may be determined by clustering according to the box's first sub-feature, similarly to determining the first row information by clustering; this is likewise not repeated here.
Fig. 4 is a flowchart illustrating a table parsing method according to another embodiment of the present application.
As shown in fig. 4, the table parsing method includes:
step 401, obtaining a form image to be processed.
In the present application, step 401 is similar to the content described in the above embodiments, and therefore is not described herein again.
Step 402, encoding the table image to obtain a feature map corresponding to the table image.
In the present application, separate models may be established for text detection and text recognition of the form image, and the encoding layer of the text detection model may be used to encode the form image to obtain the corresponding feature map.
Step 403, decoding the feature map to obtain the first position information.
In the application, the feature map can be decoded by using a decoding layer in the text detection model to obtain the first position information of each text detection box.
The text detection box may be a quadrilateral surrounding text content in the form image; for example, it may be a quadrilateral surrounding the text content in a cell of the table. The first position information of the text detection box may include the coordinates, in the form image, of the quadrilateral's four vertices.
Step 404, extracting the image regions corresponding to the text detection boxes from the form image according to the first position information.
In the present application, the image region surrounded by each text detection box in the form image can be determined according to the box's first position information, so that the image region corresponding to each box can be extracted from the form image. The image region corresponding to a text detection box contains an image of its text content.
Step 405, performing text recognition on the image area to obtain text content.
In the present application, the image region corresponding to each text detection box can be input into the text recognition model for text recognition, so as to obtain the text content in each box, thereby realizing the parsing of table content.
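A minimal sketch of steps 404 and 405, cropping each box's image region with its axis-aligned bounding rectangle; the recognizer callable is a hypothetical stand-in for the text recognition model:

```python
import numpy as np

# form_image: (H, W, 3) image array; boxes: list of four-(x, y)-vertex lists.
def read_boxes(form_image, boxes, recognizer):
    texts = []
    for vertices in boxes:
        xs = [int(x) for x, _ in vertices]
        ys = [int(y) for _, y in vertices]
        region = form_image[min(ys):max(ys), min(xs):max(xs)]  # step 404
        texts.append(recognizer(region))                       # step 405
    return texts
```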
Step 406, determining the first row information and column information corresponding to the text detection box according to the feature map and the first position information.
Step 407, generating an analysis result corresponding to the form image according to the text content, the first row information, and the column information.
In the present application, steps 406 to 407 are similar to those described in the above embodiments, and therefore are not described herein again.
In the embodiments of the present application, the feature map corresponding to the form image may be decoded to obtain the first position information of each text detection box; the image region corresponding to each box is then extracted from the form image according to that first position information, and text recognition is performed on each region to obtain the text content in each box, thereby realizing the parsing of table content.
For ease of understanding, the following description refers to fig. 5, a schematic diagram of a table parsing process provided in an embodiment of the present application; the parsing of table row information and column information is shown in fig. 5.
As shown in fig. 5, the to-be-processed form image may be input into a CNN (Convolutional Neural Network) for encoding to obtain the feature map corresponding to the form image, and the feature map is decoded to obtain a text confidence matrix and a text region position matrix.
The text confidence matrix is a two-dimensional matrix in which each element represents the probability that the corresponding region contains text content; the text region position matrix is a three-dimensional matrix whose third dimension contains 8 values, representing the 8 coordinate values of the four vertices of the corresponding region.
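A sketch of decoding these two matrices: keep the positions whose confidence exceeds a threshold and read out the 8 coordinate values stored there; the threshold value is an assumption:

```python
import numpy as np

# confidence: (H, W) text confidence matrix; positions: (H, W, 8) text region
# position matrix. Returns one quadrilateral (four (x, y) vertices) per kept position.
def decode_boxes(confidence, positions, threshold=0.7):
    ys, xs = np.where(confidence > threshold)
    return positions[ys, xs].reshape(-1, 4, 2)
```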
Then, the feature map may be segmented using the position information of the regions whose confidence exceeds a threshold in the text confidence matrix, yielding the sub-features corresponding to 9 regions; these 9 regions are the text detection boxes of the above embodiments, and each small rectangle in fig. 5 represents the first sub-feature corresponding to one text detection box.
It can be understood that determining the regions whose confidence exceeds the threshold based on the text confidence matrix amounts to determining the text detection boxes, and the first sub-feature of each box is cropped from the feature map using the boxes' first position information.
Then, the first sub-features of the 9 text detection boxes are classified by column attribute to determine which of the 3 categories, "item", "quantity", or "amount", each box belongs to, so that the column information of each box can be determined from the column in which its category is located. In addition, the first sub-features of the 9 boxes can be mapped to a multi-dimensional space and then clustered, so that the first sub-features of boxes in the same row are grouped together, yielding 3 clusters, each corresponding to one row of the table.
Furthermore, according to the first position information of each text detection box, the image region corresponding to each box can be extracted from the form image and text recognition performed on it to obtain the text content in each box. The analysis result of the form image can then be generated from the row information and column information corresponding to each box together with the text content in each box.
In order to implement the foregoing embodiments, an embodiment of the present application further provides a table parsing apparatus. Fig. 6 is a schematic structural diagram of a table parsing apparatus according to an embodiment of the present application.
As shown in fig. 6, the table parsing apparatus 600 includes:
a first obtaining module 610, configured to obtain a form image to be processed;
a second obtaining module 620, configured to encode the table image to obtain a feature map corresponding to the table image;
a third obtaining module 630, configured to decode the feature map to obtain first position information of each text detection box in the form image and text content in each text detection box;
the determining module 640 is configured to determine, according to the feature map and the first position information, first row information and column information corresponding to the text detection box;
the generating module 650 is configured to generate an analysis result corresponding to the table image according to the text content, the first row information, and the column information.
In a possible implementation manner of this embodiment of the present application, the determining module 640 includes:
the acquiring unit is used for segmenting the feature map according to the first position information so as to acquire a first sub-feature corresponding to the text detection box;
and the determining unit is used for determining the first row information and the column information according to the first sub-feature.
In a possible implementation manner of the embodiment of the present application, the obtaining unit is configured to:
determining a feature region corresponding to the first position information in the feature map;
and segmenting the feature map according to the feature region to obtain the first sub-feature.
In a possible implementation manner of the embodiment of the present application, the determining unit is configured to:
classifying the first sub-feature by column attribute to determine the column information;
and determining the first row information according to the first sub-feature by clustering.
In a possible implementation manner of the embodiment of the present application, the determining unit is configured to:
clustering the first sub-features to obtain at least two first-class clusters;
determining second row information corresponding to each first-class cluster according to second position information of the text detection box corresponding to the first sub-feature in each first-class cluster;
and determining the first row information according to the second row information.
In a possible implementation manner of the embodiment of the present application, the determining unit is configured to:
mapping the first sub-features to a multi-dimensional space to obtain second sub-features corresponding to the text detection boxes;
clustering the second sub-features to obtain at least two second-class clusters;
determining third row information corresponding to each second-class cluster according to second position information of the text detection box corresponding to the second sub-feature in each second-class cluster;
and determining the first row information according to the third row information.
In a possible implementation manner of the embodiment of the present application, the third obtaining module 630 is configured to:
decoding the feature map to obtain the first position information;
extracting image areas corresponding to the text detection boxes from the form image according to the first position information;
and performing text recognition on the image area to acquire text content.
It should be noted that the explanation of the foregoing table parsing method embodiments also applies to the table parsing apparatus of this embodiment, and is therefore not repeated here.
In the embodiments of the present application, the first position information of each text detection box in the form image and the text content in each text detection box are obtained by decoding the feature map corresponding to the to-be-processed form image, and the first row information and column information corresponding to each text detection box are determined according to the feature map and the first position information, so that the table analysis result of the form image is obtained by combining the text content in the text detection boxes, realizing structured parsing of the form image. In addition, because the first row information and column information of each text detection box are determined based on that box's first position information, the accuracy of locating the boundaries between rows and between columns can be improved, which in turn improves the accuracy of the analysis result.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 702 or a computer program loaded from a storage unit 708 into a RAM (Random Access Memory) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An I/O (Input/Output) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, or microcontroller. The computing unit 701 executes the methods and processes described above, such as the table parsing method. For example, in some embodiments, the table parsing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the table parsing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the table parsing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field-Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in the cloud computing service system that remedies the defects of high management difficulty and weak service expansibility of conventional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to an embodiment of the present application, there is also provided a computer program product; when the instructions in the computer program product are executed by a processor, the table parsing method proposed in the above embodiments of the present application is performed.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (17)

1. A table parsing method, comprising:
acquiring a form image to be processed;
encoding the form image to obtain a characteristic diagram corresponding to the form image;
decoding the feature map to acquire first position information of each text detection box in the form image and text content in each text detection box;
determining first row information and column information corresponding to the text detection box according to the feature map and the first position information;
and generating an analysis result corresponding to the form image according to the text content, the first row information, and the column information.
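Purely for illustration, the steps of claim 1 can be read as the following Python sketch; encoder, box_decoder, and rowcol_head are hypothetical callables standing in for the trained networks an implementation might use and are not named anywhere in this application.

    def parse_form(form_image, encoder, box_decoder, rowcol_head):
        """Illustrative flow of claim 1 with hypothetical components.

        encoder:     form image -> feature map            (encoding step)
        box_decoder: feature map -> (boxes, texts)        (decoding step)
        rowcol_head: feature map, boxes -> (rows, cols)   (row/column step)
        """
        feature_map = encoder(form_image)
        boxes, texts = box_decoder(feature_map)
        rows, cols = rowcol_head(feature_map, boxes)
        # Analysis result: one record per text detection box.
        return [
            {"text": t, "row": int(r), "col": int(c), "box": list(b)}
            for t, r, c, b in zip(texts, rows, cols, boxes)
        ]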
2. The method of claim 1, wherein the determining, according to the feature map and the first position information, first row information and column information corresponding to the text detection box comprises:
segmenting the feature map according to the first position information to obtain a first sub-feature corresponding to the text detection box;
determining the first row information and the column information according to the first sub-feature.
3. The method of claim 2, wherein the segmenting the feature map according to the first position information to obtain a first sub-feature corresponding to the text detection box comprises:
determining a feature region corresponding to the first position information in the feature map;
and segmenting the feature map according to the feature region to obtain the first sub-feature.
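One plausible reading of claims 2 and 3 is a region-of-interest crop: the first position information is scaled from image coordinates down to feature-map coordinates, and the covered feature region is pooled into a fixed-length first sub-feature. The NumPy sketch below assumes a (C, H, W) feature map, a single downsampling stride, and mean pooling; all three are illustrative assumptions rather than details fixed by the claims.

    import numpy as np

    def crop_sub_feature(feature_map, box, stride=4):
        """Cut out the first sub-feature for one text detection box.

        feature_map: (C, H, W) array produced by the encoder.
        box:         (x1, y1, x2, y2) in image pixels (first position information).
        stride:      assumed image-to-feature-map downsampling factor.
        """
        c, h, w = feature_map.shape
        # Map the box into feature-map coordinates (the feature region).
        x1, y1, x2, y2 = (int(round(v / stride)) for v in box)
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(w, max(x2, x1 + 1)), min(h, max(y2, y1 + 1))
        # Segment the feature map and pool the region to a fixed-length vector.
        return feature_map[:, y1:y2, x1:x2].mean(axis=(1, 2))

    fm = np.random.rand(64, 128, 128).astype(np.float32)
    vec = crop_sub_feature(fm, (40, 12, 220, 36))  # shape: (64,)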
4. The method of claim 2, wherein the determining the first row information and the column information according to the first sub-feature comprises:
performing column-attribute classification on the first sub-feature to determine the column information;
and determining the first row information by clustering according to the first sub-feature.
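A minimal sketch of the split that claim 4 makes, assuming scikit-learn is available: the column information comes from any fitted classifier, and the row information from clustering the same first sub-features. The DBSCAN choice and its parameters are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def rows_and_cols(sub_features, column_classifier, eps=0.5):
        """Columns by classification, rows by clustering (claim 4).

        sub_features:      (N, C) array, one first sub-feature per box.
        column_classifier: any fitted estimator with predict(); a stand-in
                           for the column-attribute head assumed here.
        """
        column_info = column_classifier.predict(sub_features)
        # min_samples=1 so every detection box receives a row cluster.
        row_info = DBSCAN(eps=eps, min_samples=1).fit_predict(sub_features)
        return row_info, column_info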
5. The method of claim 4, wherein the determining the first row information by clustering according to the first sub-feature comprises:
clustering the first sub-features to obtain at least two first clusters;
determining second row information corresponding to each first cluster according to second position information of the text detection box corresponding to the first sub-feature in each first cluster;
and determining the first row information according to the second row information.
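Claim 5 leaves open how the second row information is derived from the second position information. One natural, purely illustrative choice is to average each first cluster's vertical box centers and rank the clusters from top to bottom:

    import numpy as np

    def rows_from_clusters(cluster_ids, box_centers_y):
        """Turn raw cluster labels into top-to-bottom row numbers.

        cluster_ids:   (N,) first-cluster label per text detection box.
        box_centers_y: (N,) vertical center of each box (second position info).
        """
        labels = np.unique(cluster_ids)
        # Second row information: mean vertical position of each cluster.
        mean_y = {l: box_centers_y[cluster_ids == l].mean() for l in labels}
        # First row information: rank the clusters from top to bottom.
        rank = {l: r for r, l in enumerate(sorted(labels, key=mean_y.get))}
        return np.array([rank[l] for l in cluster_ids])

    rows = rows_from_clusters(np.array([2, 2, 0, 0, 1]),
                              np.array([20.0, 22.0, 61.0, 60.0, 98.0]))
    # rows -> array([0, 0, 1, 1, 2])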
6. The method of claim 4, wherein the determining the first row information by clustering according to the first sub-feature comprises:
mapping the first sub-feature to a multi-dimensional space to obtain a second sub-feature corresponding to the text detection box;
clustering the second sub-features to obtain at least two second clusters;
determining third row information corresponding to each second cluster according to second position information of the text detection box corresponding to the second sub-feature in each second cluster;
and determining the first row information according to the third row information.
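Claim 6 differs from claim 5 only in that the first sub-features are first mapped into another space. In the sketch below the learned mapping is stood in for by a plain projection matrix, an assumption made here for illustration; any trained embedding head could take its place.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def embed_then_cluster(first_sub_features, projection, eps=0.3):
        """Map first sub-features to second sub-features, then cluster.

        projection: (C, D) matrix standing in for the learned mapping
                    that yields the second sub-feature for each box.
        """
        second = first_sub_features @ projection
        # Normalizing makes the clustering radius scale-independent.
        second /= np.linalg.norm(second, axis=1, keepdims=True) + 1e-8
        return DBSCAN(eps=eps, min_samples=1).fit_predict(second)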
7. The method of claim 1, wherein the decoding the feature map to acquire the first position information of each text detection box in the form image and the text content in each text detection box comprises:
decoding the feature map to acquire the first position information;
extracting image areas corresponding to the text detection boxes from the form image according to the first position information;
and performing text recognition on the image areas to acquire the text content.
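A short sketch of the crop-then-read split in claim 7, assuming the form image is an (H, W, 3) array and treating the recognizer as an opaque OCR callable, since no particular recognition model is specified here:

    def read_boxes(form_image, boxes, recognizer):
        """Extract each box's image area and recognize its text (claim 7).

        form_image: (H, W, 3) array of the whole form image.
        boxes:      iterable of (x1, y1, x2, y2) first position information.
        recognizer: stand-in OCR callable, crop -> string.
        """
        texts = []
        for x1, y1, x2, y2 in boxes:
            crop = form_image[int(y1):int(y2), int(x1):int(x2)]  # image area
            texts.append(recognizer(crop))                       # text content
        return texts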
8. A table parsing apparatus, comprising:
a first obtaining module, configured to obtain a form image to be processed;
a second obtaining module, configured to encode the form image to obtain a feature map corresponding to the form image;
a third obtaining module, configured to decode the feature map to obtain first position information of each text detection box in the form image and text content in each text detection box;
a determining module, configured to determine first row information and column information corresponding to the text detection box according to the feature map and the first position information;
and a generating module, configured to generate an analysis result corresponding to the form image according to the text content, the first row information, and the column information.
9. The apparatus of claim 8, wherein the determining module comprises:
an obtaining unit, configured to segment the feature map according to the first position information to obtain a first sub-feature corresponding to the text detection box;
and a determining unit, configured to determine the first row information and the column information according to the first sub-feature.
10. The apparatus of claim 9, wherein the obtaining unit is configured to:
determine a feature region corresponding to the first position information in the feature map;
and segment the feature map according to the feature region to obtain the first sub-feature.
11. The apparatus of claim 9, wherein the determining unit is configured to:
perform column-attribute classification on the first sub-feature to determine the column information;
and determine the first row information by clustering according to the first sub-feature.
12. The apparatus of claim 11, wherein the determining unit is configured to:
cluster the first sub-features to obtain at least two first clusters;
determine second row information corresponding to each first cluster according to second position information of the text detection box corresponding to the first sub-feature in each first cluster;
and determine the first row information according to the second row information.
13. The apparatus of claim 11, wherein the determining unit is configured to:
map the first sub-feature to a multi-dimensional space to obtain a second sub-feature corresponding to the text detection box;
cluster the second sub-features to obtain at least two second clusters;
determine third row information corresponding to each second cluster according to second position information of the text detection box corresponding to the second sub-feature in each second cluster;
and determine the first row information according to the third row information.
14. The apparatus of claim 8, wherein the third obtaining module is configured to:
decode the feature map to acquire the first position information;
extract image areas corresponding to the text detection boxes from the form image according to the first position information;
and perform text recognition on the image areas to acquire the text content.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
CN202210781115.3A 2022-07-04 2022-07-04 Table analysis method and device, electronic equipment and storage medium Pending CN115116080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210781115.3A CN115116080A (en) 2022-07-04 2022-07-04 Table analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210781115.3A CN115116080A (en) 2022-07-04 2022-07-04 Table analysis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115116080A true CN115116080A (en) 2022-09-27

Family

ID=83331060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210781115.3A Pending CN115116080A (en) 2022-07-04 2022-07-04 Table analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115116080A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497113A (en) * 2022-09-30 2022-12-20 北京百度网讯科技有限公司 Information generation method, information generation device, electronic device, and storage medium
CN115497113B (en) * 2022-09-30 2023-11-14 北京百度网讯科技有限公司 Information generation method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
EP3570208A1 (en) Two-dimensional document processing
CN112966522A (en) Image classification method and device, electronic equipment and storage medium
EP4040401A1 (en) Image processing method and apparatus, device and storage medium
CN109284371B (en) Anti-fraud method, electronic device, and computer-readable storage medium
US11861919B2 (en) Text recognition method and device, and electronic device
CN113989593A (en) Image processing method, search method, training method, device, equipment and medium
CN114549874A (en) Training method of multi-target image-text matching model, image-text retrieval method and device
CN113221743A (en) Table analysis method and device, electronic equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
US11881044B2 (en) Method and apparatus for processing image, device and storage medium
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN113537192A (en) Image detection method, image detection device, electronic equipment and storage medium
CN114661904B (en) Method, apparatus, device, storage medium, and program for training document processing model
CN114842482B (en) Image classification method, device, equipment and storage medium
CN115082598A (en) Text image generation method, text image training method, text image processing method and electronic equipment
CN114842489A (en) Table analysis method and device
CN114881227A (en) Model compression method, image processing method, device and electronic equipment
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN113887394A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination