CN114639107A - Table image processing method, apparatus and storage medium - Google Patents

Table image processing method, apparatus and storage medium

Info

Publication number
CN114639107A
Authority
CN
China
Prior art keywords
image
feature map
position information
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210427478.7A
Other languages
Chinese (zh)
Other versions
CN114639107B (en)
Inventor
庾悦晨
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210427478.7A priority Critical patent/CN114639107B/en
Publication of CN114639107A publication Critical patent/CN114639107A/en
Application granted granted Critical
Publication of CN114639107B publication Critical patent/CN114639107B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a form image processing method, a form image processing apparatus and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, image processing, deep learning and the like. The specific implementation scheme is as follows: when the form image is processed, first text information and first position information of each text box in the form image are obtained, second position information of a cell image to be recognized in the form image is determined, third position information matched with the second position information is obtained from the first position information, and second text information in the text box corresponding to the third position information is used as the text content in the cell image to be recognized. Therefore, the text content in the cell image to be recognized can be conveniently determined based on the text information and position information of the text boxes in the form image, which reduces the complexity of form image processing and improves its accuracy.

Description

Table image processing method, apparatus and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of artificial intelligence technologies, and in particular, to the field of computer vision, image processing, deep learning, and the like, and in particular, to a method and an apparatus for processing a form image, and a storage medium.
Background
As office work becomes increasingly electronic, document data originally stored on paper is gradually converted into images by electronic means such as scanners. A table image refers to an image that contains a table.
In the related art, a relatively complex image processing pipeline is generally used to process table images.
Disclosure of Invention
The present disclosure provides a method, apparatus, and storage medium for form image processing.
According to an aspect of the present disclosure, there is provided a form image processing method, the method including: acquiring first text information and first position information of each text box in a form image; determining second position information of a cell image to be identified in the form image; acquiring third position information matched with the second position information from the first position information; and taking the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
According to another aspect of the present disclosure, there is provided a form image processing apparatus, the apparatus including: the first acquisition module is used for acquiring first text information and first position information of each text box in the form image; the first determining module is used for determining second position information of the cell image to be identified in the form image; the second acquisition module is used for acquiring third position information matched with the second position information from the first position information; and the second determining module is used for taking the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the form image processing method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the form image processing method disclosed in the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the form image processing method of the present disclosure.
One embodiment in the above application has the following advantages or benefits:
when the form image is processed, first text information and first position information of each text box in the form image are obtained, second position information of a cell image to be recognized in the form image is determined, third position information matched with the second position information is obtained from the first position information, and second text information in the text box corresponding to the third position information is used as the text content in the cell image to be recognized. Therefore, the text content in the cell image to be recognized can be conveniently determined based on the text information and position information of the text boxes in the form image, which reduces the complexity of form image processing and improves its accuracy.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic illustration according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing the form image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Table image processing methods, apparatuses, and storage media according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a form image processing method according to a first embodiment of the present disclosure.
As shown in fig. 1, the form image processing method may include:
Step 101, obtaining first text information and first position information of each text box in the form image.
The execution subject of the form image processing method of the present embodiment is a form image processing apparatus, which may be implemented by software and/or hardware, and the form image processing apparatus may be an electronic device or may be disposed in an electronic device.
The electronic device may include, but is not limited to, a terminal device, a server, and the like, and the embodiment does not specifically limit the electronic device.
In some embodiments of the present disclosure, after the table image is obtained, optical character recognition (OCR) may be performed on the table image by a character recognizer to determine the text information and position information corresponding to each text box in the table image.
In some exemplary embodiments, the table image may be subjected to text detection to obtain each text box in the table image and position information corresponding to each text box, and the text boxes may be subjected to OCR recognition to obtain text information corresponding to each text box.
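The following Python sketch only illustrates the shape of the data this step produces; `run_ocr` is a hypothetical stand-in for any detector-plus-recognizer pair, not an interface named by this disclosure.

```python
# Hypothetical sketch of step 101: detect text boxes, then recognize each one.
# `run_ocr` is a placeholder for any OCR toolkit, not an API from this disclosure.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2): upper-left and lower-right corners

@dataclass
class TextBox:
    text: str  # "first text information" of the text box
    box: Box   # "first position information" of the text box

def run_ocr(table_image) -> List[TextBox]:
    """Placeholder: detect text regions in the table image and recognize each;
    any character recognizer producing (text, box) pairs fits this step."""
    raise NotImplementedError("plug in an OCR detector + recognizer here")
```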
The text box refers to an image area corresponding to the text in the form image.
The image format of the form image in the embodiments of the present disclosure may include, but is not limited to: pictures such as jpg, jpeg, ppm, bmp and png, screenshots, scanned files, PDF documents, and the like. The image content of the form image may be, but is not limited to, form material authorized by the user from documents of financial institutions such as banks, securities firms, fund companies and insurers, as well as enterprises and public institutions, for example receipts, tickets, insurance policies, notices, confirmations and application forms.
Step 102, determining second position information of the cell image to be identified in the form image.
In an embodiment of the present disclosure, table line detection may be performed on the table image, and the cell image to be recognized and its second position information in the table image may be determined according to the detection result.
Step 103, acquiring third position information matched with the second position information from the first position information.
In some exemplary embodiments, a degree of matching between the second location information and each of the first location information described above may be calculated, and from each of the first location information, the first location information having the highest degree of matching is acquired as the third location information that matches the second location information.
In some exemplary embodiments, the text boxes are generally rectangular, and the first position information corresponding to any text box may be represented by x1, y1, x2 and y2, where (x1, y1) are the coordinates of the upper left corner of the rectangle and (x2, y2) are the coordinates of the lower right corner.
In some exemplary embodiments, the second position information may be represented by x3, y3, x4 and y4, where (x3, y3) are the coordinates of the upper left corner of the cell image and (x4, y4) are the coordinates of the lower right corner.
In some embodiments, for each piece of first position information that includes two position coordinates, a first center coordinate of the two coordinates may be calculated. Similarly, when the second position information includes two position coordinates, a second center coordinate may be calculated. The distance between the first center coordinate and the second center coordinate is then computed, and the matching degree between the first position information and the second position information is determined from this distance.
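As a concrete reading of this matching rule, the sketch below (an illustration, not necessarily the disclosure's exact computation) scores each text box by the distance between its center and the cell's center and returns the closest one.

```python
# Sketch of step 103: match a cell to the text box whose center is nearest,
# i.e. whose matching degree (inverse center distance) is highest.
import math
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def match_text_box(cell_box: Box, text_boxes: List[Box]) -> Optional[int]:
    """Return the index of the best-matching first position information,
    or None if there are no text boxes."""
    if not text_boxes:
        return None
    cell_center = center(cell_box)
    return min(range(len(text_boxes)),
               key=lambda i: math.dist(cell_center, center(text_boxes[i])))
```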
Step 104, taking the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
When the form image is processed, the method for processing the form image according to the embodiment of the disclosure acquires first text information and first position information of each text box in the form image, determines second position information of a cell image to be recognized in the form image, acquires third position information matched with the second position information from the first position information, and takes second text information in the text box corresponding to the third position information as text content in the cell image to be recognized. Therefore, the text content in the cell image to be recognized in the form image is conveniently determined based on the text information and the position information of the text box in the form image, the complexity of form image processing is reduced, and the accuracy of form image processing is improved.
It can be understood that, in some embodiments, in order to accurately determine the position information of the cell image to be recognized in the form image, this position information may be determined by combining the image feature map and the semantic feature map of the form image. The process of determining the second position information of the cell image to be recognized is exemplarily described below with reference to fig. 2; as shown in fig. 2, the process may include:
Step 201, obtaining an image feature map of the form image.
In some exemplary embodiments, the feature extraction may be performed on the table image by a Convolutional Neural Network (CNN) in the encoder to obtain an image feature map of the table image.
It is to be understood that the size of the image feature map is smaller than the size of the form image.
The convolutional neural network may be, for example, a residual network (ResNet) or a Visual Geometry Group (VGG) network; this embodiment is not particularly limited in this respect, and in practice any convolutional neural network capable of extracting features from the table image may be selected according to actual requirements.
The number of the image feature maps may be one or more, which is not limited in this embodiment.
It can be understood that, when there are a plurality of image feature maps, the sizes of the plurality of image feature maps are different. For example, feature extraction may be performed on the table image by N convolutional layers connected in sequence in the convolutional neural network, and image feature maps of different sizes output by each convolutional layer may be obtained, where N is an integer greater than 1; for example, N may be 4 or 5, and this embodiment is not particularly limited in this respect.
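A minimal PyTorch sketch of such an N-layer backbone follows; the channel counts, strides, and ReLU activations are illustrative assumptions, not parameters fixed by the disclosure.

```python
# Illustrative N-layer convolutional backbone: each stride-2 layer halves the
# spatial size, yielding N image feature maps of different sizes.
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    def __init__(self, n_layers: int = 4, channels: int = 64, in_channels: int = 3):
        super().__init__()
        stages = []
        for _ in range(n_layers):
            stages.append(nn.Sequential(
                nn.Conv2d(in_channels, channels, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_channels = channels
        self.stages = nn.ModuleList(stages)

    def forward(self, x: torch.Tensor):
        feature_maps = []
        for stage in self.stages:
            x = stage(x)
            feature_maps.append(x)  # one feature map per convolutional layer
        return feature_maps
```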
Step 202, generating a semantic feature map of the form image according to the semantic features of the first text information and the first position information.
In some exemplary embodiments, after the first text information of each text box is obtained, the semantic features of each character in the corresponding text information may be determined through a semantic representation model, and then the semantic features corresponding to the text information are determined according to the semantic features of each character.
In some exemplary embodiments, one possible implementation of determining the semantic feature corresponding to the text information from the semantic features of its characters is: summing the semantic features of the characters and averaging them to obtain the semantic feature corresponding to the text information.
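This averaging amounts to a mean over the character axis; a short NumPy illustration (the shapes are assumptions for clarity):

```python
# Sketch: the text-level semantic feature is the mean of per-character features.
import numpy as np

def pool_text_feature(char_features: np.ndarray) -> np.ndarray:
    """char_features has shape (num_chars, dim); the result has shape (dim,)."""
    return char_features.mean(axis=0)
```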
Step 203, determining second position information of the cell image to be identified in the form image according to the image feature map and the semantic feature map.
In some exemplary embodiments, after the image feature map and the semantic feature map are obtained, analysis may be performed in combination with the image feature map and the semantic feature map to obtain second location information of the cell image to be identified in the form image.
The second position information is used for representing the position information of the cell image to be recognized in the form image.
Based on the above embodiment, when determining the second position information of the cell image to be identified in the form image from the image feature map and the semantic feature map, the sizes of the two maps generally need to be consistent. To determine the second position information more efficiently, the semantic feature map of the form image can be generated with reference to the size of the image feature map when it is built from the semantic features of the first text information and the first position information of each text box, so that the generated semantic feature map and the image feature map have the same size. One possible implementation of step 202 above is exemplarily described below with reference to fig. 3; as shown in fig. 3, it may include:
Step 301, generating an initial feature map with the same size as the image feature map, wherein the pixel values on the initial feature map are all zero.
Step 302, determining the reduction multiple of the form image according to the size of the image feature map and the size of the form image.
Step 303, determining the mapping position information of the first position information on the initial feature map according to the reduction multiple.
Step 304, filling the semantic features into the mapping position information to obtain the semantic feature map.
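Steps 301-304 can be read as the following sketch, which assumes (x1, y1, x2, y2) pixel boxes and per-text features of `dim` channels; the coordinate rounding is an illustrative choice.

```python
# Sketch of steps 301-304: build a zero-initialized map at the image feature
# map's size, map each text box down by the reduction multiple, and fill the
# mapped region with that text's semantic feature.
import numpy as np

def build_semantic_feature_map(boxes, text_features, fmap_hw, image_hw, dim):
    fh, fw = fmap_hw                       # size of the image feature map
    ih, iw = image_hw                      # size of the form image
    scale_y, scale_x = fh / ih, fw / iw    # 1 / reduction multiple (step 302)
    sem_map = np.zeros((dim, fh, fw), dtype=np.float32)  # step 301: all-zero map
    for (x1, y1, x2, y2), feat in zip(boxes, text_features):
        # step 303: mapping position information on the initial feature map
        fx1, fy1 = int(x1 * scale_x), int(y1 * scale_y)
        fx2 = max(fx1 + 1, int(x2 * scale_x))
        fy2 = max(fy1 + 1, int(y2 * scale_y))
        # step 304: fill the mapped region with the text's semantic feature
        sem_map[:, fy1:fy2, fx1:fx2] = np.asarray(feat, np.float32)[:, None, None]
    return sem_map
```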
In other exemplary embodiments of the present disclosure, an initial feature map having the same size as the form image may instead be generated, where the pixel values on the initial feature map are all zero, and the semantic features are filled in at the corresponding position information to obtain the semantic feature map. In such an embodiment, before determining the second position information of the cell image to be recognized based on the image feature map and the semantic feature map, the size of the semantic feature map may be further adjusted according to the size of the image feature map, so that the two maps have the same size.
In some exemplary embodiments, in order to accurately determine the second position information of the cell image to be recognized in the form image, one possible implementation of step 203, in which the second position information is determined according to the image feature map and the semantic feature map, is shown in fig. 4 and may include:
Step 401, performing feature fusion on the image feature map and the semantic feature map to obtain a fusion feature map.
In some exemplary embodiments, the image feature map and the semantic feature map may be input into a feature fusion model, which performs feature fusion on them to obtain the fusion feature map.
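The disclosure does not fix the form of the feature fusion model; one common realization, shown below purely as an assumption, concatenates the two maps along the channel axis and mixes them with a 1x1 convolution.

```python
# Illustrative fusion model: channel concatenation + 1x1 convolution.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, img_channels: int, sem_channels: int, out_channels: int):
        super().__init__()
        self.mix = nn.Conv2d(img_channels + sem_channels, out_channels, kernel_size=1)

    def forward(self, img_fmap: torch.Tensor, sem_fmap: torch.Tensor) -> torch.Tensor:
        # the two maps are assumed to share the same spatial size (see above)
        return self.mix(torch.cat([img_fmap, sem_fmap], dim=1))
```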
Step 402, determining form structure information of the form image according to the fusion feature map, wherein the form structure information comprises a text label.
In this embodiment, the fusion feature map, which combines the image features and the semantic features, avoids the drawback of determining the form structure information of the form image from a single type of image feature information, and effectively improves the accuracy of the form structure information.
In addition, it can be understood that using the fusion feature map reduces the dependence on image features, so that when the image feature map of the table image is acquired, the features can be extracted with a lighter convolutional neural network (such as MobileNet), improving the efficiency of acquiring the image feature map of the table image.
MobileNet is a lightweight convolutional neural network.
In some exemplary embodiments, after the fusion feature map is obtained, a structure decoder may be used to process it to obtain the table structure information of the table image. In this way, the table structure information of the table image can be conveniently acquired by having the structure decoder process the fused features.
The structure decoder may include a recurrent neural network (RNN), for example a long short-term memory (LSTM) network.
The table structure information refers to information about the table layout and may be represented by a plurality of tags. For example, the table structure information may include the following tags: <thead>, <tr>, <td>, etc., where <thead> represents the header of the table, <tr> represents a row of the table, and <td> and </td> are text labels, <td> corresponding to the start position of the text content in a cell image and </td> to its end position.
Step 403, taking the cell image corresponding to the text label in the form image as the cell image to be identified.
Step 404, determining second position information of the cell image to be identified in the form image based on the fusion feature map and the text label.
In an embodiment of the disclosure, a target decoding unit corresponding to the text label is obtained from a plurality of decoding units of the structure decoder, and a position decoder may be adopted to process the output feature of the target decoding unit and the fusion feature map to obtain second position information of the cell image to be identified. Therefore, the position decoder can be used for processing based on the fusion feature map and the text label, and the second position information of the cell image to be recognized in the form image can be simply and conveniently acquired.
The position decoder may likewise be an RNN, specifically an LSTM network.
In some embodiments, for the sake of differentiation, the structure decoder may include an LSTM network that may be referred to as a first LSTM network and the location decoder may include an LSTM network that may be referred to as a second LSTM network.
The structure decoder and/or the position decoder may be a single-layer LSTM network, or may also be a multi-layer LSTM network.
In some exemplary embodiments, the LSTM network includes a plurality of recurrent units, represented by circles in fig. 5; the recurrent units in one row form one layer, so fig. 5 illustrates the case where the structure decoder and the position decoder are both two-layer LSTM networks.
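A rough PyTorch shape for the two decoders follows, assuming the two-layer LSTMs of fig. 5; the tag vocabulary, hidden sizes, and the box-regression head are illustrative assumptions, not details fixed by the disclosure.

```python
# Sketch: a two-layer LSTM structure decoder that emits one structure tag per
# step, and a two-layer LSTM position decoder that regresses a cell box.
import torch
import torch.nn as nn

class StructureDecoder(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, num_tags: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.tag_head = nn.Linear(hidden, num_tags)  # <thead>, <tr>, <td>, ...

    def forward(self, feats: torch.Tensor):          # feats: (B, T, feat_dim)
        steps, _ = self.lstm(feats)
        return self.tag_head(steps), steps           # tag logits + per-step features

class PositionDecoder(nn.Module):
    def __init__(self, feat_dim: int, hidden: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.box_head = nn.Linear(hidden, 4)         # (x1, y1, x2, y2)

    def forward(self, td_feats: torch.Tensor):       # features of the <td> steps
        steps, _ = self.lstm(td_feats)
        return self.box_head(steps)
```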
To make the disclosure clearer, the technical solution of this embodiment is further described below with reference to fig. 6. It should be noted that fig. 6 is an exemplary diagram of the overall network, in which the network structures of the structure decoder and the position decoder are as shown in fig. 5.
The convolutional neural network for extracting image features from the table image includes N convolutional layers and N deconvolution layers connected in sequence, where the size of the image feature map output by each of the N convolutional layers gradually decreases and the size of the image feature map output by each of the N deconvolution layers gradually increases. When N is 5, an exemplary structure of the convolutional and deconvolution layers is shown in fig. 6, where Ci denotes the i-th convolutional layer and Pi denotes the i-th deconvolution layer, i = 1, 2, ..., 5.
An exemplary procedure for the form image processing method is as follows:
Step 601, inputting the form image into an OCR recognizer to obtain text information and position information of each text box in the form image.
Step 602, inputting the text information into a knowledge-enhanced semantic representation (ERNIE) model to obtain semantic features of the text information.
In some exemplary embodiments, the text information may be input into ERNIE to obtain semantic features of each character in the text information, and the semantic features of the text information may be determined according to the semantic features of each character.
In some exemplary embodiments, one possible implementation of determining the semantic feature corresponding to the text information from the semantic features of its characters is: summing the semantic features of the characters and averaging them to obtain the semantic feature corresponding to the text information.
Step 603, generating a semantic feature map according to the semantic features and the position information.
In this embodiment, the size of the image feature map output by the j-th convolutional layer may be obtained, and an initial feature map of that size may be generated, where the pixel values in the initial feature map are all zero. Correspondingly, the reduction multiple of the form image can be determined according to the size of the image feature map and the size of the form image; the mapping position information of the first position information on the initial feature map is determined according to the reduction multiple; and the semantic features are filled into the mapping position information to obtain the semantic feature map. Here j may be any integer from 1 to 5.
In this embodiment, j = 3 is taken as an example.
It should be noted that, in the convolutional neural network in fig. 6, the input of the first sub-layer C3-1 of the third convolutional layer C3 includes two parts: the image feature map output by the second convolutional layer, and the semantic feature map of the table image.
Step 604, inputting the table image into the convolutional neural network to obtain a fusion feature map.
Specifically, after the table image is input into the convolutional neural network, the first convolutional layer C1 extracts image features from the table image and inputs the corresponding image feature map into the second convolutional layer C2. The second convolutional layer C2 continues feature extraction on the feature map output by C1 to obtain the feature map output by C2. The feature map output by C2 and the semantic feature map are then input into the first sub-layer C3-1 of the third convolutional layer C3; C3-1 continues feature extraction on the feature map output by C2 to obtain a processed feature map, performs feature fusion on the processed feature map and the semantic feature map to obtain a fused feature map, and takes this fused map as the output of the third convolutional layer C3. The output of C3 is input into the fourth convolutional layer C4 for further feature extraction, and the feature map output by C4 is input into the fifth convolutional layer C5, which continues the feature extraction. The feature maps output by the five convolutional layers are then input into the deconvolution layers symmetric to them to obtain the corresponding output feature maps, and feature fusion is performed on these feature maps to obtain the fusion feature map. Note that reference a in fig. 6 denotes the fusion feature map.
In some exemplary embodiments, the ratio of the size of the fused feature map to the size of the form image in the present embodiment is a preset ratio, for example, the preset ratio may be 1/4. That is, the size of the fused feature map is 1/4 of the size of the form image.
Step 605, inputting the fused feature map into a structure decoder to obtain the table structure information of the table image.
Specifically, the fusion feature map is input into the structure decoder, which decodes it into a table-specific tag sequence, such as "<thead><tr><td></td></tr>", to obtain the decoding result of the table structure.
Step 606, extracting the feature of the decoding unit corresponding to the <td> tag (a tag containing characters) in the decoding result, inputting it into the position decoder, and predicting the second position information, in the form image, of the cell image containing characters.
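Step 606 reduces to selecting the decoder steps whose predicted tag is <td>; a minimal illustration, with hypothetical names:

```python
# Sketch of step 606: keep the features of decoding units predicted as <td>
# and pass only those to the position decoder.
from typing import List, Sequence

def select_td_features(pred_tags: Sequence[str], step_features: Sequence) -> List:
    """pred_tags and step_features are aligned per decoding step."""
    return [f for tag, f in zip(pred_tags, step_features) if tag == "<td>"]
```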
Step 607, obtaining the third position information matched with the second position information from the first position information, and using the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
In order to implement the above embodiments, the embodiments of the present disclosure also provide a form image processing apparatus.
Fig. 7 is a schematic diagram according to a seventh embodiment of the present disclosure, which provides a form image processing apparatus.
As shown in fig. 7, the form image processing apparatus 700 may include a first obtaining module 701, a first determining module 702, a second obtaining module 703, and a second determining module 704, wherein:
the first obtaining module 701 is configured to obtain first text information and first position information of each text box in the form image.
A first determining module 702, configured to determine second position information of the cell image to be identified in the form image.
The second obtaining module 703 is configured to obtain third location information that matches the second location information from the first location information.
And the second determining module 704 is configured to use the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
When the form image is processed, the form image processing apparatus according to the embodiment of the disclosure acquires first text information and first position information of each text box in the form image, determines second position information of a cell image to be recognized in the form image, acquires third position information matched with the second position information from the first position information, and takes second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized. Therefore, the text content in the cell image to be recognized can be conveniently determined based on the text information and position information of the text boxes in the form image, which reduces the complexity of form image processing and improves its accuracy.
In one embodiment of the present disclosure, as shown in fig. 8, the form image processing apparatus 800 may include: a first obtaining module 801, a first determining module 802, a second obtaining module 803, and a second determining module 804, wherein the first determining module 802 may include: the obtaining sub-module 8021, the generating sub-module 8022 and the determining sub-module 8023, wherein the determining sub-module 8023 includes: a fusion unit 80231, a first determination unit 80232, a second determination unit 80233, and a third determination unit 80234.
It should be noted that, for detailed descriptions of the first obtaining module 801, the second obtaining module 803 and the second determining module 804, reference may be made to the descriptions of the first obtaining module 701, the second obtaining module 703 and the second determining module 704 in fig. 7, and details are not repeated here.
In an embodiment of the present disclosure, the first determining module 802 may include:
an obtaining submodule 8021 for obtaining an image feature map of the form image;
the generating submodule 8022 is configured to generate a semantic feature map of the form image according to the semantic feature of the first text information and the first position information;
the determining submodule 8023 is configured to determine, according to the image feature map and the semantic feature map, second position information of the cell image to be identified in the form image.
In an embodiment of the present disclosure, the generating submodule 8022 is specifically configured to: generating an initial feature map with the same size as the image feature map, wherein the pixel values on the initial feature map are all zero; determining the reduction multiple of the form image according to the size of the image feature map and the size of the form image; determining the mapping position information of the first position information on the initial characteristic diagram according to the reduction multiple; and filling the semantic features into the mapping position information to obtain a semantic feature map.
In an embodiment of the present disclosure, the determining sub-module 8023 includes:
a fusion unit 80231, configured to perform feature fusion on the image feature map and the semantic feature map to obtain a fusion feature map;
a first determining unit 80232, configured to determine table structure information of the table image according to the fused feature map, where the table structure information includes a text label;
a second determining unit 80233, configured to use the cell image corresponding to the text label in the form image as the cell image to be recognized;
the third determining unit 80234 is configured to determine, based on the fused feature map and the text label, second position information of the cell image to be recognized in the form image.
In an embodiment of the disclosure, the first determining unit 80232 is specifically configured to: and processing the fusion characteristic graph by adopting a structure decoder to obtain the table structure information of the table image.
In an embodiment of the disclosure, the first determining module 802 is specifically configured to: acquiring a target decoding unit corresponding to the text label from a plurality of decoding units of the structure decoder; and processing the output characteristic of the target decoding unit and the fusion characteristic graph by adopting a position decoder to obtain second position information of the cell image to be recognized.
It should be noted that the above explanation of the table image processing method is also applicable to the table image processing apparatus of this embodiment, and details are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 may include a computing unit 901, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the form image processing method. For example, in some embodiments, the form image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the form image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the form image processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
A computer system may include a client and a server. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A form image processing method, comprising:
acquiring first text information and first position information of each text box in a form image;
determining second position information of a cell image to be identified in the form image;
acquiring third position information matched with the second position information from the first position information;
and taking the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
2. The method of claim 1, wherein the determining second location information for the cell image to be identified in the form image comprises:
acquiring an image feature map of the form image;
generating a semantic feature map of the form image according to the semantic features of the first text information and the first position information;
and determining the second position information according to the image feature map and the semantic feature map.
3. The method of claim 2, wherein the generating a semantic feature map of the form image according to the semantic features of the first text information and the first position information comprises:
generating an initial feature map with the same size as the image feature map, wherein all pixel values on the initial feature map are zero;
determining a reduction multiple of the form image according to the size of the image feature map and the size of the form image;
determining the mapping position information of the first position information on the initial feature map according to the reduction multiple;
and filling the semantic features into the mapping position information to obtain the semantic feature map.
4. The method according to claim 2 or 3, wherein the determining the second position information from the image feature map and the semantic feature map comprises:
performing feature fusion on the image feature map and the semantic feature map to obtain a fusion feature map;
according to the fusion feature map, determining table structure information of the table image, wherein the table structure information comprises a text label;
taking a cell image corresponding to the text label in the form image as the cell image to be identified;
determining the second location information based on the fused feature map and the text label.
5. The method of claim 4, wherein the determining table structure information for the table image from the fused feature map comprises:
and processing the fusion characteristic graph by adopting a structure decoder to obtain the table structure information.
6. The method of claim 5, wherein the determining the second location information based on the fused feature map and the text label comprises:
acquiring a target decoding unit corresponding to the text label from a plurality of decoding units of the structure decoder;
and processing the output characteristic of the target decoding unit and the fusion characteristic graph by adopting a position decoder to obtain second position information of the cell image to be recognized.
7. A form image processing apparatus comprising:
the first obtaining module is used for obtaining first text information and first position information of each text box in the form image;
the first determining module is used for determining second position information of the cell image to be identified in the form image;
the second acquisition module is used for acquiring third position information matched with the second position information from the first position information;
and the second determining module is used for taking the second text information in the text box corresponding to the third position information as the text content in the cell image to be recognized.
8. The apparatus of claim 7, wherein the first determining module comprises:
the obtaining submodule is used for obtaining an image characteristic diagram of the form image;
the generating submodule is used for generating a semantic feature map of the form image according to the semantic features of the first text information of each text box and the first position information;
and the determining submodule is used for determining the second position information according to the image feature map and the semantic feature map.
9. The apparatus of claim 8, wherein the generating submodule is specifically configured to:
generating an initial feature map with the same size as the image feature map, wherein all pixel values on the initial feature map are zero;
determining a reduction multiple of the form image according to the size of the image feature map and the size of the form image;
determining the mapping position information of the first position information on the initial feature map according to the reduction multiple;
and filling the semantic features into the mapping position information to obtain the semantic feature map.
10. The apparatus of claim 8 or 9, wherein the determining submodule comprises:
the fusion unit is used for carrying out feature fusion on the image feature map and the semantic feature map to obtain a fusion feature map;
a first determining unit, configured to determine, according to the fused feature map, table structure information of the table image, where the table structure information includes a text label;
a second determining unit, configured to use a cell image corresponding to the text label in the form image as the cell image to be recognized;
and the third determining unit is used for determining second position information of the cell image to be recognized in the form image based on the fusion feature map and the text label.
11. The apparatus according to claim 10, wherein the first determining unit is specifically configured to:
and processing the fusion feature map by adopting a structure decoder to obtain the table structure information of the table image.
12. The apparatus of claim 11, wherein the first determining module is specifically configured to:
acquiring a target decoding unit corresponding to the text label from a plurality of decoding units of the structure decoder;
and processing the output characteristic of the target decoding unit and the fusion characteristic graph by adopting a position decoder to obtain second position information of the cell image to be recognized.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1-6.
CN202210427478.7A 2022-04-21 2022-04-21 Table image processing method, apparatus and storage medium Active CN114639107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210427478.7A CN114639107B (en) 2022-04-21 2022-04-21 Table image processing method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210427478.7A CN114639107B (en) 2022-04-21 2022-04-21 Table image processing method, apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN114639107A true CN114639107A (en) 2022-06-17
CN114639107B CN114639107B (en) 2023-03-24

Family

ID=81950871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210427478.7A Active CN114639107B (en) 2022-04-21 2022-04-21 Table image processing method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN114639107B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US6687404B1 (en) * 1997-06-20 2004-02-03 Xerox Corporation Automatic training of layout parameters in a 2D image model
US20060253491A1 (en) * 2005-05-09 2006-11-09 Gokturk Salih B System and method for enabling search and retrieval from image files based on recognized information
US10095925B1 (en) * 2017-12-18 2018-10-09 Capital One Services, Llc Recognizing text in image data
CN111126049A (en) * 2019-12-14 2020-05-08 中国科学院深圳先进技术研究院 Object relation prediction method and device, terminal equipment and readable storage medium
CN111382717A (en) * 2020-03-17 2020-07-07 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN112528863A (en) * 2020-12-14 2021-03-19 中国平安人寿保险股份有限公司 Identification method and device of table structure, electronic equipment and storage medium
CN112613513A (en) * 2020-12-31 2021-04-06 北京市商汤科技开发有限公司 Image recognition method, device and system
CN112686223A (en) * 2021-03-12 2021-04-20 腾讯科技(深圳)有限公司 Table identification method and device and computer readable storage medium
CN113011144A (en) * 2021-03-30 2021-06-22 中国工商银行股份有限公司 Form information acquisition method and device and server
CN113065536A (en) * 2021-06-03 2021-07-02 北京欧应信息技术有限公司 Method of processing table, computing device, and computer-readable storage medium
CN113361247A (en) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 Document layout analysis method, model training method, device and equipment
CN113378789A (en) * 2021-07-08 2021-09-10 京东数科海益信息科技有限公司 Cell position detection method and device and electronic equipment
CN113435257A (en) * 2021-06-04 2021-09-24 北京百度网讯科技有限公司 Method, device and equipment for identifying form image and storage medium
CN113657274A (en) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Table generation method and device, electronic equipment, storage medium and product
CN113705468A (en) * 2021-08-30 2021-11-26 平安国际智慧城市科技股份有限公司 Digital image identification method based on artificial intelligence and related equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MANUEL CARBONELL ET AL.: "A neural model for text localization, transcription and named entity recognition in full pages", Pattern Recognition Letters *
梅俊辉: "Research on information extraction and analysis of logistics bills", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN114639107B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
EP3923185A2 (en) Image classification method and apparatus, electronic device and storage medium
CN112949415B (en) Image processing method, apparatus, device and medium
US20230401828A1 (en) Method for training image recognition model, electronic device and storage medium
EP3916634A2 (en) Text recognition method and device, and electronic device
US20120054601A1 (en) Methods and systems for automated creation, recognition and display of icons
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
KR20220125712A (en) Image processing method, text recognition method and device
CN114429637B (en) Document classification method, device, equipment and storage medium
CN112989970A (en) Document layout analysis method and device, electronic equipment and readable storage medium
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
US20220392242A1 (en) Method for training text positioning model and method for text positioning
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN113642583A (en) Deep learning model training method for text detection and text detection method
CN114863439B (en) Information extraction method, information extraction device, electronic equipment and medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
US11881044B2 (en) Method and apparatus for processing image, device and storage medium
CN113326766A (en) Training method and device of text detection model and text detection method and device
CN114661904B (en) Method, apparatus, device, storage medium, and program for training document processing model
CN114639107B (en) Table image processing method, apparatus and storage medium
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN114120305A (en) Training method of text classification model, and recognition method and device of text content
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115082598A (en) Text image generation method, text image training method, text image processing method and electronic equipment
CN113887394A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant