CN116152833B - Training method of form restoration model based on image and form restoration method - Google Patents


Info

Publication number
CN116152833B
Authority
CN
China
Prior art keywords
image
vector representation
text
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211735420.5A
Other languages
Chinese (zh)
Other versions
CN116152833A (en)
Inventor
李晨辉
柯博
胡腾
冯仕堃
陈永锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211735420.5A priority Critical patent/CN116152833B/en
Publication of CN116152833A publication Critical patent/CN116152833A/en
Application granted granted Critical
Publication of CN116152833B publication Critical patent/CN116152833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables

Abstract

The present disclosure provides a training method for an image-based table restoration model and a table restoration method, relating to the field of artificial intelligence, and in particular to image processing, deep learning and natural language processing technologies. The scheme is specifically implemented as follows: acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text; performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to acquire a second image vector representation and a second text vector representation, and outputting a category set and a detection box set for the table image; and adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained. The disclosure thereby acquires a converged target table restoration model, can restore tables based on that model, and improves the accuracy of table restoration.

Description

Training method of form restoration model based on image and form restoration method
Technical Field
The present disclosure relates to the field of computer technology, and more particularly to the field of artificial intelligence, and in particular to image processing, machine learning, deep learning, and natural language processing techniques.
Background
In the related art, table restoration is generally performed in one of three ways: a traditional image-analysis method, which divides rows, columns and merged cells using the original frame lines of the table in the picture; a single-modality image-recognition method, which detects row regions from the picture; or an artificial-intelligence method, which first recognizes which text fragments in the picture belong to the same cell and then combines the text information of each cell with the image information to judge whether cells belong to the same row or column. How to train a converged table restoration model and use it to restore tables efficiently and accurately has therefore become an important research direction.
Disclosure of Invention
The disclosure provides a training method for an image-based table restoration model and an image-based table restoration method.
According to an aspect of the present disclosure, there is provided a training method for an image-based table restoration model, including: acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text; performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation; acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells; and adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained.
According to another aspect of the present disclosure, there is provided an image-based table restoration method, including:
acquiring a target table image to be recognized, and acquiring an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text; inputting the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and the type of each detection box; and performing table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is a model obtained by the training method according to any one of claims 1-9.
According to another aspect of the present disclosure, there is provided a training apparatus for an image-based table restoration model, including:
an acquisition module, configured to acquire a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
a cross-modal module, configured to perform cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation;
an output module, configured to acquire respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells;
and an adjusting module, configured to adjust model parameters of the table restoration model based on the category set and the detection box set, and continue training until a final target table restoration model is obtained.
According to another aspect of the present disclosure, there is provided an image-based form restoration apparatus including:
a vector representation module, configured to acquire a target table image to be recognized, and acquire an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
an input module, configured to input the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and the type of each detection box;
and a restoration module, configured to perform table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is a model obtained by the training method according to any one of claims 1-9.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the image-based table restoration model of the first aspect of the present disclosure or the image-based table restoration method of the second aspect.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the image-based table restoration model according to the first aspect of the present disclosure or the image-based table restoration method according to the second aspect.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the training method of the image-based table restoration model according to the first aspect of the present disclosure or the image-based table restoration method according to the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a segmentation of a table image;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a form reduction model;
FIG. 9 is a schematic diagram of an image-based table restoration method;
FIG. 10 (a) is a schematic illustration of a target form image to be identified;
FIG. 10 (b) is a schematic diagram of another target table image restoration result;
FIG. 10 (c) is a schematic diagram of another target form image restoration result;
FIG. 10 (d) is a schematic diagram of another target table image restoration result;
FIG. 10 (e) is a schematic diagram of another target form image restoration result;
FIG. 10 (f) is a schematic diagram of a target reduction table;
FIG. 11 is a block diagram of an image-based form restoration model training apparatus for implementing a method of training an image-based form restoration model according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an image-based form restoration device for implementing an image-based form restoration method of an embodiment of the present disclosure;
Fig. 13 is a block diagram of an electronic device for implementing the training method of the image-based form restoration model and the image-based form restoration method of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The technical field to which the aspects of the present disclosure relate is briefly described below:
Computer technology broadly covers computer system technology, computer device technology, computer component technology and computer assembly technology. It comprises: the basic principles of arithmetic methods and the design of arithmetic units, instruction systems, the central processing unit (CPU), the pipeline principle and its application in CPU design, storage systems, buses, and input/output.
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), covering both hardware-level and software-level technologies. Artificial-intelligence software technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big-data processing and knowledge-graph technologies.
Image processing technology is the technology of processing image information with a computer. It mainly comprises image digitization, image enhancement and restoration, image data encoding, image segmentation and image recognition.
Machine Learning (ML) studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance; it is the fundamental way to make computers intelligent.
Deep Learning (DL) is a new research direction in the field of machine learning, introduced to bring machine learning closer to its original goal: artificial intelligence. Deep learning learns the inherent regularities and representation levels of sample data, and the information obtained in the process greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to give machines human-like analytical learning ability, enabling them to recognize text, image and sound data. Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition far surpassing earlier techniques.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective natural-language communication between humans and computers, integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to, yet importantly different from, linguistic research: rather than studying natural language in general, it develops computer systems, and in particular software systems, that can effectively realize natural-language communication. It is thus a part of computer science, and is mainly applied to machine translation, public opinion monitoring, automatic summarization, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR and the like.
A training method of an image-based form restoration model and an image-based form restoration method according to an embodiment of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The execution subject of the training method of the image-based table restoration model of this embodiment is an image-based table restoration model training apparatus, which may specifically be a hardware device, or software in a hardware device, the hardware device being, for example, a terminal device or a server.
As shown in fig. 1, the training method of the image-based table restoration model provided in the embodiment includes the following steps:
s101, acquiring a first image vector representation of a table image, a first text vector representation of a text in the table image and a position vector representation corresponding to the text.
It should be noted that the present disclosure is not limited to a specific manner of acquiring the first image vector representation of the tabular image.
Alternatively, the table image may be segmented to obtain a plurality of image slices, the image slices may then be input into a feature extraction network to obtain multi-scale image information, and the multi-scale image information may be concatenated to form the first image vector representation.
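As a minimal illustration of this slicing-and-concatenation step (not the patent's actual network), the following Python sketch splits a square image into patches at three scales, projects each patch with a random matrix standing in for the feature extraction network, and stacks the per-scale features into one token sequence:

```python
import numpy as np

def image_to_first_vector(image, patch_sizes=(16, 8, 4), dim=32, seed=0):
    """Split a square table image into patches at several scales, project each
    patch to a dim-dimensional feature, and concatenate the per-scale features
    into one token sequence (standing in for the first image vector representation)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    tokens = []
    for p in patch_sizes:
        # hypothetical per-scale projection; a real model uses a trained network
        proj = rng.standard_normal((p * p, dim))
        for i in range(0, h, p):
            for j in range(0, w, p):
                tokens.append(image[i:i + p, j:j + p].reshape(-1) @ proj)
    return np.stack(tokens)
```

For a 16×16 image this yields 1 + 4 + 16 = 21 tokens, one per patch across the three scales.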
Note that, the specific manner of acquiring the first text vector representation of the text and the position vector representation corresponding to the text in the form image is not limited in this disclosure.
Optionally, the table image may be input into an optical character recognition model to obtain the text and the position information corresponding to the text; any text may then be input into a tokenizer for segmentation, a pre-trained text representation dictionary queried to obtain the first text vector representation, and a pre-trained two-dimensional position information representation dictionary queried to obtain the position vector representation corresponding to the text.
S102, performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation.
Alternatively, after the first image vector representation, the first text vector representation and the position vector representation are obtained, the position vector representation may be added to the first image vector representation to obtain a fused image vector representation, and the position vector representation may be added to the first text vector representation to obtain a fused text vector representation.
Further, the fused image vector representation and the fused text vector representation may be input into an encoder of the table restoration model for cross-modal multi-layer self-attention, thereby obtaining the second image vector representation and the second text vector representation.
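A minimal sketch of this fusion step, assuming single-head scaled dot-product attention in place of the model's actual multi-layer encoder:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_layer(img_vecs, txt_vecs, img_pos, txt_pos):
    """One cross-modal self-attention step: add the position vector
    representations to each modality, concatenate both token sequences,
    let every token attend over both modalities, and split the result back
    into the second image / second text vector representations."""
    fused = np.concatenate([img_vecs + img_pos, txt_vecs + txt_pos], axis=0)
    attn = softmax(fused @ fused.T / np.sqrt(fused.shape[1]))
    out = attn @ fused
    return out[:len(img_vecs)], out[len(img_vecs):]
```

Because both modalities sit in one attention matrix, each image token can attend to every text token and vice versa, which is what makes the layer cross-modal.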
S103, acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells.
Optionally, a selector may select K anchor vectors containing position information for each representation, and the historical query vectors of that vector representation obtained in the last training round of the table restoration model may be added to the anchor vectors to obtain the respective query vectors of the second image vector representation and the second text vector representation.
When training for the first time, K zero-initialized learnable query vectors (queries) may be added to the anchor vectors to obtain the respective query vectors of the second image vector representation and the second text vector representation.
It should be noted that the query vectors of the second image vector representation together with the second image vector representation, and the query vectors of the second text vector representation together with the second text vector representation, may be input into the decoder of the table restoration model for multi-layer self-attention, and the decoder of the table restoration model outputs the category set and the detection box set of the table image.
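The decoder's two outputs can be illustrated with hypothetical linear heads (the weight matrices, category ordering, and sigmoid box normalization are assumptions, not taken from the patent):

```python
import numpy as np

CATEGORIES = ["row", "column", "merged_cell"]

def decode_outputs(decoded_queries, w_cls, w_box):
    """Two hypothetical linear output heads on the decoded query vectors:
    one predicts a category among {row, column, merged cell}, the other a
    normalized detection box [x, y, w, h] in (0, 1)."""
    logits = decoded_queries @ w_cls                          # (K, 3)
    categories = [CATEGORIES[i] for i in logits.argmax(axis=1)]
    boxes = 1.0 / (1.0 + np.exp(-(decoded_queries @ w_box)))  # (K, 4)
    return categories, boxes
```

Each of the K queries thus yields one (category, box) pair, forming the category set and detection box set of the table image.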
S104, adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained.
After the category set and the detection box set are acquired, the loss function of the table restoration model can be computed, the model parameters of the table restoration model adjusted according to the loss function, and training continued until the final target table restoration model is obtained.
Alternatively, the loss function of the table restoration model may consist of a category loss function and a position loss function.
It should be noted that, the setting of the training stop condition is not limited in this disclosure, and may be selected according to actual situations.
For example, the training stop condition may be that the loss function of the table restoration model reaches a preset loss-function threshold, or that the number of adjustments of the model parameters of the table restoration model reaches a preset count threshold.
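The two example stop conditions reduce to a simple check; the threshold values below are illustrative only:

```python
def should_stop(loss, num_updates, loss_threshold=0.01, max_updates=10_000):
    """Training stops once the loss reaches a preset threshold or the number
    of parameter adjustments reaches a preset limit (illustrative values)."""
    return loss <= loss_threshold or num_updates >= max_updates
```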
After the training stop condition is satisfied, the final target table restoration model is obtained.
With the training method of the image-based table restoration model according to embodiments of the present disclosure, a first image vector representation of a table image, a first text vector representation of text in the table image and a position vector representation corresponding to the text are acquired; the table restoration model performs cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation to obtain a second image vector representation and a second text vector representation; the respective query vectors of the second image vector representation and the second text vector representation are acquired, and based on the query vectors, the second image vector representation and the second text vector representation, a category set and a detection box set of the table image are output, the category set comprising rows, columns and merged cells; the model parameters of the table restoration model are adjusted based on the category set and the detection box set, and training continues until a final target table restoration model is obtained. A converged target table restoration model is thereby obtained, tables can be restored based on the target table restoration model, and the accuracy of table restoration is improved.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 2, the training method of the image-based table restoration model provided in the embodiment includes the following steps:
s201, acquiring a first image vector representation of the form image, a first text vector representation of a text in the form image and a position vector representation corresponding to the text.
As a possible implementation manner, as shown in fig. 3, on the basis of the foregoing embodiment, the specific process of acquiring the first image vector representation of the table image in the foregoing step S201 includes the following steps:
s301, cutting the table image to obtain a plurality of image slices.
For example, as shown in fig. 4, a given table image may be split into 16×16 image slices, 4×4 image slices and 8×8 image slices.
S302, inputting the plurality of image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and performing feature extraction layer by layer with the plurality of self-attention layers to obtain the image features of each layer, wherein the receptive fields of the plurality of self-attention layers differ in size.
Optionally, for a self-attention layer i among the plurality of self-attention layers, the image features i output by self-attention layer i are acquired, adjacent image features i are combined to obtain a plurality of image feature groups, and the image feature groups are input into self-attention layer i+1 for feature extraction, where i is an integer greater than or equal to 1.
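A sketch of this adjacent-feature combination between layers, assuming a 2×2 grouping (the patent does not fix the group size):

```python
import numpy as np

def merge_adjacent_features(features, grid_h, grid_w):
    """Combine each 2x2 neighborhood of the feature grid into one group by
    concatenating the four features, halving the grid between self-attention
    layers so that deeper layers cover a larger area of the image."""
    f = features.reshape(grid_h, grid_w, -1)
    merged = np.concatenate(
        [f[0::2, 0::2], f[0::2, 1::2], f[1::2, 0::2], f[1::2, 1::2]], axis=-1)
    return merged.reshape(-1, merged.shape[-1])
```

A 4×4 grid of 4-dimensional features becomes a 2×2 grid of 16-dimensional groups, which layer i+1 then processes.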
S303, obtaining a first image vector representation based on the image characteristics output by each layer.
After the image features output by each layer are obtained, they may be combined to obtain the first image vector representation.
As a possible implementation manner, as shown in fig. 5, on the basis of the foregoing embodiment, the specific process of obtaining the first text vector representation and the position vector representation of the form image in the foregoing step S201 includes the following steps:
s501, performing optical character recognition OCR on the table image to acquire all texts in the table image and position information corresponding to each text.
Optical Character Recognition (OCR) refers to the process of determining the shape of characters by detecting dark and light patterns, and then translating the shapes into computer text with character recognition methods.
Alternatively, the table image may be input into an optical character recognition (OCR) model, for example the PaddleOCR model, to acquire all texts in the table image and the position information corresponding to each text.
For example, the table image is input into an OCR model to obtain the character string of each text fragment in the table image, yielding all texts and the position information corresponding to each text. The position information of a text is represented by a box [x, y, w, h]; each box is a list of 4 floating-point numbers, where x is the abscissa of the text in the coordinate system, y is the ordinate of the text in the coordinate system, w is the width of the text along the x direction, and h is the height of the text along the y direction.
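In code, the position information of one fragment is simply a four-float list (the helper name below is illustrative, not from the patent):

```python
def text_box(x, y, w, h):
    """Position information of one text fragment: a list of 4 floats, where
    (x, y) is the fragment's coordinate and w, h its width and height."""
    return [float(x), float(y), float(w), float(h)]

box = text_box(12, 40, 80, 16)  # a fragment at (12, 40), 80 wide, 16 tall
```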
S502, inputting any text into a tokenizer for segmentation, and querying a pre-trained text representation dictionary with the segmented tokens to obtain the vector representation corresponding to each text token.
Optionally, any text may be input into a tokenizer for segmentation, the segmented tokens looked up in a pre-trained text representation (text embedding) dictionary, and each text token converted into a vectorized representation, thereby obtaining the vector representation corresponding to the text token.
All text tokens of the same fragment use the same position information.
S503, according to the vector representation of the text token, obtaining a first text vector representation of any text.
In the embodiment of the application, after the vector representation of the text token is obtained, the first text vector representation of any text can be obtained according to the vector representation of the text token.
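A toy sketch of steps S502-S503, with whitespace splitting standing in for the real tokenizer and a tiny hand-built embedding dictionary (both are assumptions for illustration):

```python
import numpy as np

def first_text_vector(text, vocab, embedding_table):
    """Tokenize a text fragment (whitespace split stands in for the real
    tokenizer), look each token up in the pretrained text representation
    dictionary, and stack the token vectors into the fragment's first text
    vector representation."""
    token_ids = [vocab[tok] for tok in text.split()]
    return embedding_table[token_ids]

vocab = {"total": 0, "amount": 1}      # hypothetical tiny vocabulary
table = np.arange(8.0).reshape(2, 4)   # hypothetical embedding dictionary
vec = first_text_vector("total amount", vocab, table)
```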
S504, querying a pre-trained two-dimensional position information representation dictionary for any position to obtain a position vector representation of the position information.
Alternatively, a position vector representation of the position information may be obtained by querying a pre-trained two-dimensional position information representation dictionary (2D position embedding) for any position.
S202, adding the position vector representation and the first image vector representation to obtain a fusion image vector representation.
After the position vector representation and the first image vector representation are obtained, the position vector representation and the first image vector representation may be added in a vector addition manner to obtain a fused image vector representation.
S203, adding the position vector representation and the first text vector representation to obtain a fusion text vector representation.
After the position vector representation and the first text vector representation are obtained, the position vector representation and the first text vector representation may be added in a vector addition manner to obtain a fused text vector representation.
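The two fusion steps above are plain element-wise vector addition; a minimal sketch with made-up three-dimensional vectors:

```python
import numpy as np

# S202/S203 in one line each: the position vector representation is fused
# with the image and text representations by element-wise addition
# (all three vectors must have the same dimensionality).
position_vec = np.array([0.1, 0.2, 0.3])
image_vec = np.array([1.0, 1.0, 1.0])
text_vec = np.array([2.0, 2.0, 2.0])

fused_image = image_vec + position_vec  # fused image vector representation
fused_text = text_vec + position_vec    # fused text vector representation
```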
S204, inputting the fusion image vector representation and the fusion text vector representation into an encoder of the table restoration model to perform cross-modal multi-layer self-attention, obtaining a second image vector representation and a second text vector representation.
S205, acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection frame set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merging cells.
As a possible implementation manner, as shown in fig. 6, on the basis of the foregoing embodiment, a specific process of obtaining respective query vectors of the second image vector representation and the second text vector representation in the foregoing step S205 includes the following steps:
S601, selecting an anchor point vector containing position information for any one of the second image vector representation and the second text vector representation.
Alternatively, any vector representation may be input into a corresponding selector, and the corresponding selector selects, from that vector representation, a portion of the largest vector representations as the anchor vectors of that vector representation.
S602, acquiring a historical query vector represented by any vector obtained by last training of a table restoration model.
S603, adding the anchor point vector and the historical query vector to obtain a query vector represented by any vector.
Alternatively, the anchor vector and the historical query vector may be added to obtain a query vector (query) represented by either vector.
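One plausible reading of S601-S603 can be sketched as below, assuming "a portion of the largest vector representations" means keeping the k vectors with the largest norm; that selection criterion is an assumption, since the patent does not define it precisely.

```python
import numpy as np

def select_anchors(vectors, k):
    """S601 sketch: keep the k vector representations with the largest
    norm as anchors (one plausible reading of the selector's rule)."""
    norms = np.linalg.norm(vectors, axis=1)
    top = np.argsort(norms)[-k:]
    return vectors[np.sort(top)]

def make_queries(anchors, historical_queries):
    """S602-S603: query = anchor vector + query vector from the last
    training step (vector addition, shapes must match)."""
    return anchors + historical_queries

vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
anchors = select_anchors(vecs, k=2)  # the two largest-norm rows
queries = make_queries(anchors, np.zeros_like(anchors))
```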
As a possible implementation manner, as shown in fig. 7, based on the above embodiment, the specific process of outputting the category set and the detection frame set of the table image based on the query vector, the second image vector representation and the second text vector representation in the above step S205 includes the following steps:
S701, adding the second image vector representation and the query vector of the second image vector representation to obtain a third image vector representation.
Alternatively, the second image vector representation and the query vector of the second image vector representation may be added to obtain the third image vector representation.
S702, adding the second text vector representation and the query vector of the second text vector representation to obtain a third text vector representation.
Alternatively, the second text vector representation and the query vector of the second text vector representation may be added to obtain a third text vector representation.
S703, inputting the third image vector representation and the third text vector representation into a decoder of the table restoration model for multi-layer attention, and outputting the category set and the detection frame set of the table image.
It should be noted that, multi-layer Self-Attention may be performed by a Multi-Head Self-Attention mechanism (Multi-Head Self-Attention), and the category set and the detection frame set of the output table image may be decoded by a Decoder (Decoder).
Wherein the category set includes rows, columns, and merging cells.
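The multi-head self-attention mentioned above builds on scaled dot-product self-attention. Below is a single-head numpy sketch without the learned query/key/value projections of the real model (the multi-head version runs several such heads in parallel and concatenates their outputs):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over a token matrix x,
    using x itself as query, key, and value (no learned projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # attention-weighted sum

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(tokens)
```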
S206, based on the category set and the detection frame set, adjusting model parameters of the form restoration model, and continuing training until a final target form restoration model is obtained.
Optionally, Hungarian optimal matching may be performed between the predicted categories in the category set and the predicted detection frames in the detection frame set to obtain a matching result between the predicted detection frames and the predicted categories. A category loss function and a position loss function of the table restoration model are then obtained according to the matching result and the label result of the table image, a loss function of the table restoration model is obtained from the category loss function and the position loss function, and the model parameters of the table restoration model are adaptively optimized based on the loss function.
Optionally, after the category loss function and the position loss function are obtained, they may be summed to obtain the loss function of the table restoration model.
It should be noted that, after the loss function is obtained, the model parameters of the table restoration model may be adaptively optimized based on the loss function by the adaptive moment estimation (Adam) algorithm, and training may be continued until the final target table restoration model is obtained.
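The matching and loss composition described in the paragraphs above can be sketched as follows. A brute-force assignment stands in for the Hungarian algorithm (it yields the same optimum for small inputs), and the cost values and unit losses are placeholders:

```python
from itertools import permutations

def optimal_match(cost):
    """Brute-force optimal assignment (same result as the Hungarian
    algorithm for small square cost matrices): minimize the total
    prediction-to-label matching cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

def total_loss(class_loss, box_loss):
    """Loss composition sketch: the model loss is the sum of the
    category loss and the position loss."""
    return class_loss + box_loss

cost = [[0.9, 0.1], [0.2, 0.8]]  # rows: predictions, cols: labels
match = optimal_match(cost)
```

In practice `scipy.optimize.linear_sum_assignment` would replace the brute-force search for realistic numbers of detection frames.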
The table restoration model proposed by the application is explained below.
For example, as shown in fig. 8, the architecture of the table restoration model includes, from the lower layers to the upper layers, a text downsampling layer and a text vector representation layer, by which all text in the table image can be converted from natural-language words into mathematical text vector representations, and an image vector representation layer (Swin-Large), which can convert the table image from its graphical representation (e.g., three primary colors) into a mathematical image vector representation. The text and the image each have a corresponding two-dimensional position information representation, which is converted into the corresponding position vector representation by querying a pre-trained model. Before the Encoder, the position vector representations are added to the text vector representation and the image vector representation, respectively, by vector addition; the text vector representation fused with position information is then concatenated (concat) with the image vector representation as the input of the Encoder. ERNIE-Layout is a table restoration model pre-trained on a fully connected Transformer structure: in the Encoder, the text tokens and the image tokens attend to each other through a multi-head self-attention mechanism (Multi-Head Self-Attention), so that cross-modal information is fused. A Selector selects anchor vectors from the Encoder output, which are added to the historical query vectors to serve as the input queries of the Decoder; the Decoder performs cross-modal multi-layer attention between the queries and the Encoder output, and finally outputs a set of categories (rows, columns and merge cells) and a set of detection boxes (boxes).
According to the training method of the image-based table restoration model, by obtaining the image vector representation of the table image, the text vector representation, and the position vector representation corresponding to the text, the table restoration model can fully utilize cross-modal information, the accuracy and reliability of the training process of the image-based table restoration model are improved, and table restoration can subsequently be performed based on the trained target table restoration model, thereby improving the accuracy of table restoration.
Fig. 9 is a schematic diagram according to a seventh embodiment of the present disclosure. It should be noted that the execution body of the image-based form restoration method in this embodiment is an image-based form restoration device, which may specifically be a hardware device, or software in a hardware device, or the like. The hardware device may be, for example, a terminal device, a server, or the like.
As shown in fig. 9, the image-based table restoration method provided in this embodiment includes the following steps:
S901, acquiring a target table image to be identified, and acquiring an image vector representation of the target table image, a text vector representation of a text in the target table image and a position vector representation corresponding to the text.
It should be noted that the present disclosure is not limited to a specific manner of acquiring the first image vector representation of the target table image.
Alternatively, the target table image may be segmented to obtain a plurality of image slices, the image slices may then be input into the feature extraction network to obtain multi-scale image information, and the multi-scale image information may be concatenated to obtain the image vector representation.
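The segmentation step can be sketched as below. The patch size and array layout are illustrative assumptions; the sketch only shows how an image is cut into non-overlapping slices before feature extraction:

```python
import numpy as np

def slice_image(image, patch):
    """Cut an H x W x C image into non-overlapping patch x patch slices,
    as in the first step of the feature-extraction pipeline above."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)   # group by (row-block, col-block)
                 .reshape(-1, patch, patch, c))

img = np.zeros((8, 8, 3))
patches = slice_image(img, patch=4)
```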
Note that, the present disclosure is not limited to a specific manner of acquiring the text vector representation of the text and the position vector representation corresponding to the text in the target form image.
Optionally, the table image may be input into an optical character recognition model to obtain the text and the position information corresponding to the text; any text is input into a tokenizer for segmentation, a pre-trained text representation dictionary is queried to obtain the text vector representation, and a pre-trained two-dimensional position information representation dictionary is queried to obtain the position vector representation corresponding to the text.
S902, inputting the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection frames and types of each detection frame.
In the embodiment of the disclosure, after the image vector representation, the text vector representation and the position vector representation are acquired, the image vector representation, the text vector representation and the position vector representation may be input into a target table restoration model to output a recognition result corresponding to the target table image, where the recognition result includes a detection frame and a type of each detection frame.
S903, performing table restoration processing according to the recognition result and the position information of the text to obtain a target restoration table of the target table image.
The target table reduction model is a trained convergence model.
Optionally, after the recognition result and the position information of the text are obtained, sorting and intersection may be performed according to the first detection frames of the row type and the second detection frames of the column type in the recognition result to obtain all candidate cells of the table image. According to the third detection frames of the merging-cell type in the recognition result, the first candidate cells belonging to a third detection frame are determined among the candidate cells, and the first candidate cells are merged to obtain merging cells. A table to be filled is obtained based on the merging cells and the remaining second candidate cells among the candidate cells, and the text is filled into the table to be filled according to the position information of the text, obtaining the target restoration table of the target table image.
For example, as shown in fig. 10 (a), a row type restoration result of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (b), a column type restoration result of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (c), a restoration result of the merging cell type of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (d), and then the above three restoration results may be fused to obtain a complete structure of the wireless table, as shown in fig. 10 (e), and a text may be filled into the table to be filled according to the position information of the text, to obtain a target restoration table of the target table image, as shown in fig. 10 (f).
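The sorting-and-intersection step that produces the candidate cells in fig. 10 can be sketched as below. The patent does not specify the exact intersection rule, so treating each (row, column) pair as one candidate cell spanning the column's width and the row's height is an assumption:

```python
def cells_from_rows_cols(row_boxes, col_boxes):
    """Sketch of the candidate-cell step: sort the row and column detection
    boxes [x, y, w, h] and intersect them to enumerate candidate cells."""
    rows = sorted(row_boxes, key=lambda b: b[1])  # sort rows top-to-bottom by y
    cols = sorted(col_boxes, key=lambda b: b[0])  # sort columns left-to-right by x
    cells = []
    for rx, ry, rw, rh in rows:
        for cx, cy, cw, ch in cols:
            cells.append((cx, ry, cw, rh))        # column width x row height
    return cells

rows = [(0, 0, 100, 10), (0, 10, 100, 10)]
cols = [(0, 0, 40, 20), (40, 0, 60, 20)]
cells = cells_from_rows_cols(rows, cols)
```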
According to the image-based form restoration method disclosed by the embodiment of the disclosure, by acquiring a target form image to be identified, acquiring an image vector representation of the target form image and a text vector representation of a text in the target form image and a position vector representation corresponding to the text, inputting the image vector representation, the text vector representation and the position vector representation into a target form restoration model to output an identification result corresponding to the target form image, wherein the identification result comprises a detection frame and the type of each detection frame, and performing form restoration processing according to the identification result and the position information of the text to obtain a target restoration form of the target form image. According to the method and the device for restoring the target table image, the table structure of the target table image to be identified is restored through the target table restoring model, accuracy of restoring the table is guaranteed, and therefore the utilization rate of the target table image to be identified is improved.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the provisions of related laws and regulations, and do not violate public order and good morals.
Corresponding to the training method of the image-based table restoration model provided in the foregoing several embodiments, an embodiment of the present disclosure further provides a training device of the image-based table restoration model, and since the training device of the image-based table restoration model provided in the embodiment of the present disclosure corresponds to the training method of the image-based table restoration model provided in the foregoing several embodiments, implementation of the training method of the image-based table restoration model is also applicable to the training device of the image-based table restoration model provided in the embodiment, and will not be described in detail in the present embodiment.
FIG. 11 is a schematic diagram of a training apparatus for image-based tabular restoration model in accordance with one embodiment of the present disclosure.
As shown in fig. 11, the training device 1000 for the image-based table restoration model includes: an acquisition module 1010, a cross-modality module 1020, an output module 1030, and an adjustment module 1040, wherein:
an obtaining module 1010, configured to obtain a first image vector representation of the table image and a first text vector representation of a text in the table image and a position vector representation corresponding to the text;
A cross-modal module 1020 configured to perform cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation by a table reduction model to obtain a second image vector representation and a second text vector representation;
an output module 1030, configured to obtain respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection frame set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, where the category set includes rows, columns, and merging cells;
and the adjusting module 1040 is configured to adjust model parameters of the table reduction model based on the category set and the detection frame set, and continue training until a final target table reduction model is obtained.
Wherein, output module 1030 is further configured to:
selecting an anchor point vector containing position information aiming at any one of the second image vector representation and the second text vector representation;
acquiring a historical query vector represented by any vector obtained by last training of the table reduction model;
And adding the anchor point vector and the historical query vector to obtain the query vector represented by any vector.
Wherein, output module 1030 is further configured to:
and inputting any vector representation into the corresponding selector, and selecting, by the corresponding selector, a portion of the largest vector representations from that vector representation as the anchor vector of that vector representation.
Wherein, cross-modality module 1020 is further configured to:
adding the position vector representation to the first image vector representation to obtain a fused image vector representation;
adding the position vector representation to the first text vector representation to obtain a fused text vector representation;
and inputting the fused image vector representation and the fused text vector representation into the encoder of the table restoration model to perform cross-modal multi-layer self-attention, obtaining the second image vector representation and the second text vector representation.
Wherein, output module 1030 is further configured to:
adding the second image vector representation and the query vector of the second image vector representation to obtain a third image vector representation;
adding the second text vector representation and the query vector of the second text vector representation to obtain a third text vector representation;
And inputting the third image vector representation and the third text vector representation into a decoder of the table restoration model for multi-layer attention, and outputting the category set and the detection frame set of the table image.
Wherein, the obtaining module 1010 is further configured to:
cutting the table image to obtain a plurality of image fragments;
inputting the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and carrying out feature extraction layer by layer through the plurality of self-attention layers to obtain the image features of each layer, wherein the receptive fields of the plurality of self-attention layers are different in size;
the first image vector representation is derived based on the image features output by each layer.
Wherein the apparatus 1000 is further configured to:
aiming at a self-attention layer i in a plurality of self-attention layers, acquiring an image feature i output by the self-attention layer i, and combining adjacent image features i to obtain a plurality of image feature groups;
and inputting the image feature group into a self-attention layer i+1 for feature extraction, wherein i is an integer greater than or equal to 1.
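The grouping of adjacent image features between self-attention layers resembles Swin-style patch merging; below is a 1-D numpy sketch under that assumption (the real model groups features spatially in 2-D):

```python
import numpy as np

def merge_adjacent(features):
    """Sketch of the step above: pair neighbouring feature vectors output by
    self-attention layer i and concatenate each pair, halving the sequence
    length before it is fed to layer i+1 (a 1-D analogue of patch merging)."""
    n, d = features.shape
    assert n % 2 == 0
    return features.reshape(n // 2, 2 * d)

feats = np.arange(8.0).reshape(4, 2)  # four 2-d feature vectors
merged = merge_adjacent(feats)
```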
Wherein, the obtaining module 1010 is further configured to:
Performing optical character recognition OCR on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, inputting the text into a word segmentation device for segmentation, inquiring a pre-trained text representation dictionary for segmented character token, and obtaining a vector representation corresponding to the text token;
obtaining a first text vector representation of any text according to the vector representation of the text token;
and querying a pre-trained two-dimensional position information representation dictionary for any position, and obtaining a position vector representation of the position information.
Wherein, adjustment module 1040 is further configured to:
performing Hungarian optimal matching between the predicted categories in the category set and the predicted detection frames in the detection frame set to obtain a matching result between the predicted detection frames and the predicted categories;
obtaining a category loss function and a position loss function of the table restoration model according to the matching result and the label result of the table image,
obtaining a loss function of the table reduction model according to the category loss function and the position loss function;
And based on the loss function, adaptively optimizing model parameters of the table restoration model.
According to the training device of the image-based table restoration model of the embodiment of the present disclosure, a first image vector representation of the table image, a first text vector representation of the text in the table image, and a position vector representation corresponding to the text are obtained; cross-modal attention is performed on the first image vector representation, the first text vector representation and the position vector representation by the table restoration model to obtain a second image vector representation and a second text vector representation; the respective query vectors of the second image vector representation and the second text vector representation are obtained, and a category set and a detection frame set of the table image are output based on the query vectors, the second image vector representation and the second text vector representation, where the category set includes rows, columns and merging cells; and the model parameters of the table restoration model are adjusted based on the category set and the detection frame set, and training is continued until the final target table restoration model is obtained. Therefore, the present disclosure obtains a converged target table restoration model, can restore tables based on the target table restoration model, and improves the accuracy of table restoration.
Corresponding to the image-based form restoration method provided in the above embodiments, an embodiment of the present disclosure further provides an image-based form restoration device, and since the image-based form restoration device provided in the embodiment of the present disclosure corresponds to the image-based form restoration method provided in the above embodiments, implementation of the image-based form restoration method is also applicable to the image-based form restoration device provided in the embodiment, and will not be described in detail in the embodiment.
Fig. 12 is a schematic structural diagram of an image-based table restoring apparatus according to an embodiment of the present disclosure.
As shown in fig. 12, the image-based table restoring apparatus 1200 includes: a vector representation module 1210, an input module 1220, and a reduction module 1230. Wherein:
a vector representation module 1210, configured to obtain a target table image to be identified, and obtain an image vector representation of the target table image, a text vector representation of a text in the target table image, and a position vector representation corresponding to the text;
an input module 1220, configured to input the image vector representation, the text vector representation, and the position vector representation into a target table restoration model, so as to output a recognition result corresponding to the target table image, where the recognition result includes a detection frame and a type of each detection frame;
A restoring module 1230, configured to perform a table restoring process according to the recognition result and the location information of the text, to obtain a target restoring table of the target table image, where the target table restoring model is a model obtained by using the training method according to any one of claims 1-9.
Wherein, the reduction module 1230 is further configured to:
sequencing and crossing according to a first detection frame of a row type and a second detection frame of a column type in the identification result to obtain all candidate cells of the table image;
determining a first candidate cell belonging to the third detection frame in the candidate cells according to a third detection frame of the merging cell type in the identification result, and carrying out cell merging on the first candidate cell to obtain the merging cell;
obtaining a form to be filled based on the merging cell and the remaining second candidate cells in the candidate cells;
and filling text into the to-be-filled form according to the position information of the text to obtain a target reduction form of the target form image.
According to the image-based form restoration device of the embodiment of the disclosure, by acquiring a target form image to be identified, acquiring an image vector representation of the target form image and a text vector representation of a text in the target form image and a position vector representation corresponding to the text, inputting the image vector representation, the text vector representation and the position vector representation into a target form restoration model to output an identification result corresponding to the target form image, wherein the identification result comprises a detection frame and the type of each detection frame, and performing form restoration processing according to the identification result and the position information of the text to obtain a target restoration form of the target form image. According to the method and the device for restoring the target table image, the table structure of the target table image to be identified is restored through the target table restoring model, accuracy of restoring the table is guaranteed, and therefore the utilization rate of the target table image to be identified is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 13 illustrates a schematic block diagram of an example electronic device 1300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the apparatus 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data required for the operation of the device 1300 can also be stored. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
Various components in device 1300 are connected to I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, or the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, etc.; and a communication unit 1309 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 1301 performs the respective methods and processes described above, such as a training method of an image-based form restoration model or an image-based form restoration method. For example, in some embodiments, the training method of the image-based form restoration model or the image-based form restoration method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the above-described training method of the image-based form restoration model or the image-based form restoration method may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the training method of the image-based form restoration model or the image-based form restoration method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the training method of an image-based form restoration model or the image-based form restoration method as described above.
It should be appreciated that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (22)

1. A method of training an image-based table restoration model, wherein the method comprises:
acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
performing, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation to obtain a second image vector representation and a second text vector representation;
acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, wherein the category set comprises rows, columns, and merged cells; and
adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained;
wherein the performing, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation further comprises:
adding the position vector representation to the first image vector representation to obtain a fused image vector representation;
adding the position vector representation to the first text vector representation to obtain a fused text vector representation; and
inputting the fused image vector representation and the fused text vector representation into an encoder of the table restoration model for cross-modal multi-layer self-attention to obtain the second image vector representation and the second text vector representation.
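The fuse-then-encode step of claim 1 can be sketched in Python with NumPy (all function and variable names here are hypothetical, and a single-head, single-layer self-attention pass stands in for the claimed multi-layer encoder):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_encode(img_vecs, txt_vecs, pos_vecs, w_q, w_k, w_v):
    # Fuse the position representation into each modality by addition.
    fused_img = img_vecs + pos_vecs
    fused_txt = txt_vecs + pos_vecs
    # Concatenate both modalities so one self-attention pass attends
    # across image and text tokens together (cross-modal attention).
    tokens = np.concatenate([fused_img, fused_txt], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    n = img_vecs.shape[0]
    # Split back into the "second" image and text representations.
    return out[:n], out[n:]
```

A real model would stack several such layers with multiple heads; the sketch only shows the additive fusion and the shared attention over both modalities.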
2. The method of claim 1, wherein the acquiring respective query vectors of the second image vector representation and the second text vector representation comprises:
for any one of the second image vector representation and the second text vector representation, selecting an anchor vector containing position information;
acquiring a historical query vector of the vector representation obtained in the previous training round of the table restoration model; and
adding the anchor vector and the historical query vector to obtain the query vector of the vector representation.
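Claim 2's anchor-plus-history construction can be sketched as follows (names are hypothetical; the anchor-selection rule here, keeping the rows with the largest activations, is only one plausible reading of claim 3's selector):

```python
import numpy as np

def query_vectors(vec_repr, history_query, num_anchors=2):
    """Anchor vectors plus the query vectors remembered from the
    previous training round, per claim 2 (simplified)."""
    # One reading of claim 3: keep the rows whose activations are
    # largest, on the assumption they carry the strongest position signal.
    scores = np.linalg.norm(vec_repr, axis=-1)
    top = np.sort(np.argsort(scores)[::-1][:num_anchors])
    anchors = vec_repr[top]
    # Claim 2: add the anchor to the historical query vector.
    return anchors + history_query
```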
3. The method of claim 2, wherein the selecting the anchor vector of the vector representation comprises:
inputting the vector representation into a corresponding selector, the corresponding selector selecting, from the vector representation, the largest-scoring portion as the anchor vector of the vector representation.
4. The method of any one of claims 1-3, wherein the outputting the category set and the detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation comprises:
adding the second image vector representation and its query vector to obtain a third image vector representation;
adding the second text vector representation and its query vector to obtain a third text vector representation; and
inputting the third image vector representation and the third text vector representation into an encoder of the table restoration model for multi-layer self-attention, and outputting the category set and the detection box set of the table image.
5. The method of any one of claims 1-3, wherein the acquiring the first image vector representation of the table image comprises:
slicing the table image to obtain a plurality of image slices;
inputting the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and the plurality of self-attention layers perform feature extraction layer by layer to obtain image features of each layer, the perceptive areas of the plurality of self-attention layers being different in size; and
obtaining the first image vector representation based on the image features output by each layer.
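The slicing step of claim 5 can be sketched for a single-channel image (the function name and patch size are hypothetical):

```python
import numpy as np

def slice_into_patches(image, patch):
    """Cut an H x W image into non-overlapping patch x patch slices and
    flatten each slice into one vector (claim 5's slicing step)."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Reshape into (row-block, row-in-block, col-block, col-in-block),
    # bring the block axes together, then flatten each block.
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))
```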
6. The method of claim 5, wherein the method further comprises:
for a self-attention layer i of the plurality of self-attention layers, acquiring image features i output by the self-attention layer i, and combining adjacent image features i to obtain a plurality of image feature groups; and
inputting the image feature groups into a self-attention layer i+1, where i is an integer greater than or equal to 1.
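One plausible reading of claim 6's "combining adjacent image features" is a 2x2 patch-merging step between layers, which halves the feature grid so the next layer sees a larger perceptive area (the function name is hypothetical):

```python
import numpy as np

def merge_adjacent(features):
    """Merge each 2x2 neighbourhood of per-patch features into one longer
    vector, producing the image feature groups fed to layer i+1."""
    g, _, c = features.shape            # square grid of per-patch features
    assert g % 2 == 0, "grid side must be even"
    # Stack the four neighbours of every 2x2 block along the channel axis.
    return np.concatenate([features[0::2, 0::2], features[1::2, 0::2],
                           features[0::2, 1::2], features[1::2, 1::2]],
                          axis=-1)
```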
7. The method of any one of claims 1-3, wherein acquiring the first text vector representation and the position vector representation of the table image comprises:
performing optical character recognition (OCR) on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, inputting the text into a tokenizer for tokenization, and querying a pre-trained text representation dictionary for each resulting token to obtain a vector representation corresponding to the token;
obtaining the first text vector representation of the text according to the vector representations of its tokens; and
for any piece of position information, querying a pre-trained two-dimensional position representation dictionary to obtain the position vector representation of the position information.
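The text branch of claim 7 can be sketched with a toy tokenizer and dictionary (all names are hypothetical; mean-pooling the token vectors is an assumption, since the claim does not fix the pooling rule):

```python
def embed_text(text, tokenizer, token_dict):
    """Tokenize, look each token up in a pre-trained representation
    dictionary, then pool into one first text vector representation."""
    tokens = tokenizer(text)
    vecs = [token_dict[t] for t in tokens if t in token_dict]
    dim = len(next(iter(token_dict.values())))
    if not vecs:                       # no known tokens -> zero vector
        return [0.0] * dim
    # Mean-pool the per-token vectors into a single text vector.
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```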
8. The method of any one of claims 1-3, wherein the adjusting model parameters of the table restoration model based on the category set and the detection box set comprises:
performing Hungarian optimal matching between the predicted categories in the category set and the predicted detection boxes in the detection box set to obtain a matching result between the predicted detection boxes and the predicted categories;
obtaining a category loss function and a position loss function of the table restoration model according to the matching result and the labeling result of the table image;
obtaining a loss function of the table restoration model according to the category loss function and the position loss function; and
adaptively optimizing the model parameters of the table restoration model based on the loss function.
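Claim 8's matching-then-loss scheme can be sketched as follows (names and the exact cost terms are hypothetical; the claim only fixes that a category term and a position term are combined after Hungarian matching):

```python
from itertools import permutations

def hungarian_match(cost):
    """Optimal one-to-one assignment by exhaustion; fine for a sketch,
    while a real implementation would use
    scipy.optimize.linear_sum_assignment."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))

def matching_loss(pred_cls, pred_box, true_cls, true_box):
    """Cost = 0/1 category mismatch + L1 box distance, matched
    Hungarian-style, then summed into one scalar loss."""
    cost = [[(pc != tc) + sum(abs(a - b) for a, b in zip(pb, tb))
             for tc, tb in zip(true_cls, true_box)]
            for pc, pb in zip(pred_cls, pred_box)]
    match = hungarian_match(cost)
    return sum(cost[i][j] for i, j in enumerate(match)), match
```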
9. An image-based table restoration method, wherein the method comprises:
acquiring a target table image to be recognized, and acquiring an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
inputting the image vector representation, the text vector representation, and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and a type of each detection box; and
performing table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is obtained by the training method according to any one of claims 1-8.
10. The method of claim 9, wherein the performing table restoration processing according to the recognition result and the position information of the text to obtain the target restored table of the target table image comprises:
sorting and intersecting first detection boxes of the row type and second detection boxes of the column type in the recognition result to obtain all candidate cells of the table image;
determining, according to a third detection box of the merged-cell type in the recognition result, first candidate cells belonging to the third detection box among the candidate cells, and merging the first candidate cells to obtain a merged cell;
obtaining a table to be filled based on the merged cell and the remaining second candidate cells among the candidate cells; and
filling text into the table to be filled according to the position information of the text to obtain the target restored table of the target table image.
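Claim 10's grid reconstruction can be sketched without the merged-cell step (all names are hypothetical, and boxes are assumed to be `(x0, y0, x1, y1)` with each text carrying a centre point):

```python
def restore_table(row_boxes, col_boxes, texts):
    """Sort the row and column boxes, intersect them into a candidate-cell
    grid, then drop each text into the cell whose row span and column span
    both contain its centre point."""
    rows = sorted(row_boxes, key=lambda b: b[1])    # sort rows by top edge
    cols = sorted(col_boxes, key=lambda b: b[0])    # sort columns by left edge
    grid = [["" for _ in cols] for _ in rows]
    for text, (cx, cy) in texts:
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                if r[1] <= cy <= r[3] and c[0] <= cx <= c[2]:
                    grid[i][j] = text
    return grid
```

A full implementation would additionally collapse the candidate cells covered by each merged-cell box before filling in the text.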
11. An image-based table restoration model training apparatus, wherein the apparatus comprises:
an acquisition module configured to acquire a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
a cross-modal module configured to perform, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation to obtain a second image vector representation and a second text vector representation;
an output module configured to acquire respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, wherein the category set comprises rows, columns, and merged cells; and
an adjustment module configured to adjust model parameters of the table restoration model based on the category set and the detection box set, and continue training until a final target table restoration model is obtained;
wherein the cross-modal module is further configured to:
add the position vector representation to the first image vector representation to obtain a fused image vector representation;
add the position vector representation to the first text vector representation to obtain a fused text vector representation; and
input the fused image vector representation and the fused text vector representation into an encoder of the table restoration model for cross-modal multi-layer self-attention to obtain the second image vector representation and the second text vector representation.
12. The apparatus of claim 11, wherein the output module is further configured to:
select, for any one of the second image vector representation and the second text vector representation, an anchor vector containing position information;
acquire a historical query vector of the vector representation obtained in the previous training round of the table restoration model; and
add the anchor vector and the historical query vector to obtain the query vector of the vector representation.
13. The apparatus of claim 12, wherein the output module is further configured to:
input the vector representation into a corresponding selector, the corresponding selector selecting, from the vector representation, the largest-scoring portion as the anchor vector of the vector representation.
14. The apparatus of any one of claims 11-13, wherein the output module is further configured to:
add the second image vector representation and its query vector to obtain a third image vector representation;
add the second text vector representation and its query vector to obtain a third text vector representation; and
input the third image vector representation and the third text vector representation into an encoder of the table restoration model for multi-layer self-attention, and output the category set and the detection box set of the table image.
15. The apparatus of any one of claims 11-13, wherein the acquisition module is further configured to:
slice the table image to obtain a plurality of image slices;
input the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and the plurality of self-attention layers perform feature extraction layer by layer to obtain image features of each layer, the perceptive areas of the plurality of self-attention layers being different in size; and
obtain the first image vector representation based on the image features output by each layer.
16. The apparatus of claim 15, wherein the apparatus is further configured to:
for a self-attention layer i of the plurality of self-attention layers, acquire image features i output by the self-attention layer i, and combine adjacent image features i to obtain a plurality of image feature groups; and
input the image feature groups into a self-attention layer i+1, where i is an integer greater than or equal to 1.
17. The apparatus of any one of claims 11-13, wherein the acquisition module is further configured to:
perform optical character recognition (OCR) on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, input the text into a tokenizer for tokenization, and query a pre-trained text representation dictionary for each resulting token to obtain a vector representation corresponding to the token;
obtain the first text vector representation of the text according to the vector representations of its tokens; and
for any piece of position information, query a pre-trained two-dimensional position representation dictionary to obtain the position vector representation of the position information.
18. The apparatus of any one of claims 11-13, wherein the adjustment module is further configured to:
perform Hungarian optimal matching between the predicted categories in the category set and the predicted detection boxes in the detection box set to obtain a matching result between the predicted detection boxes and the predicted categories;
obtain a category loss function and a position loss function of the table restoration model according to the matching result and the labeling result of the table image;
obtain a loss function of the table restoration model according to the category loss function and the position loss function; and
adaptively optimize the model parameters of the table restoration model based on the loss function.
19. An image-based table restoration apparatus, wherein the apparatus comprises:
a vector representation module configured to acquire a target table image to be recognized, and acquire an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
an input module configured to input the image vector representation, the text vector representation, and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and a type of each detection box; and
a restoration module configured to perform table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is obtained by the training method according to any one of claims 1-8.
20. The apparatus of claim 19, wherein the restoration module is further configured to:
sort and intersect first detection boxes of the row type and second detection boxes of the column type in the recognition result to obtain all candidate cells of the table image;
determine, according to a third detection box of the merged-cell type in the recognition result, first candidate cells belonging to the third detection box among the candidate cells, and merge the first candidate cells to obtain a merged cell;
obtain a table to be filled based on the merged cell and the remaining second candidate cells among the candidate cells; and
fill text into the table to be filled according to the position information of the text to obtain the target restored table of the target table image.
21. An electronic device, comprising a processor and a memory;
wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code to implement the method according to any one of claims 1-8 or 9-10.
22. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-8 or 9-10.
CN202211735420.5A 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method Active CN116152833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211735420.5A CN116152833B (en) 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method


Publications (2)

Publication Number Publication Date
CN116152833A CN116152833A (en) 2023-05-23
CN116152833B true CN116152833B (en) 2023-11-24

Family

ID=86372897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211735420.5A Active CN116152833B (en) 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method

Country Status (1)

Country Link
CN (1) CN116152833B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452707B (en) * 2023-06-20 2023-09-12 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device
CN117475458A (en) * 2023-12-28 2024-01-30 深圳智能思创科技有限公司 Table structure restoration method, apparatus, device and storage medium

Citations (11)

Publication number Priority date Publication date Assignee Title
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN112800848A (en) * 2020-12-31 2021-05-14 中电金信软件有限公司 Structured extraction method, device and equipment of information after bill identification
CN113033534A (en) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Method and device for establishing bill type identification model and identifying bill type
CN113239818A (en) * 2021-05-18 2021-08-10 上海交通大学 Cross-modal information extraction method of tabular image based on segmentation and graph convolution neural network
CN113723094A (en) * 2021-09-03 2021-11-30 北京有竹居网络技术有限公司 Text processing method, model training method, device and storage medium
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114241499A (en) * 2021-12-17 2022-03-25 深圳壹账通智能科技有限公司 Table picture identification method, device and equipment and readable storage medium
CN114463768A (en) * 2022-02-11 2022-05-10 北京有竹居网络技术有限公司 Form recognition method and device, readable medium and electronic equipment
CN114821255A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for fusion of multimodal features
CN115205884A (en) * 2022-07-26 2022-10-18 广州欢聚时代信息科技有限公司 Bill information extraction method and device, equipment, medium and product thereof
CN115273112A (en) * 2022-07-29 2022-11-01 北京金山数字娱乐科技有限公司 Table identification method and device, electronic equipment and readable storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20110258195A1 (en) * 2010-01-15 2011-10-20 Girish Welling Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements
US11847806B2 (en) * 2021-01-20 2023-12-19 Dell Products, L.P. Information extraction from images using neural network techniques and anchor words


Non-Patent Citations (2)

Title
SAM: Self Attention Mechanism for Scene Text Recognition Based on Swin Transformer; Xiang Shuai et al.; International Conference on Multimedia Modeling; pp. 443-454. *
Table structure recognition model fusing edge features and attention; Lyu Xueqiang et al.; Journal of Computer Applications (《计算机应用》); pp. 1-10. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant