CN116152833B - Training method of form restoration model based on image and form restoration method - Google Patents


Info

Publication number
CN116152833B
Authority
CN
China
Prior art keywords
image
vector representation
text
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211735420.5A
Other languages
Chinese (zh)
Other versions
CN116152833A (en)
Inventor
李晨辉
柯博
胡腾
冯仕堃
陈永锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211735420.5A priority Critical patent/CN116152833B/en
Publication of CN116152833A publication Critical patent/CN116152833A/en
Application granted granted Critical
Publication of CN116152833B publication Critical patent/CN116152833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables

Abstract

The present disclosure provides a training method for an image-based table restoration model and a table restoration method, relating to the field of artificial intelligence, and in particular to image processing, deep learning and natural language processing technologies. The scheme is specifically implemented as follows: acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text; performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to acquire a second image vector representation and a second text vector representation, and outputting a category set and a detection box set for the table image; and adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained. The disclosure thereby acquires a converged target table restoration model, can restore tables based on that model, and improves the accuracy of table restoration.

Description

Training method of form restoration model based on image and form restoration method
Technical Field
The present disclosure relates to the field of computer technology, and more particularly to the field of artificial intelligence, and in particular to image processing, machine learning, deep learning, and natural language processing techniques.
Background
In the related art, table restoration is generally performed in one of three ways: a traditional image-analysis method, which divides rows, columns and merged cells using the original frame lines of the table in the picture; a single-modality image-recognition method, which detects row regions from the picture; or an artificial-intelligence method, which first recognizes which text fragments in the picture belong to the same cell and then combines the text information of each cell with the image information to judge whether cells belong to the same row or column. How to train a converged table restoration model and use it to restore tables efficiently and accurately has therefore become an important research direction.
Disclosure of Invention
The disclosure provides a training method for an image-based table restoration model and an image-based table restoration method.
According to an aspect of the present disclosure, there is provided a training method for an image-based table restoration model, including: acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text; performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation; acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells; and adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained.
According to another aspect of the present disclosure, there is provided an image-based table restoration method, including:
acquiring a target table image to be recognized, and acquiring an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text; inputting the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and the type of each detection box; and performing table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is a model obtained by the training method according to any one of claims 1-9.
According to another aspect of the present disclosure, there is provided a training apparatus for an image-based table restoration model, including:
an acquisition module, configured to acquire a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
a cross-modal module, configured to perform cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation;
an output module, configured to acquire respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells;
and an adjusting module, configured to adjust model parameters of the table restoration model based on the category set and the detection box set, and continue training until a final target table restoration model is obtained.
According to another aspect of the present disclosure, there is provided an image-based form restoration apparatus including:
a vector representation module, configured to acquire a target table image to be recognized, and acquire an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
an input module, configured to input the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and the type of each detection box;
and a restoration module, configured to perform table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is a model obtained by the training method according to any one of claims 1-9.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the image-based table restoration model of the first aspect of the present disclosure or the image-based table restoration method of the second aspect.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the image-based table restoration model according to the first aspect of the present disclosure or the image-based table restoration method according to the second aspect.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the training method of the image-based table restoration model according to the first aspect of the present disclosure or the image-based table restoration method according to the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a segmentation of a table image;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a form reduction model;
FIG. 9 is a schematic diagram of an image-based table restoration method;
FIG. 10 (a) is a schematic illustration of a target form image to be identified;
FIG. 10 (b) is a schematic diagram of another target table image restoration result;
FIG. 10 (c) is a schematic diagram of another target form image restoration result;
FIG. 10 (d) is a schematic diagram of another target table image restoration result;
FIG. 10 (e) is a schematic diagram of another target form image restoration result;
FIG. 10 (f) is a schematic diagram of a target reduction table;
FIG. 11 is a block diagram of an image-based form restoration model training apparatus for implementing a method of training an image-based form restoration model according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an image-based form restoration device for implementing an image-based form restoration method of an embodiment of the present disclosure;
Fig. 13 is a block diagram of an electronic device for implementing the training method of the image-based form restoration model and the image-based form restoration method of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The technical field to which the aspects of the present disclosure relate is briefly described below:
Computer technology broadly covers computer system technology, computer device technology, computer component technology and computer assembly technology. It comprises: the basic principles of arithmetic methods and the design of arithmetic units, instruction systems, the central processing unit (CPU), the pipeline principle and its application in CPU design, storage systems, buses, and input/output.
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), covering both hardware-level and software-level technologies. Artificial-intelligence software technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big-data processing and knowledge-graph technologies.
Image processing technology is the technology of processing image information with a computer. It mainly comprises image digitization, image enhancement and restoration, image data encoding, image segmentation and image recognition.
Machine Learning (ML) studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance; it is the fundamental way to make computers intelligent.
Deep Learning (DL) is a new research direction in the field of machine learning, introduced to bring machine learning closer to its original goal: artificial intelligence. Deep learning learns the inherent regularities and representation levels of sample data, and the information obtained in the process greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to give machines human-like analytical learning ability, enabling them to recognize text, image and sound data. Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition far surpassing earlier techniques.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective natural-language communication between humans and computers, integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to, yet importantly different from, linguistic research: rather than studying natural language in general, it develops computer systems, and in particular software systems, that can effectively realize natural-language communication. It is thus a part of computer science, and is mainly applied to machine translation, public opinion monitoring, automatic summarization, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR and the like.
A training method of an image-based form restoration model and an image-based form restoration method according to an embodiment of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The execution subject of the training method of the image-based table restoration model of this embodiment is an image-based table restoration model training apparatus, which may specifically be a hardware device, or software in a hardware device, the hardware device being, for example, a terminal device or a server.
As shown in fig. 1, the training method of the image-based table restoration model provided in the embodiment includes the following steps:
s101, acquiring a first image vector representation of a table image, a first text vector representation of a text in the table image and a position vector representation corresponding to the text.
It should be noted that the present disclosure is not limited to a specific manner of acquiring the first image vector representation of the tabular image.
Alternatively, the table image may be segmented to obtain a plurality of image slices, the image slices may then be input into a feature extraction network to obtain multi-scale image information, and the multi-scale image information may be concatenated to form the first image vector representation.
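As a minimal illustration of this slicing-and-concatenation step (not the patent's actual network), the following Python sketch splits a square image into patches at three scales, projects each patch with a random matrix standing in for the feature extraction network, and stacks the per-scale features into one token sequence:

```python
import numpy as np

def image_to_first_vector(image, patch_sizes=(16, 8, 4), dim=32, seed=0):
    """Split a square table image into patches at several scales, project each
    patch to a dim-dimensional feature, and concatenate the per-scale features
    into one token sequence (standing in for the first image vector representation)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    tokens = []
    for p in patch_sizes:
        # hypothetical per-scale projection; a real model uses a trained network
        proj = rng.standard_normal((p * p, dim))
        for i in range(0, h, p):
            for j in range(0, w, p):
                tokens.append(image[i:i + p, j:j + p].reshape(-1) @ proj)
    return np.stack(tokens)
```

For a 16×16 image this yields 1 + 4 + 16 = 21 tokens, one per patch across the three scales.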
Note that, the specific manner of acquiring the first text vector representation of the text and the position vector representation corresponding to the text in the form image is not limited in this disclosure.
Optionally, the table image may be input into an optical character recognition model to obtain the text and the position information corresponding to the text; any text may then be input into a tokenizer for segmentation, a pre-trained text representation dictionary queried to obtain the first text vector representation, and a pre-trained two-dimensional position information representation dictionary queried to obtain the position vector representation corresponding to the text.
S102, performing cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation with the table restoration model to obtain a second image vector representation and a second text vector representation.
Alternatively, after the first image vector representation, the first text vector representation and the position vector representation are obtained, the position vector representation may be added to the first image vector representation to obtain a fused image vector representation, and the position vector representation may be added to the first text vector representation to obtain a fused text vector representation.
Further, the fused image vector representation and the fused text vector representation may be input into an encoder of the table restoration model for cross-modal multi-layer self-attention, thereby obtaining the second image vector representation and the second text vector representation.
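A minimal sketch of this fusion step, assuming single-head scaled dot-product attention in place of the model's actual multi-layer encoder:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_layer(img_vecs, txt_vecs, img_pos, txt_pos):
    """One cross-modal self-attention step: add the position vector
    representations to each modality, concatenate both token sequences,
    let every token attend over both modalities, and split the result back
    into the second image / second text vector representations."""
    fused = np.concatenate([img_vecs + img_pos, txt_vecs + txt_pos], axis=0)
    attn = softmax(fused @ fused.T / np.sqrt(fused.shape[1]))
    out = attn @ fused
    return out[:len(img_vecs)], out[len(img_vecs):]
```

Because both modalities sit in one attention matrix, each image token can attend to every text token and vice versa, which is what makes the layer cross-modal.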
S103, acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merged cells.
Optionally, a selector may select K anchor vectors containing position information for each representation, and the historical query vectors of that vector representation obtained in the last training round of the table restoration model may be added to the anchor vectors to obtain the respective query vectors of the second image vector representation and the second text vector representation.
When training for the first time, K zero-initialized learnable query vectors (queries) may be added to the anchor vectors to obtain the respective query vectors of the second image vector representation and the second text vector representation.
It should be noted that the query vectors of the second image vector representation together with the second image vector representation, and the query vectors of the second text vector representation together with the second text vector representation, may be input into the decoder of the table restoration model for multi-layer self-attention, and the decoder of the table restoration model outputs the category set and the detection box set of the table image.
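The decoder's two outputs can be illustrated with hypothetical linear heads (the weight matrices, category ordering, and sigmoid box normalization are assumptions, not taken from the patent):

```python
import numpy as np

CATEGORIES = ["row", "column", "merged_cell"]

def decode_outputs(decoded_queries, w_cls, w_box):
    """Two hypothetical linear output heads on the decoded query vectors:
    one predicts a category among {row, column, merged cell}, the other a
    normalized detection box [x, y, w, h] in (0, 1)."""
    logits = decoded_queries @ w_cls                          # (K, 3)
    categories = [CATEGORIES[i] for i in logits.argmax(axis=1)]
    boxes = 1.0 / (1.0 + np.exp(-(decoded_queries @ w_box)))  # (K, 4)
    return categories, boxes
```

Each of the K queries thus yields one (category, box) pair, forming the category set and detection box set of the table image.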
S104, adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained.
After the category set and the detection box set are acquired, the loss function of the table restoration model can be computed, the model parameters of the table restoration model adjusted according to the loss function, and training continued until the final target table restoration model is obtained.
Alternatively, the loss function of the table restoration model may consist of a category loss function and a position loss function.
It should be noted that, the setting of the training stop condition is not limited in this disclosure, and may be selected according to actual situations.
For example, the training stop condition may be that the loss function of the table restoration model reaches a preset loss-function threshold, or that the number of adjustments of the model parameters of the table restoration model reaches a preset count threshold.
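The two example stop conditions reduce to a simple check; the threshold values below are illustrative only:

```python
def should_stop(loss, num_updates, loss_threshold=0.01, max_updates=10_000):
    """Training stops once the loss reaches a preset threshold or the number
    of parameter adjustments reaches a preset limit (illustrative values)."""
    return loss <= loss_threshold or num_updates >= max_updates
```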
After the training stop condition is satisfied, the final target table restoration model is obtained.
With the training method of the image-based table restoration model according to embodiments of the present disclosure, a first image vector representation of a table image, a first text vector representation of text in the table image and a position vector representation corresponding to the text are acquired; the table restoration model performs cross-modal attention on the first image vector representation, the first text vector representation and the position vector representation to obtain a second image vector representation and a second text vector representation; the respective query vectors of the second image vector representation and the second text vector representation are acquired, and based on the query vectors, the second image vector representation and the second text vector representation, a category set and a detection box set of the table image are output, the category set comprising rows, columns and merged cells; the model parameters of the table restoration model are adjusted based on the category set and the detection box set, and training continues until a final target table restoration model is obtained. A converged target table restoration model is thereby obtained, tables can be restored based on the target table restoration model, and the accuracy of table restoration is improved.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 2, the training method of the image-based table restoration model provided in the embodiment includes the following steps:
s201, acquiring a first image vector representation of the form image, a first text vector representation of a text in the form image and a position vector representation corresponding to the text.
As a possible implementation manner, as shown in fig. 3, on the basis of the foregoing embodiment, the specific process of acquiring the first image vector representation of the table image in the foregoing step S201 includes the following steps:
s301, cutting the table image to obtain a plurality of image slices.
For example, as shown in fig. 4, a given table image may be split into 16×16 image slices, 4×4 image slices and 8×8 image slices.
S302, inputting the plurality of image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and performing feature extraction layer by layer with the plurality of self-attention layers to obtain the image features of each layer, wherein the receptive fields of the plurality of self-attention layers differ in size.
Optionally, for a self-attention layer i among the plurality of self-attention layers, the image features i output by self-attention layer i are acquired, adjacent image features i are combined to obtain a plurality of image feature groups, and the image feature groups are input into self-attention layer i+1 for feature extraction, where i is an integer greater than or equal to 1.
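A sketch of this adjacent-feature combination between layers, assuming a 2×2 grouping (the patent does not fix the group size):

```python
import numpy as np

def merge_adjacent_features(features, grid_h, grid_w):
    """Combine each 2x2 neighborhood of the feature grid into one group by
    concatenating the four features, halving the grid between self-attention
    layers so that deeper layers cover a larger area of the image."""
    f = features.reshape(grid_h, grid_w, -1)
    merged = np.concatenate(
        [f[0::2, 0::2], f[0::2, 1::2], f[1::2, 0::2], f[1::2, 1::2]], axis=-1)
    return merged.reshape(-1, merged.shape[-1])
```

A 4×4 grid of 4-dimensional features becomes a 2×2 grid of 16-dimensional groups, which layer i+1 then processes.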
S303, obtaining a first image vector representation based on the image characteristics output by each layer.
After the image features output by each layer are obtained, they may be combined to obtain the first image vector representation.
As a possible implementation manner, as shown in fig. 5, on the basis of the foregoing embodiment, the specific process of obtaining the first text vector representation and the position vector representation of the form image in the foregoing step S201 includes the following steps:
s501, performing optical character recognition OCR on the table image to acquire all texts in the table image and position information corresponding to each text.
Optical Character Recognition (OCR) refers to the process of determining the shape of characters by detecting dark and light patterns, and then translating the shapes into computer text with character recognition methods.
Alternatively, the table image may be input into an optical character recognition (OCR) model, for example the PaddleOCR model, to acquire all texts in the table image and the position information corresponding to each text.
For example, the table image is input into an OCR model to obtain the character string of each text fragment in the table image, yielding all texts and the position information corresponding to each text. The position information of a text is represented by a box [x, y, w, h]; each box is a list of 4 floating-point numbers, where x is the abscissa of the text in the coordinate system, y is the ordinate of the text in the coordinate system, w is the width of the text along the x direction, and h is the height of the text along the y direction.
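In code, the position information of one fragment is simply a four-float list (the helper name below is illustrative, not from the patent):

```python
def text_box(x, y, w, h):
    """Position information of one text fragment: a list of 4 floats, where
    (x, y) is the fragment's coordinate and w, h its width and height."""
    return [float(x), float(y), float(w), float(h)]

box = text_box(12, 40, 80, 16)  # a fragment at (12, 40), 80 wide, 16 tall
```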
S502, inputting any text into a tokenizer for segmentation, and querying a pre-trained text representation dictionary with the segmented tokens to obtain the vector representation corresponding to each text token.
Optionally, any text may be input into a tokenizer for segmentation, the segmented tokens looked up in a pre-trained text representation (text embedding) dictionary, and each text token converted into a vectorized representation, thereby obtaining the vector representation corresponding to the text token.
All text tokens of the same fragment use the same position information.
S503, according to the vector representation of the text token, obtaining a first text vector representation of any text.
In the embodiment of the application, after the vector representation of the text token is obtained, the first text vector representation of any text can be obtained according to the vector representation of the text token.
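A toy sketch of steps S502-S503, with whitespace splitting standing in for the real tokenizer and a tiny hand-built embedding dictionary (both are assumptions for illustration):

```python
import numpy as np

def first_text_vector(text, vocab, embedding_table):
    """Tokenize a text fragment (whitespace split stands in for the real
    tokenizer), look each token up in the pretrained text representation
    dictionary, and stack the token vectors into the fragment's first text
    vector representation."""
    token_ids = [vocab[tok] for tok in text.split()]
    return embedding_table[token_ids]

vocab = {"total": 0, "amount": 1}      # hypothetical tiny vocabulary
table = np.arange(8.0).reshape(2, 4)   # hypothetical embedding dictionary
vec = first_text_vector("total amount", vocab, table)
```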
S504, querying a pre-trained two-dimensional position information representation dictionary for any position to obtain a position vector representation of the position information.
Alternatively, a position vector representation of the position information may be obtained by querying a pre-trained two-dimensional position information representation dictionary (2D position embedding) for any position.
S202, adding the position vector representation and the first image vector representation to obtain a fusion image vector representation.
After the position vector representation and the first image vector representation are obtained, the position vector representation and the first image vector representation may be added in a vector addition manner to obtain a fused image vector representation.
S203, adding the position vector representation and the first text vector representation to obtain a fusion text vector representation.
After the position vector representation and the first text vector representation are obtained, the position vector representation and the first text vector representation may be added in a vector addition manner to obtain a fused text vector representation.
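The two fusion steps above are plain element-wise vector addition; a minimal sketch with made-up three-dimensional vectors:

```python
import numpy as np

# S202/S203 in one line each: the position vector representation is fused
# with the image and text representations by element-wise addition
# (all three vectors must have the same dimensionality).
position_vec = np.array([0.1, 0.2, 0.3])
image_vec = np.array([1.0, 1.0, 1.0])
text_vec = np.array([2.0, 2.0, 2.0])

fused_image = image_vec + position_vec  # fused image vector representation
fused_text = text_vec + position_vec    # fused text vector representation
```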
S204, inputting the fusion image vector representation and the fusion text vector representation into an encoder of the table restoration model to perform cross-modal multi-layer self-attention, obtaining a second image vector representation and a second text vector representation.
S205, acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection frame set of the table image based on the query vectors, the second image vector representation and the second text vector representation, wherein the category set comprises rows, columns and merging cells.
As a possible implementation manner, as shown in fig. 6, on the basis of the foregoing embodiment, a specific process of obtaining respective query vectors of the second image vector representation and the second text vector representation in the foregoing step S205 includes the following steps:
S601, selecting an anchor point vector containing position information for any one of the second image vector representation and the second text vector representation.
Alternatively, any vector representation may be input into a corresponding selector, and the corresponding selector selects, from that vector representation, a portion of the largest vector representations as the anchor vectors of that vector representation.
S602, acquiring a historical query vector represented by any vector obtained by last training of a table restoration model.
S603, adding the anchor point vector and the historical query vector to obtain a query vector represented by any vector.
Alternatively, the anchor vector and the historical query vector may be added to obtain a query vector (query) represented by either vector.
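One plausible reading of S601-S603 can be sketched as below, assuming "a portion of the largest vector representations" means keeping the k vectors with the largest norm; that selection criterion is an assumption, since the patent does not define it precisely.

```python
import numpy as np

def select_anchors(vectors, k):
    """S601 sketch: keep the k vector representations with the largest
    norm as anchors (one plausible reading of the selector's rule)."""
    norms = np.linalg.norm(vectors, axis=1)
    top = np.argsort(norms)[-k:]
    return vectors[np.sort(top)]

def make_queries(anchors, historical_queries):
    """S602-S603: query = anchor vector + query vector from the last
    training step (vector addition, shapes must match)."""
    return anchors + historical_queries

vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
anchors = select_anchors(vecs, k=2)  # the two largest-norm rows
queries = make_queries(anchors, np.zeros_like(anchors))
```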
As a possible implementation manner, as shown in fig. 7, based on the above embodiment, the specific process of outputting the category set and the detection frame set of the table image based on the query vector, the second image vector representation and the second text vector representation in the above step S205 includes the following steps:
S701, adding the second image vector representation and the query vector of the second image vector representation to obtain a third image vector representation.
Alternatively, the second image vector representation and the query vector of the second image vector representation may be added to obtain the third image vector representation.
S702, adding the second text vector representation and the query vector of the second text vector representation to obtain a third text vector representation.
Alternatively, the second text vector representation and the query vector of the second text vector representation may be added to obtain a third text vector representation.
S703, inputting the third image vector representation and the third text vector representation into a decoder of the table restoration model for multi-layer attention, and outputting the category set and the detection frame set of the table image.
It should be noted that, multi-layer Self-Attention may be performed by a Multi-Head Self-Attention mechanism (Multi-Head Self-Attention), and the category set and the detection frame set of the output table image may be decoded by a Decoder (Decoder).
Wherein the category set includes rows, columns, and merging cells.
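The multi-head self-attention mentioned above builds on scaled dot-product self-attention. Below is a single-head numpy sketch without the learned query/key/value projections of the real model (the multi-head version runs several such heads in parallel and concatenates their outputs):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over a token matrix x,
    using x itself as query, key, and value (no learned projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # attention-weighted sum

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(tokens)
```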
S206, based on the category set and the detection frame set, adjusting model parameters of the form restoration model, and continuing training until a final target form restoration model is obtained.
Optionally, Hungarian optimal matching may be performed between the predicted categories in the category set and the predicted detection frames in the detection frame set to obtain a matching result between the predicted detection frames and the predicted categories. A category loss function and a position loss function of the table restoration model are then obtained according to the matching result and the label result of the table image, a loss function of the table restoration model is obtained from the category loss function and the position loss function, and the model parameters of the table restoration model are adaptively optimized based on the loss function.
Optionally, after the category loss function and the position loss function are obtained, they may be summed to obtain the loss function of the table restoration model.
It should be noted that, after the loss function is obtained, the model parameters of the table restoration model may be adaptively optimized based on the loss function by the adaptive moment estimation (Adam) algorithm, and training may be continued until the final target table restoration model is obtained.
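The matching and loss composition described in the paragraphs above can be sketched as follows. A brute-force assignment stands in for the Hungarian algorithm (it yields the same optimum for small inputs), and the cost values and unit losses are placeholders:

```python
from itertools import permutations

def optimal_match(cost):
    """Brute-force optimal assignment (same result as the Hungarian
    algorithm for small square cost matrices): minimize the total
    prediction-to-label matching cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

def total_loss(class_loss, box_loss):
    """Loss composition sketch: the model loss is the sum of the
    category loss and the position loss."""
    return class_loss + box_loss

cost = [[0.9, 0.1], [0.2, 0.8]]  # rows: predictions, cols: labels
match = optimal_match(cost)
```

In practice `scipy.optimize.linear_sum_assignment` would replace the brute-force search for realistic numbers of detection frames.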
The table restoration model proposed by the application is explained below.
For example, as shown in fig. 8, the architecture of the table restoration model includes, from the lower layers to the upper layers, a text downsampling layer and a text vector representation layer, by which all text in the table image can be converted from natural-language words into mathematical text vector representations, and an image vector representation layer (Swin-Large), which can convert the table image from its graphical representation (e.g., three primary colors) into a mathematical image vector representation. The text and the image each have a corresponding two-dimensional position information representation, which is converted into the corresponding position vector representation by querying a pre-trained model. Before the Encoder, the position vector representations are added to the text vector representation and the image vector representation, respectively, by vector addition; the text vector representation fused with position information is then concatenated (concat) with the image vector representation as the input of the Encoder. ERNIE-Layout is a table restoration model pre-trained on a fully connected Transformer structure: in the Encoder, the text tokens and the image tokens attend to each other through a multi-head self-attention mechanism (Multi-Head Self-Attention), so that cross-modal information is fused. A Selector selects anchor vectors from the Encoder output, which are added to the historical query vectors to serve as the input queries of the Decoder; the Decoder performs cross-modal multi-layer attention between the queries and the Encoder output, and finally outputs a set of categories (rows, columns and merge cells) and a set of detection boxes (boxes).
According to the training method of the image-based table restoration model, by obtaining the image vector representation of the table image, the text vector representation, and the position vector representation corresponding to the text, the table restoration model can fully utilize cross-modal information, the accuracy and reliability of the training process of the image-based table restoration model are improved, and table restoration can subsequently be performed based on the trained target table restoration model, thereby improving the accuracy of table restoration.
Fig. 9 is a schematic diagram according to a seventh embodiment of the present disclosure. It should be noted that the execution body of the image-based form restoration method in this embodiment is an image-based form restoration device, which may specifically be a hardware device, or software in a hardware device, or the like. The hardware device may be, for example, a terminal device, a server, or the like.
As shown in fig. 9, the image-based table restoration method provided in this embodiment includes the following steps:
S901, acquiring a target table image to be identified, and acquiring an image vector representation of the target table image, a text vector representation of a text in the target table image and a position vector representation corresponding to the text.
It should be noted that the present disclosure is not limited to a specific manner of acquiring the first image vector representation of the target table image.
Alternatively, the target table image may be segmented to obtain a plurality of image slices, the image slices may then be input into the feature extraction network to obtain multi-scale image information, and the multi-scale image information may be concatenated to obtain the image vector representation.
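The segmentation step can be sketched as below. The patch size and array layout are illustrative assumptions; the sketch only shows how an image is cut into non-overlapping slices before feature extraction:

```python
import numpy as np

def slice_image(image, patch):
    """Cut an H x W x C image into non-overlapping patch x patch slices,
    as in the first step of the feature-extraction pipeline above."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)   # group by (row-block, col-block)
                 .reshape(-1, patch, patch, c))

img = np.zeros((8, 8, 3))
patches = slice_image(img, patch=4)
```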
Note that, the present disclosure is not limited to a specific manner of acquiring the text vector representation of the text and the position vector representation corresponding to the text in the target form image.
Optionally, the table image may be input into an optical character recognition model to obtain the text and the position information corresponding to the text; any text is input into a tokenizer for segmentation, a pre-trained text representation dictionary is queried to obtain the text vector representation, and a pre-trained two-dimensional position information representation dictionary is queried to obtain the position vector representation corresponding to the text.
S902, inputting the image vector representation, the text vector representation and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection frames and types of each detection frame.
In the embodiment of the disclosure, after the image vector representation, the text vector representation and the position vector representation are acquired, the image vector representation, the text vector representation and the position vector representation may be input into a target table restoration model to output a recognition result corresponding to the target table image, where the recognition result includes a detection frame and a type of each detection frame.
S903, performing table restoration processing according to the recognition result and the position information of the text to obtain a target restoration table of the target table image.
The target table reduction model is a trained convergence model.
Optionally, after the recognition result and the position information of the text are obtained, sorting and intersection may be performed according to the first detection frames of the row type and the second detection frames of the column type in the recognition result to obtain all candidate cells of the table image. According to the third detection frames of the merging-cell type in the recognition result, the first candidate cells belonging to a third detection frame are determined among the candidate cells, and the first candidate cells are merged to obtain merging cells. A table to be filled is obtained based on the merging cells and the remaining second candidate cells among the candidate cells, and the text is filled into the table to be filled according to the position information of the text, obtaining the target restoration table of the target table image.
For example, as shown in fig. 10 (a), a row type restoration result of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (b), a column type restoration result of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (c), a restoration result of the merging cell type of the target table image to be identified may be obtained by the target table restoration model, as shown in fig. 10 (d), and then the above three restoration results may be fused to obtain a complete structure of the wireless table, as shown in fig. 10 (e), and a text may be filled into the table to be filled according to the position information of the text, to obtain a target restoration table of the target table image, as shown in fig. 10 (f).
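The sorting-and-intersection step that produces the candidate cells in fig. 10 can be sketched as below. The patent does not specify the exact intersection rule, so treating each (row, column) pair as one candidate cell spanning the column's width and the row's height is an assumption:

```python
def cells_from_rows_cols(row_boxes, col_boxes):
    """Sketch of the candidate-cell step: sort the row and column detection
    boxes [x, y, w, h] and intersect them to enumerate candidate cells."""
    rows = sorted(row_boxes, key=lambda b: b[1])  # sort rows top-to-bottom by y
    cols = sorted(col_boxes, key=lambda b: b[0])  # sort columns left-to-right by x
    cells = []
    for rx, ry, rw, rh in rows:
        for cx, cy, cw, ch in cols:
            cells.append((cx, ry, cw, rh))        # column width x row height
    return cells

rows = [(0, 0, 100, 10), (0, 10, 100, 10)]
cols = [(0, 0, 40, 20), (40, 0, 60, 20)]
cells = cells_from_rows_cols(rows, cols)
```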
According to the image-based form restoration method disclosed by the embodiment of the disclosure, by acquiring a target form image to be identified, acquiring an image vector representation of the target form image and a text vector representation of a text in the target form image and a position vector representation corresponding to the text, inputting the image vector representation, the text vector representation and the position vector representation into a target form restoration model to output an identification result corresponding to the target form image, wherein the identification result comprises a detection frame and the type of each detection frame, and performing form restoration processing according to the identification result and the position information of the text to obtain a target restoration form of the target form image. According to the method and the device for restoring the target table image, the table structure of the target table image to be identified is restored through the target table restoring model, accuracy of restoring the table is guaranteed, and therefore the utilization rate of the target table image to be identified is improved.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the provisions of related laws and regulations, and do not violate public order and good morals.
Corresponding to the training method of the image-based table restoration model provided in the foregoing several embodiments, an embodiment of the present disclosure further provides a training device of the image-based table restoration model, and since the training device of the image-based table restoration model provided in the embodiment of the present disclosure corresponds to the training method of the image-based table restoration model provided in the foregoing several embodiments, implementation of the training method of the image-based table restoration model is also applicable to the training device of the image-based table restoration model provided in the embodiment, and will not be described in detail in the present embodiment.
FIG. 11 is a schematic diagram of a training apparatus for image-based tabular restoration model in accordance with one embodiment of the present disclosure.
As shown in fig. 11, the training device 1000 for the image-based table restoration model includes: an acquisition module 1010, a cross-modality module 1020, an output module 1030, and an adjustment module 1040, wherein:
an obtaining module 1010, configured to obtain a first image vector representation of the table image and a first text vector representation of a text in the table image and a position vector representation corresponding to the text;
A cross-modal module 1020 configured to perform cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation by a table reduction model to obtain a second image vector representation and a second text vector representation;
an output module 1030, configured to obtain respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection frame set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, where the category set includes rows, columns, and merging cells;
and the adjusting module 1040 is configured to adjust model parameters of the table reduction model based on the category set and the detection frame set, and continue training until a final target table reduction model is obtained.
Wherein, output module 1030 is further configured to:
selecting an anchor point vector containing position information aiming at any one of the second image vector representation and the second text vector representation;
acquiring a historical query vector represented by any vector obtained by last training of the table reduction model;
And adding the anchor point vector and the historical query vector to obtain the query vector represented by any vector.
Wherein, output module 1030 is further configured to:
and inputting any vector representation into the corresponding selector, and selecting, by the corresponding selector, a portion of the largest vector representations from that vector representation as the anchor vector of that vector representation.
Wherein, cross-modality module 1020 is further configured to:
adding the position vector representation to the first image vector representation to obtain a fused image vector representation;
adding the position vector representation to the first text vector representation to obtain a fused text vector representation;
and inputting the fused image vector representation and the fused text vector representation into the encoder of the table restoration model to perform cross-modal multi-layer self-attention, obtaining the second image vector representation and the second text vector representation.
Wherein, output module 1030 is further configured to:
adding the second image vector representation and the query vector of the second image vector representation to obtain a third image vector representation;
adding the second text vector representation and the query vector of the second text vector representation to obtain a third text vector representation;
And inputting the third image vector representation and the third text vector representation into a decoder of the table restoration model for multi-layer attention, and outputting the category set and the detection frame set of the table image.
Wherein, the obtaining module 1010 is further configured to:
cutting the table image to obtain a plurality of image fragments;
inputting the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and carrying out feature extraction layer by layer through the plurality of self-attention layers to obtain the image features of each layer, wherein the receptive fields of the plurality of self-attention layers are different in size;
the first image vector representation is derived based on the image features output by each layer.
Wherein the apparatus 1000 is further configured to:
aiming at a self-attention layer i in a plurality of self-attention layers, acquiring an image feature i output by the self-attention layer i, and combining adjacent image features i to obtain a plurality of image feature groups;
and inputting the image feature group into a self-attention layer i+1 for feature extraction, wherein i is an integer greater than or equal to 1.
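The grouping of adjacent image features between self-attention layers resembles Swin-style patch merging; below is a 1-D numpy sketch under that assumption (the real model groups features spatially in 2-D):

```python
import numpy as np

def merge_adjacent(features):
    """Sketch of the step above: pair neighbouring feature vectors output by
    self-attention layer i and concatenate each pair, halving the sequence
    length before it is fed to layer i+1 (a 1-D analogue of patch merging)."""
    n, d = features.shape
    assert n % 2 == 0
    return features.reshape(n // 2, 2 * d)

feats = np.arange(8.0).reshape(4, 2)  # four 2-d feature vectors
merged = merge_adjacent(feats)
```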
Wherein, the obtaining module 1010 is further configured to:
Performing optical character recognition OCR on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, inputting the text into a word segmentation device for segmentation, inquiring a pre-trained text representation dictionary for segmented character token, and obtaining a vector representation corresponding to the text token;
obtaining a first text vector representation of any text according to the vector representation of the text token;
and querying a pre-trained two-dimensional position information representation dictionary for any position, and obtaining a position vector representation of the position information.
Wherein, adjustment module 1040 is further configured to:
performing Hungarian optimal matching between the predicted categories in the category set and the predicted detection frames in the detection frame set to obtain a matching result between the predicted detection frames and the predicted categories;
obtaining a category loss function and a position loss function of the table restoration model according to the matching result and the label result of the table image,
obtaining a loss function of the table reduction model according to the category loss function and the position loss function;
And based on the loss function, adaptively optimizing model parameters of the table restoration model.
According to the training device of the image-based table restoration model of the embodiment of the present disclosure, a first image vector representation of the table image, a first text vector representation of the text in the table image, and a position vector representation corresponding to the text are obtained; cross-modal attention is performed on the first image vector representation, the first text vector representation and the position vector representation by the table restoration model to obtain a second image vector representation and a second text vector representation; the respective query vectors of the second image vector representation and the second text vector representation are obtained, and a category set and a detection frame set of the table image are output based on the query vectors, the second image vector representation and the second text vector representation, where the category set includes rows, columns and merging cells; and the model parameters of the table restoration model are adjusted based on the category set and the detection frame set, and training is continued until the final target table restoration model is obtained. Therefore, the present disclosure obtains a converged target table restoration model, can restore tables based on the target table restoration model, and improves the accuracy of table restoration.
Corresponding to the image-based form restoration method provided in the above embodiments, an embodiment of the present disclosure further provides an image-based form restoration device, and since the image-based form restoration device provided in the embodiment of the present disclosure corresponds to the image-based form restoration method provided in the above embodiments, implementation of the image-based form restoration method is also applicable to the image-based form restoration device provided in the embodiment, and will not be described in detail in the embodiment.
Fig. 12 is a schematic structural diagram of an image-based table restoring apparatus according to an embodiment of the present disclosure.
As shown in fig. 12, the image-based table restoring apparatus 1200 includes: a vector representation module 1210, an input module 1220, and a reduction module 1230. Wherein:
a vector representation module 1210, configured to obtain a target table image to be identified, and obtain an image vector representation of the target table image, a text vector representation of a text in the target table image, and a position vector representation corresponding to the text;
an input module 1220, configured to input the image vector representation, the text vector representation, and the position vector representation into a target table restoration model, so as to output a recognition result corresponding to the target table image, where the recognition result includes a detection frame and a type of each detection frame;
A restoring module 1230, configured to perform a table restoring process according to the recognition result and the location information of the text, to obtain a target restoring table of the target table image, where the target table restoring model is a model obtained by using the training method according to any one of claims 1-9.
Wherein, the reduction module 1230 is further configured to:
sequencing and crossing according to a first detection frame of a row type and a second detection frame of a column type in the identification result to obtain all candidate cells of the table image;
determining a first candidate cell belonging to the third detection frame in the candidate cells according to a third detection frame of the merging cell type in the identification result, and carrying out cell merging on the first candidate cell to obtain the merging cell;
obtaining a form to be filled based on the merging cell and the remaining second candidate cells in the candidate cells;
and filling text into the to-be-filled form according to the position information of the text to obtain a target reduction form of the target form image.
According to the image-based form restoration device of the embodiment of the disclosure, by acquiring a target form image to be identified, acquiring an image vector representation of the target form image and a text vector representation of a text in the target form image and a position vector representation corresponding to the text, inputting the image vector representation, the text vector representation and the position vector representation into a target form restoration model to output an identification result corresponding to the target form image, wherein the identification result comprises a detection frame and the type of each detection frame, and performing form restoration processing according to the identification result and the position information of the text to obtain a target restoration form of the target form image. According to the method and the device for restoring the target table image, the table structure of the target table image to be identified is restored through the target table restoring model, accuracy of restoring the table is guaranteed, and therefore the utilization rate of the target table image to be identified is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 13 illustrates a schematic block diagram of an example electronic device 1300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the apparatus 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data required for the operation of the device 1300 can also be stored. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
Various components in device 1300 are connected to I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, or the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, etc.; and a communication unit 1309 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 1301 performs the respective methods and processes described above, such as a training method of an image-based form restoration model or an image-based form restoration method. For example, in some embodiments, the training method of the image-based form restoration model or the image-based form restoration method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the above-described training method of the image-based form restoration model or the image-based form restoration method may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the training method of the image-based form restoration model or the image-based form restoration method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the training method of an image-based form restoration model or the image-based form restoration method as described above.
It should be appreciated that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (22)

1. A method of training an image-based table restoration model, wherein the method comprises:
acquiring a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
performing, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation to obtain a second image vector representation and a second text vector representation;
acquiring respective query vectors of the second image vector representation and the second text vector representation, and outputting a category set and a detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, wherein the category set comprises rows, columns, and merged cells; and
adjusting model parameters of the table restoration model based on the category set and the detection box set, and continuing training until a final target table restoration model is obtained;
wherein the performing, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation further comprises:
adding the position vector representation to the first image vector representation to obtain a fused image vector representation;
adding the position vector representation to the first text vector representation to obtain a fused text vector representation; and
inputting the fused image vector representation and the fused text vector representation into an encoder of the table restoration model for cross-modal multi-layer self-attention to obtain the second image vector representation and the second text vector representation.
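The fuse-then-encode step of claim 1 can be sketched in Python with NumPy (all function and variable names here are hypothetical, and a single-head, single-layer self-attention pass stands in for the claimed multi-layer encoder):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_encode(img_vecs, txt_vecs, pos_vecs, w_q, w_k, w_v):
    # Fuse the position representation into each modality by addition.
    fused_img = img_vecs + pos_vecs
    fused_txt = txt_vecs + pos_vecs
    # Concatenate both modalities so one self-attention pass attends
    # across image and text tokens together (cross-modal attention).
    tokens = np.concatenate([fused_img, fused_txt], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    n = img_vecs.shape[0]
    # Split back into the "second" image and text representations.
    return out[:n], out[n:]
```

A real model would stack several such layers with multiple heads; the sketch only shows the additive fusion and the shared attention over both modalities.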
2. The method of claim 1, wherein the acquiring respective query vectors of the second image vector representation and the second text vector representation comprises:
for any one of the second image vector representation and the second text vector representation, selecting an anchor vector containing position information;
acquiring a historical query vector of the vector representation obtained in the previous training round of the table restoration model; and
adding the anchor vector and the historical query vector to obtain the query vector of the vector representation.
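Claim 2's anchor-plus-history construction can be sketched as follows (names are hypothetical; the anchor-selection rule here, keeping the rows with the largest activations, is only one plausible reading of claim 3's selector):

```python
import numpy as np

def query_vectors(vec_repr, history_query, num_anchors=2):
    """Anchor vectors plus the query vectors remembered from the
    previous training round, per claim 2 (simplified)."""
    # One reading of claim 3: keep the rows whose activations are
    # largest, on the assumption they carry the strongest position signal.
    scores = np.linalg.norm(vec_repr, axis=-1)
    top = np.sort(np.argsort(scores)[::-1][:num_anchors])
    anchors = vec_repr[top]
    # Claim 2: add the anchor to the historical query vector.
    return anchors + history_query
```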
3. The method of claim 2, wherein the selecting the anchor vector of the vector representation comprises:
inputting the vector representation into a corresponding selector, the corresponding selector selecting, from the vector representation, the largest-scoring portion as the anchor vector of the vector representation.
4. The method of any one of claims 1-3, wherein the outputting the category set and the detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation comprises:
adding the second image vector representation and its query vector to obtain a third image vector representation;
adding the second text vector representation and its query vector to obtain a third text vector representation; and
inputting the third image vector representation and the third text vector representation into an encoder of the table restoration model for multi-layer self-attention, and outputting the category set and the detection box set of the table image.
5. The method of any one of claims 1-3, wherein the acquiring the first image vector representation of the table image comprises:
slicing the table image to obtain a plurality of image slices;
inputting the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and the plurality of self-attention layers perform feature extraction layer by layer to obtain image features of each layer, the perceptive areas of the plurality of self-attention layers being different in size; and
obtaining the first image vector representation based on the image features output by each layer.
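The slicing step of claim 5 can be sketched for a single-channel image (the function name and patch size are hypothetical):

```python
import numpy as np

def slice_into_patches(image, patch):
    """Cut an H x W image into non-overlapping patch x patch slices and
    flatten each slice into one vector (claim 5's slicing step)."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Reshape into (row-block, row-in-block, col-block, col-in-block),
    # bring the block axes together, then flatten each block.
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))
```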
6. The method of claim 5, wherein the method further comprises:
for a self-attention layer i of the plurality of self-attention layers, acquiring image features i output by the self-attention layer i, and combining adjacent image features i to obtain a plurality of image feature groups; and
inputting the image feature groups into a self-attention layer i+1, where i is an integer greater than or equal to 1.
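One plausible reading of claim 6's "combining adjacent image features" is a 2x2 patch-merging step between layers, which halves the feature grid so the next layer sees a larger perceptive area (the function name is hypothetical):

```python
import numpy as np

def merge_adjacent(features):
    """Merge each 2x2 neighbourhood of per-patch features into one longer
    vector, producing the image feature groups fed to layer i+1."""
    g, _, c = features.shape            # square grid of per-patch features
    assert g % 2 == 0, "grid side must be even"
    # Stack the four neighbours of every 2x2 block along the channel axis.
    return np.concatenate([features[0::2, 0::2], features[1::2, 0::2],
                           features[0::2, 1::2], features[1::2, 1::2]],
                          axis=-1)
```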
7. The method of any one of claims 1-3, wherein acquiring the first text vector representation and the position vector representation of the table image comprises:
performing optical character recognition (OCR) on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, inputting the text into a tokenizer for tokenization, and querying a pre-trained text representation dictionary for each resulting token to obtain a vector representation corresponding to the token;
obtaining the first text vector representation of the text according to the vector representations of its tokens; and
for any piece of position information, querying a pre-trained two-dimensional position representation dictionary to obtain the position vector representation of the position information.
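The text branch of claim 7 can be sketched with a toy tokenizer and dictionary (all names are hypothetical; mean-pooling the token vectors is an assumption, since the claim does not fix the pooling rule):

```python
def embed_text(text, tokenizer, token_dict):
    """Tokenize, look each token up in a pre-trained representation
    dictionary, then pool into one first text vector representation."""
    tokens = tokenizer(text)
    vecs = [token_dict[t] for t in tokens if t in token_dict]
    dim = len(next(iter(token_dict.values())))
    if not vecs:                       # no known tokens -> zero vector
        return [0.0] * dim
    # Mean-pool the per-token vectors into a single text vector.
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```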
8. The method of any one of claims 1-3, wherein the adjusting model parameters of the table restoration model based on the category set and the detection box set comprises:
performing Hungarian optimal matching between the predicted categories in the category set and the predicted detection boxes in the detection box set to obtain a matching result between the predicted detection boxes and the predicted categories;
obtaining a category loss function and a position loss function of the table restoration model according to the matching result and the labeling result of the table image;
obtaining a loss function of the table restoration model according to the category loss function and the position loss function; and
adaptively optimizing the model parameters of the table restoration model based on the loss function.
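Claim 8's matching-then-loss scheme can be sketched as follows (names and the exact cost terms are hypothetical; the claim only fixes that a category term and a position term are combined after Hungarian matching):

```python
from itertools import permutations

def hungarian_match(cost):
    """Optimal one-to-one assignment by exhaustion; fine for a sketch,
    while a real implementation would use
    scipy.optimize.linear_sum_assignment."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))

def matching_loss(pred_cls, pred_box, true_cls, true_box):
    """Cost = 0/1 category mismatch + L1 box distance, matched
    Hungarian-style, then summed into one scalar loss."""
    cost = [[(pc != tc) + sum(abs(a - b) for a, b in zip(pb, tb))
             for tc, tb in zip(true_cls, true_box)]
            for pc, pb in zip(pred_cls, pred_box)]
    match = hungarian_match(cost)
    return sum(cost[i][j] for i, j in enumerate(match)), match
```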
9. An image-based table restoration method, wherein the method comprises:
acquiring a target table image to be recognized, and acquiring an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
inputting the image vector representation, the text vector representation, and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and a type of each detection box; and
performing table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is obtained by the training method according to any one of claims 1-8.
10. The method of claim 9, wherein the performing table restoration processing according to the recognition result and the position information of the text to obtain the target restored table of the target table image comprises:
sorting and intersecting first detection boxes of the row type and second detection boxes of the column type in the recognition result to obtain all candidate cells of the table image;
determining, according to a third detection box of the merged-cell type in the recognition result, first candidate cells belonging to the third detection box among the candidate cells, and merging the first candidate cells to obtain a merged cell;
obtaining a table to be filled based on the merged cell and the remaining second candidate cells among the candidate cells; and
filling text into the table to be filled according to the position information of the text to obtain the target restored table of the target table image.
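Claim 10's grid reconstruction can be sketched without the merged-cell step (all names are hypothetical, and boxes are assumed to be `(x0, y0, x1, y1)` with each text carrying a centre point):

```python
def restore_table(row_boxes, col_boxes, texts):
    """Sort the row and column boxes, intersect them into a candidate-cell
    grid, then drop each text into the cell whose row span and column span
    both contain its centre point."""
    rows = sorted(row_boxes, key=lambda b: b[1])    # sort rows by top edge
    cols = sorted(col_boxes, key=lambda b: b[0])    # sort columns by left edge
    grid = [["" for _ in cols] for _ in rows]
    for text, (cx, cy) in texts:
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                if r[1] <= cy <= r[3] and c[0] <= cx <= c[2]:
                    grid[i][j] = text
    return grid
```

A full implementation would additionally collapse the candidate cells covered by each merged-cell box before filling in the text.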
11. An image-based table restoration model training apparatus, wherein the apparatus comprises:
an acquisition module configured to acquire a first image vector representation of a table image, a first text vector representation of text in the table image, and a position vector representation corresponding to the text;
a cross-modal module configured to perform, by the table restoration model, cross-modal attention on the first image vector representation, the first text vector representation, and the position vector representation to obtain a second image vector representation and a second text vector representation;
an output module configured to acquire respective query vectors of the second image vector representation and the second text vector representation, and output a category set and a detection box set of the table image based on the query vectors, the second image vector representation, and the second text vector representation, wherein the category set comprises rows, columns, and merged cells; and
an adjustment module configured to adjust model parameters of the table restoration model based on the category set and the detection box set, and continue training until a final target table restoration model is obtained;
wherein the cross-modal module is further configured to:
add the position vector representation to the first image vector representation to obtain a fused image vector representation;
add the position vector representation to the first text vector representation to obtain a fused text vector representation; and
input the fused image vector representation and the fused text vector representation into an encoder of the table restoration model for cross-modal multi-layer self-attention to obtain the second image vector representation and the second text vector representation.
12. The apparatus of claim 11, wherein the output module is further configured to:
select, for any one of the second image vector representation and the second text vector representation, an anchor vector containing position information;
acquire a historical query vector of the vector representation obtained in the previous training round of the table restoration model; and
add the anchor vector and the historical query vector to obtain the query vector of the vector representation.
13. The apparatus of claim 12, wherein the output module is further configured to:
input the vector representation into a corresponding selector, the corresponding selector selecting, from the vector representation, the largest-scoring portion as the anchor vector of the vector representation.
14. The apparatus of any one of claims 11-13, wherein the output module is further configured to:
add the second image vector representation and its query vector to obtain a third image vector representation;
add the second text vector representation and its query vector to obtain a third text vector representation; and
input the third image vector representation and the third text vector representation into an encoder of the table restoration model for multi-layer self-attention, and output the category set and the detection box set of the table image.
15. The apparatus of any one of claims 11-13, wherein the acquisition module is further configured to:
slice the table image to obtain a plurality of image slices;
input the image slices into a feature extraction network, wherein the feature extraction network comprises a plurality of self-attention layers connected in series, and the plurality of self-attention layers perform feature extraction layer by layer to obtain image features of each layer, the perceptive areas of the plurality of self-attention layers being different in size; and
obtain the first image vector representation based on the image features output by each layer.
16. The apparatus of claim 15, wherein the apparatus is further configured to:
for a self-attention layer i of the plurality of self-attention layers, acquire image features i output by the self-attention layer i, and combine adjacent image features i to obtain a plurality of image feature groups; and
input the image feature groups into a self-attention layer i+1, where i is an integer greater than or equal to 1.
17. The apparatus of any one of claims 11-13, wherein the acquisition module is further configured to:
perform optical character recognition (OCR) on the table image to acquire all texts in the table image and position information corresponding to each text;
for any text, input the text into a tokenizer for tokenization, and query a pre-trained text representation dictionary for each resulting token to obtain a vector representation corresponding to the token;
obtain the first text vector representation of the text according to the vector representations of its tokens; and
for any piece of position information, query a pre-trained two-dimensional position representation dictionary to obtain the position vector representation of the position information.
18. The apparatus of any one of claims 11-13, wherein the adjustment module is further configured to:
perform Hungarian optimal matching between the predicted categories in the category set and the predicted detection boxes in the detection box set to obtain a matching result between the predicted detection boxes and the predicted categories;
obtain a category loss function and a position loss function of the table restoration model according to the matching result and the labeling result of the table image;
obtain a loss function of the table restoration model according to the category loss function and the position loss function; and
adaptively optimize the model parameters of the table restoration model based on the loss function.
19. An image-based table restoration apparatus, wherein the apparatus comprises:
a vector representation module configured to acquire a target table image to be recognized, and acquire an image vector representation of the target table image, a text vector representation of text in the target table image, and a position vector representation corresponding to the text;
an input module configured to input the image vector representation, the text vector representation, and the position vector representation into a target table restoration model to output a recognition result corresponding to the target table image, wherein the recognition result comprises detection boxes and a type of each detection box; and
a restoration module configured to perform table restoration processing according to the recognition result and the position information of the text to obtain a target restored table of the target table image, wherein the target table restoration model is obtained by the training method according to any one of claims 1-8.
20. The apparatus of claim 19, wherein the restoration module is further configured to:
sort and intersect first detection boxes of the row type and second detection boxes of the column type in the recognition result to obtain all candidate cells of the table image;
determine, according to a third detection box of the merged-cell type in the recognition result, first candidate cells belonging to the third detection box among the candidate cells, and merge the first candidate cells to obtain a merged cell;
obtain a table to be filled based on the merged cell and the remaining second candidate cells among the candidate cells; and
fill text into the table to be filled according to the position information of the text to obtain the target restored table of the target table image.
21. An electronic device, comprising a processor and a memory;
wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code to implement the method according to any one of claims 1-8 or 9-10.
22. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-8 or 9-10.
CN202211735420.5A 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method Active CN116152833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211735420.5A CN116152833B (en) 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method


Publications (2)

Publication Number Publication Date
CN116152833A CN116152833A (en) 2023-05-23
CN116152833B true CN116152833B (en) 2023-11-24

Family

ID=86372897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211735420.5A Active CN116152833B (en) 2022-12-30 2022-12-30 Training method of form restoration model based on image and form restoration method

Country Status (1)

Country Link
CN (1) CN116152833B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452707B (en) * 2023-06-20 2023-09-12 城云科技(中国)有限公司 Text generation method and device based on table and application of text generation method and device
CN117475458A (en) * 2023-12-28 2024-01-30 深圳智能思创科技有限公司 Table structure restoration method, apparatus, device and storage medium

Citations (11)

Publication number Priority date Publication date Assignee Title
CN111860257A (en) * 2020-07-10 2020-10-30 上海交通大学 Table identification method and system fusing multiple text features and geometric information
CN112800848A (en) * 2020-12-31 2021-05-14 中电金信软件有限公司 Structured extraction method, device and equipment of information after bill identification
CN113033534A (en) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Method and device for establishing bill type identification model and identifying bill type
CN113239818A (en) * 2021-05-18 2021-08-10 上海交通大学 Cross-modal information extraction method of tabular image based on segmentation and graph convolution neural network
CN113723094A (en) * 2021-09-03 2021-11-30 北京有竹居网络技术有限公司 Text processing method, model training method, device and storage medium
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN114241499A (en) * 2021-12-17 2022-03-25 深圳壹账通智能科技有限公司 Table picture identification method, device and equipment and readable storage medium
CN114463768A (en) * 2022-02-11 2022-05-10 北京有竹居网络技术有限公司 Form recognition method and device, readable medium and electronic equipment
CN114821255A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for fusion of multimodal features
CN115205884A (en) * 2022-07-26 2022-10-18 广州欢聚时代信息科技有限公司 Bill information extraction method and device, equipment, medium and product thereof
CN115273112A (en) * 2022-07-29 2022-11-01 北京金山数字娱乐科技有限公司 Table identification method and device, electronic equipment and readable storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20110258195A1 (en) * 2010-01-15 2011-10-20 Girish Welling Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements
US11847806B2 (en) * 2021-01-20 2023-12-19 Dell Products, L.P. Information extraction from images using neural network techniques and anchor words


Non-Patent Citations (2)

Title
SAM: Self Attention Mechanism for Scene Text Recognition Based on Swin Transformer; Xiang Shuai et al.; International Conference on Multimedia Modeling; pp. 443-454. *
Table structure recognition model fusing edge features and attention; Lyu Xueqiang et al.; Journal of Computer Applications (《计算机应用》); pp. 1-10. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant