CN113849623A - Text visual question answering method and device - Google Patents


Info

Publication number
CN113849623A
Authority
CN
China
Prior art keywords: module, text, answer, representation, predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111186856.9A
Other languages
Chinese (zh)
Other versions
CN113849623B (en)
Inventor
曾港艳
张远
周宇
杨晓萌
王伟平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Communication University of China
Original Assignee
Institute of Information Engineering of CAS
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS and Communication University of China
Priority to CN202111186856.9A
Publication of CN113849623A
Application granted
Publication of CN113849623B
Legal status: Active


Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/35: Clustering; Classification
    • G06F40/126: Character encoding
    • G06F40/30: Semantic analysis
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods


Abstract

A text visual question answering method and device are provided. The method comprises the following steps: obtaining visually enhanced text representations from an image through a text representation acquisition module in a text visual question-answering model, wherein the text representation acquisition module comprises an OCR module and a TVS module; obtaining semantically oriented object representations from the image through an object representation acquisition module in the text visual question-answering model; obtaining a question representation from a question sentence corresponding to the image through a pre-trained language model in the text visual question-answering model; obtaining, through a first multimodal interaction network module in the text visual question-answering model, text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoding vector corresponding to a previous decoding vector from the visually enhanced text representations, the semantically oriented object representations, the question representation, and the previous decoding vector; and obtaining a predicted answer corresponding to the question sentence from the text representation enhancement features and the current predictive decoding vector through an answer prediction module in the text visual question-answering model.

Description

Text visual question answering method and device
Technical Field
The present application relates generally to the field of information processing technology, and more particularly to a method and apparatus for training a text visual question-answering model, and to a text visual question answering (TextVQA) method and apparatus.
Background
At present, because the generic visual question answering (VQA) method cannot handle the textual information present in images, developers have proposed text visual question answering (TextVQA) technology. However, TextVQA is highly challenging, because answering a question related to the text in an image requires simultaneously considering information from multiple modalities, such as the visual scene and the text, as well as the relationships between them. Therefore, the mainstream approach is to introduce an external optical character recognition (OCR) module as a separate preprocessing module: after the OCR module detects and recognizes the text information in the image, the acquired text information is passed, together with other modal information such as the question and the visual content, to the VQA model in order to improve TextVQA performance.
However, in the prior art the OCR module is a separate preprocessing module that does not participate in the optimization of the whole model, so TextVQA performance is strongly affected by OCR accuracy. This manifests as two error-accumulation and propagation phenomena: (1) OCR errors corrupt the direct semantic encoding of the text and bias the interactive reasoning over the multimodal information, so that the accurate answer cannot be located; (2) even when the reasoning and answer localization are correct, OCR errors can still cause the answer ultimately "copied" from the OCR results to be wrong.
In addition, because the visual modality is mainly represented by the visual features of the objects detected in the image, a semantic gap exists when it interacts with the text and question modalities, so that the multimodal information cannot be fused effectively.
Disclosure of Invention
To at least solve the above problems in the prior art, the present invention provides a method and an apparatus for training a text visual question-answering model, and a text visual question answering method and apparatus.
A first aspect of the present invention provides a training method for a text visual question-answering model, wherein the text visual question-answering model comprises a pre-trained language model, a text representation acquisition module, an object representation acquisition module, a first multimodal interaction network module, and an answer prediction module, and the training method comprises the following steps: processing a first training data set by the text representation acquisition module to obtain predicted semantic features corresponding to image text regions in the first training data set and a predicted text recognition result, calculating a first loss function from the predicted semantic features and the real semantic features corresponding to the image text regions, calculating a second loss function from the predicted text recognition result and the real text recognition result corresponding to the image text regions, and training the text representation acquisition module by adjusting its model parameters using the first loss function and the second loss function; processing a second training data set by the object representation acquisition module to obtain predicted object class embedding vectors for the objects in the second training data set, calculating a third loss function from the predicted object class embedding vectors and the real object class vectors of the objects, and training the object representation acquisition module by adjusting its model parameters according to the third loss function; and acquiring visually enhanced text representations for a third training data set through the trained text representation acquisition module, acquiring semantically oriented object representations for the third training data set through the trained object representation acquisition module, acquiring question representations for question sentences through the pre-trained language model, and training the first multimodal interaction network module and the answer prediction module according to the visually enhanced text representations, the semantically oriented object representations, the question representations, and a previous decoding vector.
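As a concrete illustration of the staged schedule above, the following sketch shows stages 1 and 2 in PyTorch-style code (the third stage is sketched after the corresponding step below); the loss forms (MSE for the semantic loss, cross-entropy for recognition), the batch keys, and all function names are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn.functional as F

def pretrain_encoders(text_rep, obj_rep, loader1, loader2, opt1, opt2):
    """Stages 1 and 2 of the assumed training schedule."""
    # Stage 1: text representation module -- semantic loss L1 + recognition loss L2.
    for batch in loader1:
        pred_sem, char_logits = text_rep(batch["text_regions"])           # TVS outputs
        l1 = F.mse_loss(pred_sem, batch["gt_semantic"])                   # vs. real semantic feature (assumed MSE)
        l2 = F.cross_entropy(char_logits.flatten(0, 1),                   # vs. real character sequence
                             batch["gt_chars"].flatten())
        opt1.zero_grad(); (l1 + l2).backward(); opt1.step()

    # Stage 2: object representation module -- class-embedding loss L3.
    for batch in loader2:
        pred_emb = obj_rep(batch["images"], batch["proposals"])
        l3 = F.mse_loss(pred_emb, batch["gt_class_embedding"])            # vs. real object class vector
        opt2.zero_grad(); l3.backward(); opt2.step()
```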
Optionally, the text representation acquisition module comprises an optical character recognition (OCR) module and a text-dependent visual semantic mapping network (TVS) module, wherein the step of processing the first training data set by the text representation acquisition module to obtain the predicted semantic features and the predicted text recognition result comprises: performing text rectification, by a rectification module in the TVS module, on the image text regions detected by the OCR module from the first training data set; encoding the rectified image text regions by an encoding module in the TVS module to obtain visual features; deriving, by a first linear layer in an intermediate semantic module in the TVS module, the predicted semantic features from the visual features, and deriving, by a second linear layer in the intermediate semantic module, semantic information corresponding to the predicted semantic features from the predicted semantic features; and obtaining the predicted text recognition result by a decoding module in the TVS module according to the semantic information and the visual features.
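The following is a minimal sketch of how the TVS module's rectification, encoding, intermediate semantic, and decoding sub-modules could be wired together; the layer sizes, the identity rectifier stand-in, and the pooling choices are assumptions.

```python
import torch
import torch.nn as nn

class TVS(nn.Module):
    """Sketch of a text-dependent visual semantic (TVS) network: rectify the cropped
    text region, encode it to visual features, map them to a predicted semantic
    feature (first linear layer), expand it to semantic information (second linear
    layer), then decode a character sequence."""
    def __init__(self, d_vis=512, d_sem=300, vocab=97, max_len=32):
        super().__init__()
        self.rectify = nn.Identity()                      # stands in for a learned rectifier
        self.encoder = nn.Sequential(                     # small CNN producing a feature sequence
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, d_vis, 3, 2, 1), nn.AdaptiveAvgPool2d((1, max_len)))
        self.to_sem = nn.Linear(d_vis, d_sem)             # first linear layer: visual -> predicted semantics
        self.sem_expand = nn.Linear(d_sem, d_vis)         # second linear layer: semantics -> semantic info
        self.decoder = nn.GRU(d_vis, d_vis, batch_first=True)
        self.classifier = nn.Linear(d_vis, vocab)

    def forward(self, crop):                              # crop: (B, 3, H, W) cropped text region
        x = self.rectify(crop)
        vis = self.encoder(x).squeeze(2).transpose(1, 2)  # (B, max_len, d_vis) visual features
        sem = self.to_sem(vis.mean(dim=1))                # pooled predicted semantic feature (B, d_sem)
        sem_info = self.sem_expand(sem).unsqueeze(0)      # decoder initial state from semantics
        dec_out, _ = self.decoder(vis, sem_info.contiguous())
        return sem, self.classifier(dec_out)              # semantic feature + per-step character logits
```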
Optionally, the object representation obtaining module includes a convolutional neural network, a region candidate network, a region of interest feature pooling module, and an object class embedding vector prediction module, wherein the step of processing the second training data set by the object representation obtaining module to obtain the predicted object class embedding vector includes: performing convolution operation on the second training data set by a convolution neural network to obtain image visual characteristics; obtaining object candidate regions from the image visual features by a region candidate network; processing the object candidate region and the image visual features by a region-of-interest feature pooling module to obtain visual features of the object candidate region; processing, by an object class embedding vector prediction module, visual features of the object candidate region to obtain the object class embedding vector.
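A hedged sketch of this object-representation training branch follows, assuming the region proposals come from an external region candidate network and that the class-embedding loss regresses toward a word-embedding target; module names, sizes, and the loss form are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectClassEmbedding(nn.Module):
    """Sketch of the object branch at training time: convolutional backbone,
    externally supplied proposals, RoI feature pooling, and a head that predicts
    an object-class embedding vector."""
    def __init__(self, d_feat=256, d_emb=300):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, d_feat, 3, 2, 1), nn.ReLU(),
                                      nn.Conv2d(d_feat, d_feat, 3, 2, 1), nn.ReLU())
        self.emb_head = nn.Sequential(nn.Flatten(), nn.Linear(d_feat * 7 * 7, d_emb))

    def forward(self, images, proposals):
        # images: (B, 3, H, W); proposals: list of (N_i, 4) boxes in image coordinates
        fmap = self.backbone(images)                                   # stride-4 feature map
        rois = roi_align(fmap, proposals, output_size=7, spatial_scale=0.25)
        return self.emb_head(rois)                                     # (sum N_i, d_emb) class embeddings

def class_embedding_loss(pred_emb, gt_emb):
    # Loss L3: regress toward the embedding of the ground-truth class word
    # (e.g. a GloVe vector); the exact distance is an assumption.
    return torch.nn.functional.mse_loss(pred_emb, gt_emb)
```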
Optionally, the step of obtaining, by the trained text representation obtaining module, a visually enhanced text representation for the third training data set comprises: detecting a text region containing text from the training image in the third training data set by an OCR module; carrying out character recognition on the detected character area through an OCR module and coding a recognition result to obtain a first semantic feature; extracting, by a TVS module, a second semantic feature directly derived from a visual feature of a detected text region from the text region; combining the first semantic feature and the second semantic feature to obtain the visually enhanced textual representation.
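One way the two semantic features could be combined is sketched below; whether the patent concatenates, sums, or gates them is not specified here, so both options are shown as assumptions.

```python
import torch

def visually_enhanced_text_repr(ocr_sem: torch.Tensor, tvs_sem: torch.Tensor, mode: str = "concat"):
    """Combine the semantic feature encoded from the OCR-recognized string (first
    semantic feature) with the TVS feature predicted directly from the region's
    visual appearance (second semantic feature)."""
    if mode == "concat":
        return torch.cat([ocr_sem, tvs_sem], dim=-1)   # (N_tokens, d1 + d2)
    return ocr_sem + tvs_sem                            # summation requires d1 == d2
```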
Optionally, the object representation acquiring module further comprises an object detection head module and an object property prediction module, wherein the step of acquiring, by the trained object representation acquiring module, a semantically oriented object representation for the third training data set comprises: performing a convolution operation on the training images from the third training data set by a convolutional neural network to obtain image visual features; obtaining object candidate regions from the image visual features by a region candidate network; processing the object candidate region and the image visual features by a region-of-interest feature pooling module to obtain visual features of the object candidate region; processing the visual features of the object candidate region by an object detection head module to determine a corrected object region of the object candidate region and the visual features and the position features of the object region; processing the visual features of the object candidate region by an object class embedding vector prediction module to obtain an object class embedding vector; combining the visual features of the object region, the positional features of the object region, and the object class embedding vector to obtain the semantically-oriented object representation.
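A sketch of how the three outputs could be fused into the semantically oriented object representation; the projection-and-sum fusion, the layer normalization, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SemanticObjectRepr(nn.Module):
    """Project the appearance feature, normalized box geometry, and predicted
    class-embedding vector to a common width and fuse them."""
    def __init__(self, d_vis=2048, d_emb=300, d_model=768):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, d_model)
        self.box_proj = nn.Linear(4, d_model)       # (x1, y1, x2, y2) normalized to [0, 1]
        self.emb_proj = nn.Linear(d_emb, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, obj_vis, obj_box, obj_cls_emb):
        return self.norm(self.vis_proj(obj_vis) + self.box_proj(obj_box)
                         + self.emb_proj(obj_cls_emb))
```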
Optionally, the step of training the first multimodal interaction network module and the answer prediction module according to the visually enhanced text representations, the semantically oriented object representations, the question representations, and the previous decoding vector comprises: inputting the visually enhanced text representations, the semantically oriented object representations, the question representations, and the previous decoding vector into the first multimodal interaction network module to obtain text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoding vector corresponding to the previous decoding vector; inputting the text representation enhancement features and the current predictive decoding vector into the answer prediction module to obtain a first prediction score and a second prediction score corresponding to the question sentence; and calculating a fourth loss function from the first prediction score, the second prediction score, and the first real score and the second real score of the standard answer to the question sentence, and training the first multimodal interaction network module and the answer prediction module by adjusting their model parameters according to the fourth loss function.
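A minimal sketch of one training step for this third stage, assuming the fourth loss is a binary cross-entropy over the two score vectors against soft ground-truth scores; the function signature and loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def train_fusion_and_answer(q_feats, obj_feats, text_feats, prev_dec,
                            fusion, answer_head, gt_ocr_scores, gt_vocab_scores, opt):
    """One step of stage 3: first multimodal interaction network + answer prediction."""
    enhanced_text, cur_dec = fusion(q_feats, obj_feats, text_feats, prev_dec)
    ocr_scores, vocab_scores = answer_head(enhanced_text, cur_dec)   # first / second prediction scores
    l4 = (F.binary_cross_entropy_with_logits(ocr_scores, gt_ocr_scores)
          + F.binary_cross_entropy_with_logits(vocab_scores, gt_vocab_scores))
    opt.zero_grad(); l4.backward(); opt.step()
    return l4.item()
```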
Optionally, the text visual question-answering model further includes a context-aware answer modification module, and the training method further includes: training the context-aware answer modification module with a fourth training data set, wherein the fourth training data set includes the standard answer for each question sentence provided by a predetermined data set and the recognition results of a plurality of external OCR modules, wherein the predetermined data set includes the question sentences and the input image corresponding to each question sentence, and the recognition results of the plurality of external OCR modules are obtained by the plurality of external OCR modules processing the image text when the output of the answer prediction module for each question sentence points to image text in the corresponding input image.
Optionally, the context-aware answer modification module includes a second multimodal interaction network module and a linear mapping layer, wherein the step of training the context-aware answer modification module in the text visual question-answering model with the fourth training data set includes: calculating the intersection-over-union between the standard answer of each question sentence and the recognition result obtained by each of the plurality of external OCR modules recognizing the input image corresponding to that question; inputting the recognition result of each of the plurality of external OCR modules for the input image corresponding to each question, together with its context information, into the second multimodal interaction network module for fused interaction; inputting the output vector of the second multimodal interaction network module corresponding to each recognition result into the linear mapping layer to obtain a relevance score for each recognition result; and training the second multimodal interaction network module by using the relevance score and the intersection-over-union of each recognition result to adjust the model parameters of the second multimodal interaction network module.
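A hedged sketch of this training step follows, assuming the "intersection ratio" is a token-level intersection-over-union between each external OCR result and the standard answer, and that the relevance scores are trained with a binary cross-entropy against those ratios; both choices, along with all names, are illustrative.

```python
import torch
import torch.nn.functional as F

def answer_iou(candidate: str, standard: str) -> float:
    """Token-level intersection-over-union between an OCR candidate and the
    standard answer -- one plausible reading of the 'intersection ratio'."""
    a, b = set(candidate.lower().split()), set(standard.lower().split())
    return len(a & b) / max(len(a | b), 1)

def crm_training_step(crm_fusion, score_head, candidates, contexts, standard, opt):
    # candidates: list of external-OCR recognition strings; contexts: their encoded
    # context features, shape (K, T, d); crm_fusion is the second multimodal
    # interaction network, score_head the linear mapping layer.
    targets = torch.tensor([answer_iou(c, standard) for c in candidates])
    fused = crm_fusion(contexts)                      # (K, d) pooled candidate representations
    scores = score_head(fused).squeeze(-1)            # (K,) relevance scores
    loss = F.binary_cross_entropy_with_logits(scores, targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```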
A second aspect of the present invention provides a text visual question answering method implemented by a text visual question-answering model, wherein the text visual question-answering model includes a text representation acquisition module, an object representation acquisition module, a pre-trained language model, a first multimodal interaction network module, and an answer prediction module, and the text visual question answering method includes: obtaining visually enhanced text representations from an image through the text representation acquisition module, wherein the text representation acquisition module comprises an optical character recognition (OCR) module and a text-dependent visual semantic mapping network (TVS) module; obtaining semantically oriented object representations from the image through the object representation acquisition module; obtaining a question representation from a question sentence corresponding to the image through the pre-trained language model; obtaining, through the first multimodal interaction network module, from the visually enhanced text representations, the semantically oriented object representations, the question representation, and a previous decoding vector, text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoding vector corresponding to the previous decoding vector; and obtaining a predicted answer corresponding to the question sentence from the text representation enhancement features and the current predictive decoding vector through the answer prediction module.
Optionally, the step of obtaining the visually enhanced text representation from the image by a text representation obtaining module comprises: detecting a text area containing text from the image through an OCR module; carrying out character recognition on the detected character area through an OCR module and coding a recognition result to obtain a first semantic feature; extracting, by a TVS module, a second semantic feature directly derived from a visual feature of a detected text region from the text region; combining the first semantic feature and the second semantic feature to obtain the visually enhanced textual representation.
Optionally, the TVS module includes a text image correction module, an encoding module, and an intermediate semantic module, wherein the step of extracting, by using the TVS module, a second semantic feature directly derived from the visual feature of the text region from the detected text region includes: the character image correction module corrects the characters of the character area; coding the corrected text region by a coding module to obtain the visual feature; predicting, by an intermediate semantic module, a second semantic feature from the visual feature.
Optionally, the object representation obtaining module includes a convolutional neural network, a region candidate network, a region-of-interest feature pooling module, an object detection head module, and an object class embedding vector prediction module, wherein the step of obtaining the semantically oriented object representation from the image by the object representation obtaining module includes: performing a convolution operation on the image by a convolution neural network to obtain image visual features; obtaining object candidate regions from the image visual features by a region candidate network; processing the object candidate region and the image visual features by a region-of-interest feature pooling module to obtain visual features of the object candidate region; processing the visual features of the object candidate region by an object detection head module to determine a corrected object region of the object candidate region and the visual features and the position features of the object region; processing, by an object class embedding vector prediction module, visual features of the object candidate region to obtain the object class embedding vector; combining the visual features of the object region, the positional features of the object region, and the object class embedding vector to obtain the semantically-oriented object representation.
Optionally, the step of obtaining, by the first multimodal interaction network module, the text representation enhancement features and the current predictive decoding vector from the visually enhanced text representations, the semantically oriented object representations, the question representation, and the previous decoding vector comprises: performing intra-modal and inter-modal interactions on the visually enhanced text representations, the semantically oriented object representations, the question representation, and the previous decoding vector using the first multimodal interaction network module to obtain the text representation enhancement features and the current predictive decoding vector.
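A minimal sketch of such a module, assuming a standard transformer encoder over the concatenated question, object, text, and previous-decoding tokens realizes the intra-modal and inter-modal interactions; the transformer choice, depth, and widths are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalInteraction(nn.Module):
    """Self-attention over the joint token sequence realizes both intra- and
    inter-modal interaction; the enhanced text tokens and decoding tokens are
    sliced back out of the output."""
    def __init__(self, d_model=768, n_layers=4, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, q_feats, obj_feats, text_feats, prev_dec):
        # All inputs: (B, N_modality, d_model)
        joint = torch.cat([q_feats, obj_feats, text_feats, prev_dec], dim=1)
        out = self.encoder(joint)
        n_q, n_o, n_t = q_feats.size(1), obj_feats.size(1), text_feats.size(1)
        enhanced_text = out[:, n_q + n_o : n_q + n_o + n_t]   # text representation enhancement features
        cur_dec = out[:, n_q + n_o + n_t :]                   # current predictive decoding vectors
        return enhanced_text, cur_dec
```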
Optionally, the answer prediction module includes a bilinear interaction network, a linear classification layer, and a determination module, wherein the step of obtaining the predicted answer from the text representation enhancement features and the current predictive decoding vector through the answer prediction module includes: processing the text representation enhancement features and the current predictive decoding vector using the bilinear interaction network to determine a first score for the case where the predicted answer is image text in the image; performing linear classification on the current predictive decoding vector using the linear classification layer to determine a second score for the case where the predicted answer is a dictionary word in a fixed dictionary; and determining, with the determination module, the image text or the dictionary word corresponding to the maximum of the first score and the second score as the predicted answer.
Optionally, the step of processing the text representation enhancement features and the current predictive decoding vector using the bilinear interaction network to determine the first score includes: performing linear processing on the text representation enhancement features using a first linear layer in the bilinear interaction network to obtain a first linear processing result; performing linear processing on the current predictive decoding vector using a second linear layer in the bilinear interaction network to obtain a second linear processing result; and calculating the similarity between the first linear processing result and the second linear processing result as the first score.
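The two-branch scoring and the final selection could look roughly as follows; the dot-product similarity, the dimensions, and the single-sample pick_answer helper are assumptions.

```python
import torch
import torch.nn as nn

class AnswerPrediction(nn.Module):
    """Bilinear scoring of each image-text token against the current decoding
    vector (first score) and a linear classifier over the fixed dictionary
    (second score)."""
    def __init__(self, d_model=768, vocab_size=5000):
        super().__init__()
        self.text_proj = nn.Linear(d_model, d_model)   # first linear layer of the bilinear network
        self.dec_proj = nn.Linear(d_model, d_model)    # second linear layer of the bilinear network
        self.vocab_cls = nn.Linear(d_model, vocab_size)

    def forward(self, enhanced_text, cur_dec):
        # enhanced_text: (B, N_ocr, d); cur_dec: (B, d) current predictive decoding vector
        s_ocr = torch.einsum("bnd,bd->bn", self.text_proj(enhanced_text),
                             self.dec_proj(cur_dec))    # similarity as the first score, per OCR token
        s_vocab = self.vocab_cls(cur_dec)               # second score, per dictionary word
        return s_ocr, s_vocab

def pick_answer(s_ocr, s_vocab, ocr_tokens, vocab):
    # Assumes batch size 1: take the overall maximum of the two score vectors.
    if s_ocr.max() >= s_vocab.max():
        return ocr_tokens[s_ocr.argmax().item()]        # copy the image text
    return vocab[s_vocab.argmax().item()]               # emit the dictionary word
```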
Optionally, the text visual question-answering model further includes a context-aware answer modification module (CRM), and the text visual question answering method further includes: when image text is determined as the predicted answer, correcting the predicted answer using the CRM to obtain a corrected predicted answer.
Optionally, the step of modifying the predicted answer with CRM to obtain a modified predicted answer includes: processing a plurality of pre-stored external OCR recognition results and the predicted answer to obtain a score value corresponding to each of the plurality of external OCR recognition results; determining an external OCR recognition result having a highest score value among the plurality of external OCR recognition results as the corrected predicted answer.
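A sketch of the CRM at inference time; the pooling of the fused candidate features and the assumption that each context encoding already incorporates the predicted answer and question are illustrative.

```python
import torch

@torch.no_grad()
def crm_correct(ocr_candidates, contexts, crm_fusion, score_head):
    """Score every pre-stored external OCR recognition result in its context and
    return the highest-scoring candidate as the corrected predicted answer.
    contexts: (K, T, d) encodings assumed to combine each candidate with the
    predicted answer and question context."""
    fused = crm_fusion(contexts)                # (K, d) one vector per candidate
    scores = score_head(fused).squeeze(-1)      # (K,) relevance scores
    return ocr_candidates[scores.argmax().item()]
```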
Optionally, the text visual question-answering model is trained by using the training method described above.
A third aspect of the present invention provides a training device for a text visual question-answer model, where the text visual question-answer model includes a pre-trained language model, a character representation acquisition module, an object representation acquisition module, a first multi-modal interaction network module, and an answer prediction module, and the training device includes: a first training unit configured to process a first training data set through a character representation acquisition module to obtain a predicted semantic feature corresponding to an image character region in the first training data set and a predicted character recognition result, calculate a first loss function according to the predicted semantic feature and a real semantic feature corresponding to the image character region, calculate a second loss function according to the predicted character recognition result and a real character recognition result corresponding to the image character region, and train the character representation acquisition module by adjusting a model parameter of the character representation acquisition module using the first loss function and the second loss function; a second training unit configured to obtain a predicted object class embedding vector of an object in a second training data set by processing the second training data set by an object representation acquisition module, calculate a third loss function according to the predicted object class embedding vector and a real object class vector of the object, and train the object representation acquisition module by adjusting a model parameter of the object representation acquisition module according to the third loss function; the third training unit is configured to acquire visually enhanced character representations for a third training data set through the trained character representation acquisition module, acquire semantically-oriented object representations for the third training data set through the trained object representation acquisition module, acquire problem representations for problem sentences through a pre-trained language model, and train the first multi-modal interaction network module and the answer prediction module according to the visually enhanced character representations, the semantically-oriented object representations, the problem representations and a previous decoding vector.
Optionally, the word representation obtaining module comprises an Optical Character Recognition (OCR) module and a word-dependent visual semantic mapping network (TVS) module, wherein the first training unit obtains the predicted semantic features and the predicted word recognition result by: performing character correction on the image character area detected by the OCR module from the first training data set by a correction module in the TVS module; coding the image character region with corrected characters by a coding module in the TVS module to obtain visual characteristics; deriving, by a first linear layer in an intermediate semantic module in the TVS module, the predicted semantic feature from the visual feature and deriving, by a second linear layer in the intermediate semantic module, semantic information corresponding to the predicted semantic feature from the predicted semantic feature; and obtaining the predicted character recognition result by a decoding module in the TVS module according to the semantic information and the visual features.
Optionally, the object characterization acquisition module includes a convolutional neural network, a region candidate network, a region of interest feature pooling module, and an object class embedding vector prediction module, where the second training unit obtains the predicted object class embedding vector by: performing convolution operation on the second training data set by a convolution neural network to obtain image visual characteristics; obtaining object candidate regions from the image visual features by a region candidate network; processing the object candidate region and the image visual features by a region-of-interest feature pooling module to obtain visual features of the object candidate region; processing, by an object class embedding vector prediction module, visual features of the object candidate region to obtain the object class embedding vector.
Optionally, the third training unit obtains the visually enhanced textual representation by: detecting a text region containing text from the training image in the third training data set by an OCR module; carrying out character recognition on the detected character area through an OCR module and coding a recognition result to obtain a first semantic feature; extracting, by a TVS module, a second semantic feature directly derived from a visual feature of a detected text region from the text region; combining the first semantic feature and the second semantic feature to obtain the visually enhanced textual representation.
Optionally, the object representation acquiring module further comprises an object detection head module and an object property prediction module, wherein the third training unit acquires the semantically oriented object representation by: performing a convolution operation on the training images from the third training data set by a convolutional neural network to obtain image visual features; obtaining object candidate regions from the image visual features by a region candidate network; processing the object candidate region and the image visual features by a region-of-interest feature pooling module to obtain visual features of the object candidate region; processing the visual features of the object candidate region by an object detection head module to determine a corrected object region of the object candidate region and the visual features and the position features of the object region; processing the visual features of the object candidate region by an object class embedding vector prediction module to obtain an object class embedding vector; combining the visual features of the object region, the positional features of the object region, and the object class embedding vector to obtain the semantically-oriented object representation.
Optionally, the third training unit trains the first multimodal interaction network module and the answer prediction module by: inputting the visually enhanced word representations, the semantically oriented object representations, the problem representations, and a previously decoded vector to a first multimodal interaction network module to obtain word representation enhancement features corresponding to the visually enhanced word representations and a current predictive decoding vector corresponding to the previously decoded vector; inputting the word representation enhanced features and the current predictive decoding vector to an answer prediction module to obtain a first prediction score and a second prediction score corresponding to the question statement; and calculating a fourth loss function according to the first prediction score and the second prediction score and the first real score and the second real score of the standard answer of the question sentence, and training the first multi-modal interaction network module and the answer prediction module by adjusting the model parameters of the first multi-modal interaction network module and the answer prediction module according to the fourth loss function.
Optionally, the text visual question-answering model further includes a context-aware answer modification module, wherein the training apparatus further includes: a fourth training unit configured to train the context-aware answer modification module with a fourth training data set, wherein the fourth training data set includes the standard answer for each question sentence provided by a predetermined data set and the recognition results of a plurality of external OCR modules, wherein the predetermined data set includes the question sentences and the input image corresponding to each question sentence, and the recognition results of the plurality of external OCR modules are obtained by the plurality of external OCR modules processing the image text when the output of the answer prediction module for each question sentence points to image text in the corresponding input image.
Optionally, the context-aware answer modification module includes a second multimodal interaction network module and a linear mapping layer, wherein the fourth training unit trains the context-aware answer modification module in the text visual question-answering model with the fourth training data set by: calculating the intersection-over-union between the standard answer of each question sentence and the recognition result obtained by each of the plurality of external OCR modules recognizing the input image corresponding to that question; inputting the recognition result of each of the plurality of external OCR modules for the input image corresponding to each question, together with its context information, into the second multimodal interaction network module for fused interaction; inputting the output vector of the second multimodal interaction network module corresponding to each recognition result into the linear mapping layer to obtain a relevance score for each recognition result; and training the second multimodal interaction network module by using the relevance score and the intersection-over-union of each recognition result to adjust the model parameters of the second multimodal interaction network module.
A fourth aspect of the present invention provides a text visual question answering device implemented by a text visual question-answering model, where the text visual question-answering model includes a text representation acquisition module, an object representation acquisition module, a pre-trained language model, a first multimodal interaction network module, and an answer prediction module, and the text visual question answering device includes: a text representation acquisition unit configured to obtain visually enhanced text representations from an image through the text representation acquisition module, wherein the text representation acquisition module comprises an optical character recognition (OCR) module and a text-dependent visual semantic mapping network (TVS) module; an object representation acquisition unit configured to obtain semantically oriented object representations from the image through the object representation acquisition module; a question representation acquisition unit configured to obtain a question representation from a question sentence corresponding to the image through the pre-trained language model; a first multimodal interaction network unit configured to obtain, through the first multimodal interaction network module, from the visually enhanced text representations, the semantically oriented object representations, the question representation, and a previous decoding vector, text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoding vector corresponding to the previous decoding vector; and an answer prediction unit configured to obtain a predicted answer corresponding to the question sentence from the text representation enhancement features and the current predictive decoding vector through the answer prediction module.
Optionally, the text representation obtaining module obtains the visually enhanced text representation from the image input by: detecting a text area containing text from the image through an OCR module; carrying out character recognition on the detected character area through an OCR module and coding a recognition result to obtain a first semantic feature; extracting, by a TVS module, a second semantic feature directly derived from a visual feature of a detected text region from the text region; combining the first semantic feature and the second semantic feature to obtain the visually enhanced textual representation.
Optionally, the TVS module includes: the character image correction module is configured to correct characters of the character area; an encoding module configured to encode the text region after text rectification to obtain the visual feature; an intermediate semantic module configured to predict the second semantic features from the visual features.
Optionally, the object representation acquiring module comprises: a convolutional neural network configured to perform a convolution operation on the image to obtain an image visual feature; a region candidate network configured to obtain object candidate regions from the image visual features; a region-of-interest feature pooling module configured to process the object candidate region and the image visual features to obtain visual features of the object candidate region; an object detection head module configured to process the visual features of the object candidate region to determine a corrected object region of the object candidate region and the visual features and position features of the object region; an object class embedding vector prediction module configured to process the visual features of the object candidate region to obtain the object class embedding vector, wherein the object representation obtaining unit obtains the semantically-oriented object representation by combining the visual features of the object region, the position features of the object region, and the object class embedding vector.
Optionally, the first multimodal interaction network module obtains the text representation enhancement features and the current predictive decoding vector from the visually enhanced text representations, the semantically oriented object representations, the question representation, and the previous decoding vector by performing intra-modal and inter-modal interactions on the visually enhanced text representations, the semantically oriented object representations, the question representation, and the previous decoding vector to obtain the text representation enhancement features and the current predictive decoding vector.
Optionally, the answer prediction module comprises: a bilinear interaction network configured to process the text representation enhancement features and the current predictive decoding vector to determine a first score for the case where the predicted answer is image text in the image; a linear classification layer configured to perform linear classification on the current predictive decoding vector to determine a second score for the case where the predicted answer is a dictionary word in a fixed dictionary; and a determination module configured to determine the image text or the dictionary word corresponding to the maximum of the first score and the second score as the predicted answer.
Optionally, the bilinear interaction network determines the first score by processing the text representation enhancement features and the current predictive decoding vector as follows: performing linear processing on the text representation enhancement features using a first linear layer in the bilinear interaction network to obtain a first linear processing result; performing linear processing on the current predictive decoding vector using a second linear layer in the bilinear interaction network to obtain a second linear processing result; and calculating the similarity between the first linear processing result and the second linear processing result as the first score.
Optionally, the text visual question answering device further comprises: a context-aware answer modification unit configured to, when image text is determined as the predicted answer, modify the predicted answer to obtain a modified predicted answer.
Optionally, CRM obtains the revised predicted answer by revising the predicted answer by: processing a plurality of pre-stored external OCR recognition results and the predicted answer to obtain a score value corresponding to each of the plurality of external OCR recognition results; determining an external OCR recognition result having a highest score value among the plurality of external OCR recognition results as the corrected predicted answer.
Optionally, the text visual question-answering model is trained using the training method described above.
A fifth aspect of the present invention provides an electronic device comprising a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement the method for training a text visual question-answering model or the text visual question answering method as described above.
A sixth aspect of the present invention provides a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for training a text visual question-answering model or the text visual question answering method as described above.
A seventh aspect of the present invention provides a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the method for training a text visual question-answering model or the text visual question answering method as described above.
The technical solution provided by the embodiments of the present invention has at least the following beneficial effects. By integrating OCR into the forward processing flow of TextVQA, that is, by exploiting multimodal cues from both the text detection and the text recognition stages, a reasonable semantic representation of the text can be obtained even when the text is not recognized accurately; in other words, a more robust feature representation is obtained and answer errors caused by OCR errors can be effectively corrected, thereby mitigating the influence of OCR accuracy on multimodal information reasoning and answer prediction. In addition, the invention realizes a mapping from vision to semantics by design, thereby obtaining visually enhanced text representations and semantically oriented object representations, enhancing the robustness of the feature representations and reducing the influence of OCR errors and object recognition errors on reasoning. Furthermore, the decoded answer is adaptively corrected using the rich context information of the TextVQA task, so that a more accurate answer is obtained and the reliability of the answer is enhanced. Finally, visual objects are also represented as a special kind of scene text, which eliminates the semantic gap among the multiple modalities.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a diagram illustrating an application scenario of a textual visual question answering method and apparatus according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method of training a textual visual question-answering model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a textual visual question-answering model according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating a detailed configuration of a TVS module according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a process of processing a first training data set by a word representation acquisition module to obtain the predicted semantic features and predicted word recognition results according to an exemplary embodiment of the present disclosure;
fig. 6 is a block diagram illustrating a configuration of an encoding module according to an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating a detailed configuration of an object representation acquisition module according to an exemplary embodiment of the present disclosure;
FIG. 8 is a flowchart illustrating a process of processing a second training data set by an object representation acquisition module to obtain the predicted object class embedding vector according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flow chart illustrating a process of obtaining a visually enhanced textual representation according to an exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart illustrating a process of obtaining semantically-oriented object representations in accordance with an exemplary embodiment of the present disclosure;
FIG. 11 is a flowchart illustrating a process of training a first multimodal interaction network module and an answer prediction module according to an exemplary embodiment of the present disclosure;
FIG. 12 is a block diagram illustrating components of an answer prediction module according to an exemplary embodiment of the present disclosure;
FIG. 13 is a schematic diagram illustrating a textual visual question-and-answer model according to another exemplary embodiment of the present disclosure;
FIG. 14 is a diagram illustrating a process of training a context-aware based answer modification module according to an exemplary embodiment of the present disclosure;
FIG. 15 is a diagram illustrating training a context-aware based answer modification module according to an exemplary embodiment of the present disclosure;
FIG. 16 is a flowchart illustrating a textual visual question-answering method implemented by a textual visual question-answering model according to an exemplary embodiment of the present disclosure;
FIG. 17 is a block diagram illustrating a training apparatus 1700 of a textual visual question-answering model according to an exemplary embodiment of the present disclosure;
FIG. 18 is a block diagram illustrating a textual visual question-answering device implemented by a textual visual question-answering model, according to an exemplary embodiment of the present disclosure;
FIG. 19 shows experimental validation results of the present invention;
FIG. 20 shows experimental comparison results of the present invention with various prior art baseline methods;
FIG. 21 shows qualitative experimental results of the present invention with various prior art baseline methods;
FIG. 22 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Fig. 1 is an application scenario diagram illustrating a text visual question answering method and apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the text visual question answering method and apparatus provided by the present disclosure may be applied to the application environment shown in fig. 1. The system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. A user may use terminal devices 101, 102, 103 to interact with server 105 over network 104 to receive or send messages (e.g., text visual question and answer requests, image upload and download requests, audio-visual data search requests), and so forth. Various communication client applications, such as an audio playing application, a video and audio editing application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103. The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and capable of playing, recording and editing audio and video, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal device 101, 102, 103 is software, it may be installed in the electronic devices listed above, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or it may be implemented as a single software or software module. And is not particularly limited herein.
The terminal devices 101, 102, 103 may also be equipped with a component (e.g., a speaker) for converting an electric signal into sound to play the sound, and may also be equipped with a device (e.g., a microphone) for converting an analog audio signal into a digital audio signal to pick up the sound.
The server 105 may be a server providing various services, such as a background server providing support for multimedia applications or the like installed on the terminal devices 101, 102, 103. The background server can analyze and store the received text visual question-answering request and data such as related images and questions.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the text visual question-answering method provided by the embodiment of the present disclosure may be executed by a terminal device, may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the text visual question answering device can be arranged in the terminal equipment, the server or in both the terminal equipment and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 is a flowchart illustrating a training method of a text visual question-answering model according to an exemplary embodiment of the present disclosure. Fig. 3 is a schematic diagram illustrating a text visual question-and-answer model according to an exemplary embodiment of the present disclosure, wherein the text visual question-and-answer model includes a word representation acquisition module 310, an object representation acquisition module 320, a pre-trained language model 330, a first multimodal interaction network module 340, and an answer prediction module 350. The method for training the text visual question-answering model of fig. 2 is described below with reference to fig. 3.
In step S2010, the first training data set is processed by the text representation acquisition module 310 to obtain predicted semantic features corresponding to text regions in the first training data set and a predicted text recognition result. The first training data set is a data set comprising a plurality of training images, and may be, for example, the SynthText and Synth90K data sets, but the application is not limited thereto; the first training data set may be any other suitable data set. As shown in FIG. 3, the text representation acquisition module 310 includes an OCR module 311 and a text-dependent visual semantic mapping network (TVS) module 312. The OCR module 311 may be implemented with any existing OCR module; here, the OCR module 311 is already pre-trained and is therefore not described in detail. However, the present application differs markedly from existing text visual question answering (TextVQA) approaches in the role of the OCR module: the application holds that the OCR module should not be merely a preprocessing step for TextVQA, but rather a key module that distinguishes TextVQA from traditional VQA. By fusing the OCR module 311 into the forward processing flow of TextVQA, the application obtains a more robust feature representation and effectively corrects answer errors caused by OCR errors, thereby mitigating the influence of OCR accuracy on multimodal information reasoning and answer prediction. Experiments show that the invention is effective on multiple data sets, has clear advantages in robustness and accuracy over existing methods, and has great potential in real application scenarios. Step S2010 is described in detail below with reference to FIG. 4 and FIG. 5.
Fig. 4 is a block diagram illustrating a detailed configuration of the TVS module 312 according to an exemplary embodiment of the present disclosure. FIG. 5 is a flowchart illustrating a process of processing a first training data set by the word representation acquisition module 310 to obtain the predicted semantic features and predicted word recognition results according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the TVS module 312 includes a rectification module 3121, an encoding module 3122, an intermediate semantic module 3123, and a decoding module 3124.
In step S2011, the text region detected from the first training data set is corrected by the correction module 3121. Specifically, the OCR module 311 performs text detection on each training image in the first training data set to determine an image text region containing text, and then the rectification module 3121 performs image text rectification on the image text region, which may be performed by any existing image text rectification method, which is not specifically limited herein.
In step S2012, the encoding module 3122 encodes the text region of the image after text rectification to obtain visual features. Specifically, the encoding module 3122 may be implemented using a well-known encoding module; for example, as shown in fig. 6, the encoding module 3122 may be composed of a convolutional neural network (CNN) and a long short-term memory network (LSTM).
In step S2013, predicted semantic features are derived from the visual features by the first linear layer in the intermediate semantic module 3123. Specifically, as shown in fig. 4, the intermediate semantic module 3123 obtains the visual features from the encoding module 3122, and the first linear layer in the intermediate semantic module 3123 then performs a dimension transformation on the visual features to predict semantic features, thereby obtaining predicted semantic features directly derived from the visual features.
In step S2014, semantic information corresponding to the predicted semantic features is derived from the predicted semantic features by the second linear layer in the intermediate semantic module 3123. Specifically, the second linear layer in the intermediate semantic module 3123 obtains the predicted semantic features from the first linear layer, and then performs the dimension transformation on the predicted semantic features again, thereby obtaining semantic information corresponding to the predicted semantic features.
In step S2015, the predicted character recognition result is obtained by the decoding module 3124 according to the semantic information and the visual features. In particular, the decoding module 3124 may be composed of a recurrent neural network based on an attention mechanism, such as an attention-based gated recurrent unit (GRU) or long short-term memory network (LSTM), but the present invention is not limited thereto. The decoding module 3124 obtains the visual features from the encoding module 3122 and obtains, from the intermediate semantic module 3123, the semantic information produced by the second linear layer transformation, and then performs a decoding operation on the visual features and the semantic information to obtain the predicted character recognition result.
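For concreteness, the following minimal sketch shows one way the rectification-encoding-semantics-decoding pipeline of steps S2012 to S2015 could be assembled. PyTorch is assumed, and all class names, dimensions and the simplified (attention-free) decoder are illustrative choices, not details taken from the present disclosure.

import torch
import torch.nn as nn

class TVSSketch(nn.Module):
    def __init__(self, feat_dim=512, sem_dim=300, vocab_size=97):
        super().__init__()
        # Encoding module 3122: a small CNN followed by a bidirectional LSTM.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width as a sequence
        )
        self.lstm = nn.LSTM(feat_dim, feat_dim // 2, bidirectional=True, batch_first=True)
        # Intermediate semantic module 3123: two linear layers.
        self.sem_linear1 = nn.Linear(feat_dim, sem_dim)  # visual features -> predicted semantic features (S2013)
        self.sem_linear2 = nn.Linear(sem_dim, feat_dim)  # semantic features -> semantic information (S2014)
        # Decoding module 3124: a GRU decoder (the attention mechanism is omitted here for brevity).
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, rectified_text_region):
        # rectified_text_region: (B, 3, H, W) crop output by the rectification module 3121
        feat = self.cnn(rectified_text_region)            # (B, C, 1, W')
        feat = feat.squeeze(2).permute(0, 2, 1)           # (B, W', C) sequence of visual features
        visual, _ = self.lstm(feat)
        semantic = self.sem_linear1(visual.mean(dim=1))   # predicted semantic feature (B, sem_dim)
        sem_info = self.sem_linear2(semantic)             # semantic information (B, feat_dim)
        dec_out, _ = self.decoder(visual, sem_info.unsqueeze(0))  # decoder initialised with the semantic info
        return semantic, self.classifier(dec_out)         # predicted character recognition logits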
Referring back to fig. 2, in step S2020, a first loss function, i.e. a semantic loss, is calculated based on the predicted semantic features and the real semantic features corresponding to the image text regions. Specifically, the first loss function Lsem may be calculated according to equation (1) below.
Lsem = 1 - cos(e_pred, e_gt)   (1)

where e_pred represents the predicted semantic features (i.e., the predicted semantic feature vector) and e_gt represents the real semantic features (i.e., the real semantic feature vector) obtained by encoding the real recognition result with the FastText model.
In step S2030, a second loss function, i.e., a recognition loss, is calculated based on the predicted character recognition result and the actual character recognition result corresponding to the image character region. Specifically, the second loss function Lrec may be calculated according to equation (2) below, i.e., the cross entropy of the predicted and actual word recognition results is calculated.
Lrec = -∑_{t=1}^{N} log P(y_t | y_1, …, y_{t-1}, I)   (2)

where y_1, …, y_t, …, y_N represent the real character recognition result (i.e., the real character sequence, N being the total number of real characters) and I represents the input training image.
In step S2040, the first loss function and the second loss function are used to adjust the model parameters of the text representation acquisition module 310 to train the text representation acquisition module 310. Specifically, since the OCR module 311 is selected from existing pre-trained OCR modules and requires no further training, it is only necessary to adjust the model parameters of the TVS module 312 according to the first loss function and the second loss function calculated in steps S2020 and S2030 above, until the values of the first loss function and the second loss function no longer decrease within a predetermined length of training time.
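A minimal sketch of the two training losses of steps S2020 and S2030 is given below; PyTorch and the variable names are assumptions, and the equal weighting of the two losses is an illustrative choice rather than something prescribed by the present disclosure.

import torch
import torch.nn.functional as F

def tvs_training_loss(pred_semantic, gt_semantic, char_logits, gt_char_ids):
    # Equation (1): Lsem = 1 - cos(e_pred, e_gt); gt_semantic is the FastText
    # encoding of the real recognition result.
    l_sem = 1.0 - F.cosine_similarity(pred_semantic, gt_semantic, dim=-1).mean()
    # Equation (2): cross entropy between the predicted and real character sequences.
    l_rec = F.cross_entropy(char_logits.flatten(0, 1), gt_char_ids.flatten())
    return l_sem + l_rec

# Example shapes: pred_semantic (B, 300), gt_semantic (B, 300),
# char_logits (B, T, vocab_size), gt_char_ids (B, T).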
In step S2050, the second training data set is processed by the object representation acquisition module 320 to obtain predicted object class embedding vectors for objects in the second training data set. Step S2050 is described in detail below with reference to fig. 7 and 8.
Fig. 7 is a block diagram showing a detailed configuration of the object representation acquisition module 320 according to an exemplary embodiment of the present disclosure. Fig. 8 is a flowchart illustrating a process of processing a second training data set by the object representation acquisition module 320 to obtain the predicted object class embedding vector according to an exemplary embodiment of the present disclosure. Specifically, the object representation acquisition module 320 extends the original object recognition model Faster R-CNN with an object class embedding vector prediction module 326. As shown in fig. 7, the object representation acquisition module 320 includes a convolutional neural network 321, a region candidate network 323, a region of interest feature pooling module 322, an object detection head module 324, an object attribute prediction module 325, and an object class embedding vector prediction module 326.
In step S2051, the convolution operation is performed on the second training data set by the convolutional neural network 321 to obtain an image visual feature. In particular, the second training data set employed in training object representation acquisition module 320 is a data set that includes a plurality of training images, which may be, for example, a Visual Genome data set, although the invention is not limited in this respect and any suitable data set may be employed to train object representation acquisition module 320.
In step S2052, object candidate regions are obtained from the image visual characteristics by the region candidate network 323.
In step S2053, the object candidate region and the image visual feature are processed by the region of interest feature pooling module 322 to obtain the visual feature of the object candidate region.
In step S2054, the object class embedding vector prediction module 326 processes the visual features of the object candidate region to obtain the object class embedding vector, which may also be referred to as semantic features of the object.
Referring back to fig. 2, in step S2060, a third loss function is calculated according to the predicted object class embedding vector and the real object class vector of the object, and the model parameters of the object representation acquisition module 320 are adjusted according to the third loss function to train the object representation acquisition module 320.
Specifically, the third loss function L_obj_label_emb may be calculated according to equation (3) below, i.e., the cosine distance between the object class embedding vector and the real object class vector of the object (i.e., the FastText semantic features of the object class name) is calculated as the training loss:

Lobj_label_emb = 1 - cos(e_pred, e_gt)   (3)

where e_pred represents the object class embedding vector and e_gt represents the real object class vector of the object.
In addition, three other loss functions are used in training the object representation acquisition module 320: the loss function related to object candidate region prediction required by the region candidate network 323, the loss function related to the modified object region prediction required by the object detection head module 324, and the loss function related to object attribute prediction required by the object attribute prediction module 325. Therefore, when the object representation acquisition module 320 is trained, the object detection head module 324 processes the visual features of the object candidate regions to determine modified object regions of the object candidate regions as well as the visual features and the position features of the object regions, wherein the modified object regions are determined by the object detection head module 324 according to confidence scores obtained when processing the visual features of the object candidate regions. Likewise, when the object representation acquisition module 320 is trained, the object attribute prediction module 325 processes the visual features of the object candidate regions to obtain object attribute prediction scores.
Specifically, the loss function Lrpn related to object candidate region prediction, which is required by the region candidate network 323 when the object representation acquisition module 320 is trained, may employ the following equation (4):

Lrpn({p_i}, {t_i}) = (1/N_cls) ∑_i Lcls(p_i, p_i*) + λ1 (1/N_reg) ∑_i p_i* Lreg(t_i, t_i*)   (4)

where p_i is the probability that the anchor frame is predicted to be an object; p_i* is the true classification label of the anchor frame; Lcls is the classification loss, for which a cross-entropy loss may be used; N_cls is the batch size; t_i is the coordinates of the predicted object frame; t_i* is the coordinates of the real object frame; Lreg is the bounding-box regression loss, for which a smooth L1 loss may be used; N_reg is the number of anchor frame positions; and λ1 is a weight parameter.
In addition, the loss function Ldet related to the modified object region prediction, which is needed by the object detection head module 324 when the object representation acquisition module 320 is trained, may employ the following equation (5):

Ldet(p, u, t^u, v) = Lcls(p, u) + λ2 [u ≥ 1] Lloc(t^u, v)   (5)

where p is the classification probability of the candidate frame over all object classes; u is the true classification label of the candidate frame; Lcls is the classification loss, for which a cross-entropy loss may be employed; t^u is the coordinates of the predicted object frame; v is the coordinates of the real object frame; Lloc is the bounding-box regression loss, for which the smooth L1 loss may be used; and λ2 is a weight parameter.
In addition, the loss function Lattr related to object attribute prediction, which is needed by the object attribute prediction module 325 when the object representation acquisition module 320 is trained, may employ the following equation (6):

Lattr = -∑_k [ y_k log(σ(l_k)) + (1 - y_k) log(1 - σ(l_k)) ]   (6)

where k is the index of the object attribute categories; y_k is the true attribute category label; l_k is the predicted attribute classification score; and σ is the sigmoid activation function.
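The following hedged sketch illustrates how the attribute loss of equation (6) and the embedding loss of equation (3) could be computed and combined with the Faster R-CNN losses of equations (4) and (5); the unweighted sum of the four losses is an assumption, since the present disclosure only lists which losses are used.

import torch
import torch.nn.functional as F

def attribute_loss(attr_logits, attr_labels):
    # Equation (6): binary cross entropy over attribute categories with a sigmoid.
    return F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())

def class_embedding_loss(pred_embed, gt_embed):
    # Equation (3): cosine distance to the FastText vector of the class name.
    return 1.0 - F.cosine_similarity(pred_embed, gt_embed, dim=-1).mean()

def total_object_loss(l_rpn, l_det, attr_logits, attr_labels, pred_embed, gt_embed):
    # l_rpn and l_det (equations (4) and (5)) come from the underlying
    # Faster R-CNN implementation and are taken as given here.
    return (l_rpn + l_det
            + attribute_loss(attr_logits, attr_labels)
            + class_embedding_loss(pred_embed, gt_embed))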
In step S2070, visually enhanced word representations are obtained by the trained word representation acquisition module 310 for the third training data set, semantically oriented object representations are obtained by the trained object representation acquisition module 320 for the third training data set, and question representations are obtained by the pre-trained language model 330 for the question sentences in the third training data set.
Specifically, when the first multi-modal interaction network module 340 and the answer prediction module 350 are trained, the trained text representation obtaining module 310, the object representation obtaining module 320 and the pre-trained language model 330 are required to obtain a data set for training the first multi-modal interaction network module 340 and the answer prediction module 350. Wherein the third training data set may be a training data set including a plurality of training images and question sentences corresponding to each of the training images. The operation of obtaining visually enhanced textual and semantically oriented object representations is described in detail below with reference to fig. 9 and 10.
FIG. 9 is a flow chart illustrating a process of obtaining a visually enhanced textual representation according to an exemplary embodiment of the present disclosure. FIG. 10 is a flowchart illustrating a process of obtaining semantically-oriented object representations according to an exemplary embodiment of the present disclosure.
As shown in fig. 9, in step S910, a text region containing text is detected from the training image in the third training data set by the OCR module 311.
In step S920, a first semantic feature is obtained by the OCR module 311 performing character recognition on the detected character region and encoding the recognition result. Since the operations employed in step S910 and step S920 may be implemented according to an existing OCR module, they will not be described in detail herein.
In step S930, a second semantic feature directly derived from the visual features of the text region is extracted from the detected text region by the TVS module 312. Specifically, the text region is first corrected by the rectification module 3121, the corrected text region is then encoded by the encoding module 3122 to obtain the visual features, and finally the intermediate semantic module 3123 predicts the second semantic feature from the visual features. In other words, step S930 is similar to steps S2011, S2012 and S2013 described above with reference to fig. 5: the process described with reference to fig. 5 trains the TVS module 312, whereas step S930 uses the trained TVS module to obtain semantic features. In step S930, only the rectification module 3121, the encoding module 3122 and the first linear layer of the intermediate semantic module 3123 in fig. 4 are actually used; that is, the second semantic feature is the output of the first linear layer of the intermediate semantic module 3123.
In step S940, the first semantic feature and the second semantic feature are combined to obtain the visually enhanced word representation. The process of obtaining semantically-oriented object representations for the third training data set by trained object representation acquisition module 320 is described below with reference to FIG. 10.
As shown in fig. 10, in step S1010, a convolution operation is performed on the training image from the third training data set by the convolutional neural network 321 to obtain an image visual feature.
In step S1020, object candidate regions are obtained from the image visual characteristics by the region candidate network 323.
In step S1030, the object candidate region and the image visual feature are processed by the region of interest feature pooling module 322 to obtain the visual feature of the object candidate region.
In step S1040, the object detection head module 324 processes the visual features of the object candidate region to determine a modified object region of the object candidate region and the visual features and the position features of the object region.
In step S1050, the visual features of the object candidate region are processed by the object class embedding vector prediction module 326 to obtain an object class embedding vector.
In step S1060, the visual features of the object region, the position features of the object region, and the object class embedding vector are combined to obtain the semantically oriented object representation. In this way, a visual object is treated as a special kind of word, and visual objects with semantic relations have higher feature similarity, so that the model can reasonably understand the semantic information of the objects and further assist in analyzing the text information in the image.
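A hedged sketch of the fusion in step S1060 is shown below; projecting the three features to a common dimension and summing them is an assumption, since the present disclosure only states that the three features are combined, and all dimensions are illustrative.

import torch
import torch.nn as nn

class ObjectTokenFusion(nn.Module):
    def __init__(self, vis_dim=2048, pos_dim=4, cls_dim=300, out_dim=768):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)   # visual features of the object region
        self.pos_proj = nn.Linear(pos_dim, out_dim)   # normalised box coordinates
        self.cls_proj = nn.Linear(cls_dim, out_dim)   # object class embedding vector
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, vis_feat, pos_feat, cls_embed):
        # vis_feat: (N, vis_dim), pos_feat: (N, pos_dim), cls_embed: (N, cls_dim)
        return self.norm(self.vis_proj(vis_feat)
                         + self.pos_proj(pos_feat)
                         + self.cls_proj(cls_embed))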
Returning to fig. 2, in step S2080, the first multimodal interaction network module 340 and the answer prediction module 350 are trained according to the visually enhanced character representations, the semantically oriented object representations, the question representations, and the previous decoding vectors. Step S2080 is described in detail below with reference to fig. 11.
Fig. 11 is a flowchart illustrating a process of training the first multimodal interaction network module 340 and the answer prediction module 350 according to an exemplary embodiment of the present disclosure.
In step S1110, the visually enhanced word representations, the semantically oriented object representations, the question representation, and the previous decoding vector are input to the first multimodal interaction network module 340 to obtain word representation enhancement features corresponding to the visually enhanced word representations and a current predictive decoding vector corresponding to the previous decoding vector. The first multimodal interaction network module 340 may be implemented using a Transformer network model; since the internal structure of the Transformer network model is not modified in the present application, its internal training process is not described here. The Transformer network model produces the same number of outputs as inputs, so when the visually enhanced word representations, the semantically oriented object representations, the question representation, and the previous decoding vector are input into the Transformer network model, its outputs are, respectively, the word representation enhancement features corresponding to the visually enhanced word representations, the object representation enhancement features corresponding to the semantically oriented object representations, the question representation enhancement features corresponding to the question representation, and the current predictive decoding vector corresponding to the previous decoding vector. However, only the word representation enhancement features and the current predictive decoding vector are used when the answer prediction module 350 is subsequently trained; the other two outputs are not used.
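As an illustration of this input-output correspondence, the sketch below feeds the four kinds of representations through a standard Transformer encoder and splits the output back into the four groups; the layer sizes are assumptions, and treating the previous decoding vectors as a token sequence is an illustrative simplification.

import torch
import torch.nn as nn

d_model = 768
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True)
interaction_net = nn.TransformerEncoder(layer, num_layers=4)  # stands in for module 340

def multimodal_interaction(word_tokens, object_tokens, question_tokens, prev_decoding):
    # Each argument is a (B, length, d_model) token sequence.
    lengths = [t.size(1) for t in (word_tokens, object_tokens, question_tokens, prev_decoding)]
    joint = torch.cat([word_tokens, object_tokens, question_tokens, prev_decoding], dim=1)
    out = interaction_net(joint)  # one output position per input position
    word_enh, obj_enh, q_enh, cur_decoding = torch.split(out, lengths, dim=1)
    # Only the first and last groups are consumed by the answer prediction module 350.
    return word_enh, cur_decoding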
In step S1120, the word characterization enhancement features and the current predictive decoding vector are input to the answer prediction module 350 to obtain a first prediction score and a second prediction score corresponding to the question sentence. As shown in fig. 12, the answer prediction module 350 includes a bilinear interaction network 351, a linear classification layer 352, and a determination module 353.
Specifically, first, the word characterization enhancement features and the current predictive decoding vector are processed using the bilinear interactive network 351 to determine the first prediction score for the case where the predicted answer is image text in the image. More specifically, the word characterization enhancement features are linearly processed by the first linear layer in the bilinear interactive network 351, the current predictive decoding vector is linearly processed by the second linear layer in the bilinear interactive network 351, and the similarity between the two linear processing results, i.e., the first prediction score, is then calculated.
Meanwhile, the current predictive decoding vector is subjected to a linear classification process using the linear classification layer 352 to determine a second predictive score when the predicted answer is a dictionary word in a fixed dictionary.
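A minimal sketch of the two scoring branches just described is given below; PyTorch, the dimensions and the dot-product form of the bilinear interaction are assumptions for illustration.

import torch
import torch.nn as nn

class AnswerPredictionSketch(nn.Module):
    def __init__(self, d_model=768, dict_size=5000):
        super().__init__()
        self.word_linear = nn.Linear(d_model, d_model)        # first linear layer of network 351
        self.dec_linear = nn.Linear(d_model, d_model)         # second linear layer of network 351
        self.dict_classifier = nn.Linear(d_model, dict_size)  # linear classification layer 352

    def forward(self, word_enh, cur_decoding):
        # word_enh: (B, M, d) enhancement features of the M image words
        # cur_decoding: (B, d) current predictive decoding vector
        w = self.word_linear(word_enh)
        d = self.dec_linear(cur_decoding).unsqueeze(-1)
        first_scores = torch.bmm(w, d).squeeze(-1)            # similarity with each image word
        second_scores = self.dict_classifier(cur_decoding)    # score of each fixed-dictionary word
        return first_scores, second_scores

At inference time, the determination module 353 would then take the word corresponding to the maximum value across the two score vectors as the predicted answer.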
In step S1130, a fourth loss function is calculated according to the first and second prediction scores and the first and second real scores of the standard answer of the question sentence, and model parameters of the first multi-modal interaction network module 340 and the answer prediction module 350 are adjusted according to the fourth loss function to train the first multi-modal interaction network module 340 and the answer prediction module 350.
Specifically, according to the standard answer of the question sentence, the determination module 353 may calculate a first real score for the case where the standard answer is image text in the image and a second real score for the case where the standard answer is a dictionary word in the fixed dictionary. A classification loss between the first real score and the first prediction score and a classification loss between the second real score and the second prediction score are then calculated, and the model parameters of the first multimodal interaction network module 340 and the answer prediction module 350 are adjusted according to the two classification losses to train the first multimodal interaction network module 340 and the answer prediction module 350.
Through the training method described above with reference to fig. 2, the trained text visual question-answering model can be obtained, and then text visual question-answering can be performed.
In addition, in order to make the prediction results of the text visual question-answering model more accurate, as shown in fig. 13, the text visual question-answering model may further include, in addition to the text representation acquisition module 310, the object representation acquisition module 320, the pre-trained language model 330, the first multi-modal interaction network module 340, and the answer prediction module 350 in fig. 3, a context-aware answer modification module 360. The context-aware answer modification module 360 can adaptively modify the predicted answer by using the rich context information of the predicted answer. Therefore, the training method described in fig. 2 may further include: training the context-aware answer modification module 360 with a fourth training data set. The fourth training data set includes the standard answer of each question sentence provided by a predetermined data set, which includes the question sentences and an input image corresponding to each question sentence, and the recognition results of a plurality of external OCR modules; for example, the predetermined data set may be the TextVQA data set. The recognition results of the plurality of external OCR modules are obtained as follows: when the output of the answer prediction module 350 for a question sentence points to image text in the corresponding input image, the plurality of external OCR modules process the image text to obtain their recognition results. This is described below with reference to fig. 14 and 15.
Fig. 14 is a diagram illustrating a process of training the context-aware based answer modification module 360, according to an exemplary embodiment of the present disclosure. FIG. 15 is a diagram illustrating training of a context-aware answer modification module 360, according to an example embodiment of the present disclosure.
As shown in fig. 14, in step S1410, an intersection ratio IOU between the standard answer of each question sentence and the recognition result of each of the plurality of external OCR modules recognizing the input image corresponding to each question is calculated.
In step S1420, the recognition result (i.e., the "candidate answer" in fig. 15) obtained by each external OCR module recognizing the input image corresponding to each question, together with its context information (i.e., the "question" information, "text" information, and "object" information in fig. 15), is input to the second multimodal interaction network module 361 for fusion interaction. The "question" information in the context information denotes the question representation obtained by processing the question sentence with the language model 330 in fig. 3; the "text" information denotes the visually enhanced word representations corresponding to the other words that remain after the words in the image are recognized by the OCR module 311 in fig. 3 and the "candidate answer" is removed; and the "object" information denotes the semantically oriented object representations corresponding to the objects whose IOU with the position of the "candidate answer" is greater than a predetermined threshold. The second multimodal interaction network module 361 is a Transformer network model; since the internal structure of the Transformer network model is not modified in the present application, it is not described in detail herein.
In step S1430, the output vector of the second multimodal interaction network module 361 corresponding to each recognition result is input to the linear mapping layer 362 to obtain a correlation score for each recognition result, wherein the correlation score is a value in the interval [0, 1].
In step S1440, the second multimodal interaction network module 361 is trained by adjusting its model parameters according to the correlation score of each recognition result and the intersection ratio (IOU) of each recognition result calculated in step S1410.
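The sketch below illustrates one possible shape of the context-aware answer modification module 360 as described in steps S1420 and S1430: a second Transformer fuses each candidate answer with its context, and a linear mapping followed by a sigmoid yields a score in [0, 1]. All names and dimensions are assumptions.

import torch
import torch.nn as nn

class CRMSketch(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=num_layers)  # stands in for module 361
        self.score_head = nn.Linear(d_model, 1)                                 # stands in for layer 362

    def forward(self, candidate_token, question_tokens, text_tokens, object_tokens):
        # candidate_token: (B, 1, d); the remaining arguments are (B, length, d) context sequences.
        joint = torch.cat([candidate_token, question_tokens, text_tokens, object_tokens], dim=1)
        out = self.interaction(joint)
        # Score each candidate from its own output position and squash it into [0, 1].
        return torch.sigmoid(self.score_head(out[:, 0]))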
The training of the answer modification module 360 based on context awareness can be completed through the processes described above with reference to fig. 14 and 15, so that the text visual question-answering model can correct unreasonable OCR recognition results, and the reliability of the answer is enhanced. A text visual question-answering method for performing text visual question-answering by applying the trained text visual question-answering model will be described with reference to fig. 16.
Fig. 16 is a flowchart illustrating a text visual question-answering method implemented by a text visual question-answering model according to an exemplary embodiment of the present disclosure, which includes a word representation acquisition module 310, an object representation acquisition module 320, a pre-trained language model 330, a first multimodal interaction network module 340, and an answer prediction module 350, as described above with reference to fig. 3.
As shown in fig. 16, in step S1610, visually enhanced word representations are obtained from an image by the word representation acquisition module 310, wherein the word representation acquisition module 310 includes an OCR module 311 and a TVS module 312. The text visual question-answering model has two inputs, namely the image and the question sentence corresponding to the image; the image is input to the word representation acquisition module 310 and the object representation acquisition module 320 of the text visual question-answering model, respectively, and the question sentence is input to the language model 330. Step S1610 is described in detail below.
First, a text region containing text is detected from the image by the OCR module 311; the first semantic feature is obtained by the OCR module 311 performing character recognition on the detected text region and encoding the recognition result; the second semantic feature, directly derived from the visual features of the text region, is extracted from the detected text region by the TVS module 312; and the visually enhanced word representation is then obtained by combining the first semantic feature and the second semantic feature. Here, when the trained word representation acquisition module 310 is used, only the rectification module 3121, the encoding module 3122, and the intermediate semantic module 3123 included in the TVS module 312 are used.
Specifically, the step of extracting, by the TVS module 312, the second semantic feature directly derived from the visual features of the text region from the detected text region includes: correcting the text of the text region by the rectification module 3121; encoding the corrected text region by the encoding module 3122 to obtain the visual features; and predicting the second semantic feature from the visual features by the intermediate semantic module 3123. Since the process is the same as the process described above with reference to fig. 9, it is not repeated here.
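For illustration, the sketch below fuses the OCR-derived first semantic feature and the TVS-derived second semantic feature into a single visually enhanced word representation; projecting and summing the two features is an assumption, since the present disclosure only states that they are combined, and all dimensions are illustrative.

import torch
import torch.nn as nn

class WordTokenFusion(nn.Module):
    def __init__(self, ocr_dim=300, tvs_dim=300, out_dim=768):
        super().__init__()
        self.ocr_proj = nn.Linear(ocr_dim, out_dim)  # first semantic feature (OCR module 311)
        self.tvs_proj = nn.Linear(tvs_dim, out_dim)  # second semantic feature (TVS module 312)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, ocr_semantic, tvs_semantic):
        # ocr_semantic, tvs_semantic: (M, dim) features for the M detected image words
        return self.norm(self.ocr_proj(ocr_semantic) + self.tvs_proj(tvs_semantic))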
In step S1620, semantically oriented object representations are obtained from the image by the object representation obtaining module 320, as shown in fig. 7, the object representation obtaining module 320 includes a convolutional neural network 321, a region candidate network 323, a region-of-interest feature pooling module 322, an object detection head module 324, and an object class embedding vector prediction module 326.
The step of obtaining the semantically oriented object representation from the image by the object representation acquisition module 320 includes: performing a convolution operation on the image by the convolutional neural network 321 to obtain image visual features; obtaining object candidate regions from the image visual features by the region candidate network 323; processing the object candidate regions and the image visual features by the region of interest feature pooling module 322 to obtain visual features of the object candidate regions; processing the visual features of the object candidate regions by the object detection head module 324 to determine modified object regions of the object candidate regions and the visual and position features of the object regions; processing the visual features of the object candidate regions by the object class embedding vector prediction module 326 to obtain object class embedding vectors; and combining the visual features of the object regions, the position features of the object regions, and the object class embedding vectors to obtain the semantically oriented object representations. Since this process is the same as the process described above with reference to fig. 10, it is not repeated here. Obtaining the semantically oriented object representations from the image by the object representation acquisition module 320 alleviates the problem of the large semantic gap between the object modality, which is represented by visual features, and the text and question modalities.
In step S1630, a question representation is obtained from the question statement corresponding to the image by the pre-trained language model 330. The language model 330 employed herein may be any language model available in the art and, therefore, is not defined and described in detail herein.
In step S1640, word representation enhancement features corresponding to the visually enhanced word representations and a current predictive decoding vector corresponding to the previous decoding vector are obtained from the visually enhanced word representations, the semantically oriented object representations, the question representation and the previous decoding vector by the first multimodal interaction network module 340; that is, the word representation enhancement features and the current predictive decoding vector are obtained through intra- and inter-modal interaction of the visually enhanced word representations, the semantically oriented object representations, the question representation and the previous decoding vector within the first multimodal interaction network module 340. Since the process is similar to the process described above with reference to fig. 11, it is not described in detail here.
In step S1650, a predicted answer corresponding to the question sentence is obtained from the word representation enhancement features and the current predictive decoding vector through the answer prediction module 350. As shown in fig. 12, the answer prediction module 350 includes a bilinear interactive network 351, a linear classification layer 352, and a determination module 353, and the step of obtaining the predicted answer from the word representation enhancement features and the current predictive decoding vector through the answer prediction module 350 includes: processing the word representation enhancement features and the current predictive decoding vector with the bilinear interactive network 351 to determine a first score for the case where the predicted answer is image text in the image; performing a linear classification process on the current predictive decoding vector with the linear classification layer 352 to determine a second score for the case where the predicted answer is a dictionary word in a fixed dictionary; and determining, by the determination module 353, the one of the image text and the dictionary word corresponding to the maximum of the first score and the second score as the predicted answer. Furthermore, the step of processing the word representation enhancement features and the current predictive decoding vector with the bilinear interactive network 351 to determine the first score includes: linearly processing the word representation enhancement features with the first linear layer in the bilinear interactive network 351 to obtain a first linear processing result; linearly processing the current predictive decoding vector with the second linear layer in the bilinear interactive network 351 to obtain a second linear processing result; and calculating the similarity between the first linear processing result and the second linear processing result as the first score.
Further, the text visual question-answering method may include, in addition to the above steps S1610 to S1650: when image text is determined to be the predicted answer, modifying the predicted answer with the CRM 360 to obtain a modified predicted answer. In other words, when the predicted answer (i.e., the candidate answer in fig. 15) comes from the image text, the CRM 360 can be used to correct the predicted answer so as to obtain a more accurate predicted answer.
Specifically, the step of modifying the predicted answer using the CRM 360 to obtain the modified predicted answer includes: processing a plurality of pre-stored external OCR recognition results and the predicted answer to obtain a score value corresponding to each of the plurality of external OCR recognition results; determining an external OCR recognition result having a highest score value among the plurality of external OCR recognition results as the corrected predicted answer. The above process is similar to the contents of steps S1410 to S1430 described with reference to fig. 14, and therefore, a description thereof will not be repeated here.
Fig. 17 is a block diagram illustrating a training apparatus 1700 of a text visual question-and-answer model according to an exemplary embodiment of the present disclosure, as described above with reference to fig. 3, the text visual question-and-answer model includes a pre-trained language model 330, a word representation acquisition module 310, an object representation acquisition module 320, a first multimodal interaction network module 340, and an answer prediction module 350, or, as shown in fig. 13, the text visual question-and-answer model may include a context-aware-based answer modification module (CRM)360 in addition to the pre-trained language model 330, the word representation acquisition module 310, the object representation acquisition module 320, the first multimodal interaction network module 340, and the answer prediction module 350.
As shown in fig. 17, training apparatus 1700 includes a first training unit 1710, a second training unit 1720, and a third training unit 1730.
The first training unit 1710 may be configured to process the first training data set by the word representation acquisition module 310 to obtain a predicted semantic feature and a predicted word recognition result corresponding to an image word region in the first training data set, calculate a first loss function according to the predicted semantic feature and a real semantic feature corresponding to the image word region, calculate a second loss function according to the predicted word recognition result and a real word recognition result corresponding to the image word region, and train the word representation acquisition module 310 by adjusting model parameters of the word representation acquisition module 310 using the first loss function and the second loss function.
The second training unit 1720 may be configured to train the object representation acquisition module 320 by processing the second training data set by the object representation acquisition module 320 to obtain a predicted object class embedding vector for an object in the second training data set, calculating a third loss function from the predicted object class embedding vector and a true object class vector of the object, and adjusting model parameters of the object representation acquisition module 320 according to the third loss function.
The third training unit 1730 may be configured to obtain visually enhanced word tokens for the third training data set by the trained word token obtaining module 310, obtain semantically oriented object tokens for the third training data set by the trained object token obtaining module 320, obtain question tokens for question sentences by the pre-trained language model 330, and train the first multimodal interaction network module 340 and the answer prediction module 350 according to the visually enhanced word tokens, the semantically oriented object tokens, the question tokens, and the previous decoded vectors.
Furthermore, the training apparatus 1700 may further comprise a fourth training unit (not shown) which may be configured to train the context-aware based answer modification module (CRM)360 with a fourth training data set, wherein the fourth training data set comprises a standard answer for each question sentence provided by a predetermined data set and recognition results of a plurality of external OCR modules, wherein the predetermined data set comprises the question sentence and an input image corresponding to each question sentence, wherein the recognition results of the plurality of external OCR modules are: when the output of the answer prediction module 350 for each question sentence points to the image text in the corresponding input image, the external OCR modules process the image text to obtain the recognition result.
Since the training method shown in fig. 2 may be performed by the training apparatus 1700 shown in fig. 17, and the first training unit 1710 may perform operations corresponding to steps S2010 to S2040 of fig. 2, the second training unit 1720 may perform operations corresponding to steps S2050 and S2060 of fig. 2, and the third training unit 1730 may perform operations corresponding to steps S2070 and S2080 of fig. 2, and further, an operation of training the CRM 360 with the fourth training data set (not shown in fig. 2) included in the training method in fig. 2 may be performed by the fourth training unit in the training apparatus 1700, any relevant details related to the operations performed by the respective units or modules of the training apparatus 1700 may be referred to the corresponding description about fig. 2, and are not repeated here.
Fig. 18 is a block diagram illustrating a text visual question-answering apparatus 1800 implemented by a text visual question-answering model according to an exemplary embodiment of the present disclosure, wherein the text visual question-answering model includes a pre-trained language model 330, a word representation acquisition module 310, an object representation acquisition module 320, a first multimodal interaction network module 340, and an answer prediction module 350, as described above with reference to fig. 3, or, as shown in fig. 13, may include a context-aware-based answer modification module (CRM)360 in addition to the pre-trained language model 330, the word representation acquisition module 310, the object representation acquisition module 320, the first multimodal interaction network module 340, and the answer prediction module 350.
As shown in fig. 18, the text visual question answering apparatus 1800 may include a word representation acquiring unit 1810, an object representation acquiring unit 1820, a question representation acquiring unit 1830, a first multimodal interaction network unit 1840, and an answer predicting unit 1850.
The word representation acquisition unit 1810 may be configured to obtain visually enhanced word representations from the image by the word representation acquisition module 310. The object representation acquisition unit 1820 may be configured to obtain semantically oriented object representations from the image by the object representation acquisition module 320. The question representation acquisition unit 1830 may be configured to obtain a question representation from the question statement corresponding to the image through the pre-trained language model 330. The first multimodal interaction network unit 1840 may be configured to obtain, through the first multimodal interaction network module 340, word representation enhancement features corresponding to the visually enhanced word representations and a current predictive decoding vector corresponding to the previous decoding vector from the visually enhanced word representations, the semantically oriented object representations, the question representation, and the previous decoding vector. The answer prediction unit 1850 may be configured to obtain a predicted answer corresponding to the question statement from the word representation enhancement features and the current predictive decoding vector through the answer prediction module 350.
In addition, the text-visual question answering apparatus 1800 may further include a context-aware-based answer modification unit (not shown), and the context-aware-based answer modification unit may be configured to modify the predicted answer to obtain a modified predicted answer when the image text is determined to be the predicted answer.
Since the text visual question-answering method of fig. 16 can be performed by the text visual question-answering apparatus 1800 shown in fig. 18, the word representation acquiring unit 1810, the object representation acquiring unit 1820, the question representation acquiring unit 1830, the first multimodal interaction network unit 1840, and the answer predicting unit 1850 can perform operations corresponding to steps S1610, S1620, S1630, S1640, and S1650 in fig. 16, respectively. In addition, the operation of modifying the predicted answer with the CRM 360 to obtain the modified predicted answer when the image text is determined to be the predicted answer, which is part of the method although not shown in fig. 16, may be performed by the context-aware answer modification unit in the text visual question-answering apparatus 1800. Therefore, any relevant details concerning the operations performed by the units or modules of the text visual question-answering apparatus 1800 may be found in the corresponding description of fig. 16 and are not repeated here.
Further, it should be noted that although the training apparatus 1700 and the text visual question-answering apparatus 1800 are described above as being divided into units for performing the respective processes, it is clear to those skilled in the art that the processes performed by the units may be performed without any specific division of the units by the training apparatus 1700 and the text visual question-answering apparatus 1800 or without explicit demarcation between the units. In addition, the training apparatus 1700 and the text visual question-answering apparatus 1800 may further include a communication unit (not shown), an audio playing unit (not shown), a processing unit (not shown), and a storage unit (not shown), among others.
Fig. 19 shows the experimental verification results of the present invention.
In the experiments on the text visual question-answering method and the text visual question-answering device proposed by the present invention, a TextVQA dataset may be used, wherein the TextVQA dataset includes 28408 images and 45336 questions. The validation set and test set employed in the experiments were each a partial data set randomly selected from the TextVQA data set. In the experiments, four models were compared: the first is a reference model (i.e., a model without any technique proposed by the present invention, such as the M4C model); the second is a model obtained by fusing the technique for obtaining visually enhanced text representations (TVS) of the present invention on the basis of M4C (i.e., M4C + TVS); the third is a model obtained by fusing the technique for obtaining semantically oriented object representations (SEO) of the present invention on the basis of M4C (i.e., M4C + SEO); and the fourth is a model obtained by fusing the TVS and SEO of the present invention as well as the context-aware answer correction module (CRM) on the basis of M4C (i.e., M4C + TVS + SEO + CRM).
As shown in fig. 19, table 1 shows the ablation experiment of the present invention on the TextVQA data set, and the results show that each part of TVS, SEO and CRM in the text visual question-answering method and text visual question-answering device proposed by the present invention can bring significant improvement.
Figure 20 shows experimental comparison results of the present invention with a number of prior art baseline methods.
Table 2 in FIG. 20 shows the comparison of the effect of the present invention and various prior methods on two data sets TextVQA and ST-VQA, where the ST-VQA data set includes 23038 images and 31791 questions. In addition, the TextVQA validation set and the TextVQA test set are two partial data sets randomly selected from the TextVQA data set, respectively, and the ST-VQA validation set is a partial data set randomly selected from the ST-VQA data set. The invention achieves very good performance on a plurality of data sets, and proves the superiority of the invention.
Fig. 21 shows qualitative experimental results of the present invention with various existing reference methods, from which it can be found that the present invention is more robust to text recognition results and the answer prediction is more accurate and complete.
Fig. 22 is a block diagram illustrating an electronic device 2200 according to an embodiment of the present disclosure. Referring to fig. 22, the electronic device 2200 may include at least one memory 2210 and at least one processor 2220, the at least one memory 2210 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 2220, perform a training method or a text visual question-answering method according to embodiments of the present disclosure.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. The electronic device need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) either individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integral to the processor, e.g., RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a training method or a text visual question-answering method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed on a computer apparatus, such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product including computer instructions which, when executed by a processor, implement a training method or a text visual question-answering method according to an exemplary embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a text visual question-answer model is characterized in that the text visual question-answer model comprises a pre-trained language model, a character representation acquisition module, an object representation acquisition module, a first multi-mode interaction network module and an answer prediction module, wherein the training method comprises the following steps:
processing the first training data set by a character representation acquisition module to obtain a predicted semantic feature corresponding to an image character region in the first training data set and a predicted character recognition result, calculating a first loss function according to the predicted semantic feature and a real semantic feature corresponding to the image character region, calculating a second loss function according to the predicted character recognition result and a real character recognition result corresponding to the image character region, and adjusting a model parameter of the character representation acquisition module by using the first loss function and the second loss function to train the character representation acquisition module;
processing the second training data set by an object representation acquisition module to obtain a predicted object class embedding vector of the object in the second training data set, calculating a third loss function according to the predicted object class embedding vector and a real object class vector of the object, and adjusting a model parameter of the object representation acquisition module according to the third loss function to train the object representation acquisition module;
the method comprises the steps that a trained character representation acquisition module acquires visually enhanced character representations aiming at a third training data set, a trained object representation acquisition module acquires semantically oriented object representations aiming at the third training data set, a pre-trained language model acquires problem representations aiming at problem sentences, and a first multi-mode interactive network module and an answer prediction module are trained according to the visually enhanced character representations, the semantically oriented object representations, the problem representations and a previous decoding vector.
2. A text visual question-answering method realized by a text visual question-answering model is characterized in that the text visual question-answering model comprises a character representation acquisition module, an object representation acquisition module, a pre-trained language model, a first multi-mode interaction network module and an answer prediction module, wherein the text visual question-answering method comprises the following steps:
obtaining visually enhanced text representations from an image by a text representation acquisition module, wherein the text representation acquisition module comprises an Optical Character Recognition (OCR) module and a text-dependent visual semantic mapping network (TVS) module;
obtaining semantically oriented object representations from the image by an object representation obtaining module;
obtaining a problem representation from a problem statement corresponding to the image through a pre-trained language model;
obtaining, by a first multimodal interaction network module, from the visually enhanced text representations, the semantically oriented object representations, the problem representations, and a previous decoded vector, text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoded vector corresponding to the previous decoded vector;
and obtaining a predicted answer corresponding to the question sentence from the text characterization enhanced features and the current predictive decoding vector through an answer prediction module.
3. The text visual question-answering method according to claim 2, characterized in that the answer prediction module comprises a bilinear interaction network, a linear classification layer and a determination module, and the step of obtaining the predicted answer from the text representation enhancement features and the current predictive decoding vector through the answer prediction module comprises:
processing the text representation enhancement features and the current predictive decoding vector by the bilinear interaction network to determine a first score for the case in which the predicted answer is image text in the image;
performing linear classification on the current predictive decoding vector by the linear classification layer to determine a second score for the case in which the predicted answer is a dictionary word in a fixed dictionary;
determining, by the determination module, whichever of the image text and the dictionary word corresponds to the maximum of the first score and the second score as the predicted answer.
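A minimal PyTorch sketch of the two-branch scoring of claim 3: a bilinear interaction scores each image text candidate against the current predictive decoding vector, a linear classification layer scores every word of the fixed dictionary, and the overall maximum decides the predicted answer. The dimensions and concrete layer choices are assumptions.

```python
import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    def __init__(self, feat_dim=768, dict_size=5000):
        super().__init__()
        self.bilinear = nn.Bilinear(feat_dim, feat_dim, 1)  # first score: answer is image text
        self.classifier = nn.Linear(feat_dim, dict_size)    # second score: answer is a dictionary word

    def forward(self, text_enh, cur_dec, ocr_tokens, dictionary):
        # text_enh: (num_ocr, feat_dim) text representation enhancement features
        # cur_dec:  (1, feat_dim) current predictive decoding vector
        s1 = self.bilinear(text_enh, cur_dec.expand_as(text_enh)).squeeze(-1)  # (num_ocr,)
        s2 = self.classifier(cur_dec).squeeze(0)                               # (dict_size,)
        scores = torch.cat([s1, s2])
        idx = int(scores.argmax())                    # maximum over both branches
        if idx < len(ocr_tokens):
            return ocr_tokens[idx]                    # image text is the predicted answer
        return dictionary[idx - len(ocr_tokens)]      # dictionary word is the predicted answer
```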
4. The text visual question-answering method according to claim 3, characterized in that the text visual question-answering model further comprises a context-aware answer modification module (CRM), and the text visual question-answering method further comprises:
when the image text is determined as the predicted answer, modifying the predicted answer by the CRM to obtain a modified predicted answer.
5. The text visual question-answering method according to claim 4, wherein the step of modifying the predicted answer by the CRM to obtain the modified predicted answer comprises:
processing a plurality of pre-stored external OCR recognition results and the predicted answer to obtain a score value corresponding to each of the plurality of external OCR recognition results;
determining the external OCR recognition result having the highest score value among the plurality of external OCR recognition results as the modified predicted answer.
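Claims 4 and 5 only require that each pre-stored external OCR recognition result receives a score value and that the highest-scoring result replaces the original prediction; they do not fix the scoring function. The sketch below uses plain string similarity as an illustrative stand-in for that score.

```python
from difflib import SequenceMatcher

def crm_modify(predicted_answer, external_ocr_results):
    """Return the external OCR result whose score against the predicted answer is highest.
    The similarity-based score is an assumption, not the patented scoring rule."""
    def score(candidate):
        return SequenceMatcher(None, candidate.lower(), predicted_answer.lower()).ratio()

    best_score, best_candidate = max((score(c), c) for c in external_ocr_results)
    return best_candidate
```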
6. A training device for a text visual question-answering model, characterized in that the text visual question-answering model comprises a pre-trained language model, a text representation acquisition module, an object representation acquisition module, a first multi-modal interaction network module and an answer prediction module, and the training device comprises:
a first training unit configured to process a first training data set through the text representation acquisition module to obtain a predicted semantic feature and a predicted text recognition result corresponding to an image text region in the first training data set, calculate a first loss function from the predicted semantic feature and a real semantic feature corresponding to the image text region, calculate a second loss function from the predicted text recognition result and a real text recognition result corresponding to the image text region, and train the text representation acquisition module by adjusting model parameters of the text representation acquisition module using the first loss function and the second loss function;
a second training unit configured to process a second training data set through the object representation acquisition module to obtain a predicted object class embedding vector for an object in the second training data set, calculate a third loss function from the predicted object class embedding vector and a real object class vector of the object, and train the object representation acquisition module by adjusting model parameters of the object representation acquisition module according to the third loss function;
a third training unit configured to obtain visually enhanced text representations for a third training data set through the trained text representation acquisition module, obtain semantically oriented object representations for the third training data set through the trained object representation acquisition module, obtain a question representation for a question sentence through the pre-trained language model, and train the first multi-modal interaction network module and the answer prediction module according to the visually enhanced text representations, the semantically oriented object representations, the question representation and a previous decoding vector.
7. A text visual question-answering device implemented by a text visual question-answering model, characterized in that the text visual question-answering model comprises a text representation acquisition module, an object representation acquisition module, a pre-trained language model, a first multi-modal interaction network module and an answer prediction module, and the text visual question-answering device comprises:
a text representation acquisition unit configured to obtain visually enhanced text representations from an image through the text representation acquisition module, wherein the text representation acquisition module comprises an Optical Character Recognition (OCR) module and a text-dependent visual semantic mapping network (TVS) module;
an object representation acquisition unit configured to obtain semantically oriented object representations from the image through the object representation acquisition module;
a question representation acquisition unit configured to obtain a question representation from a question sentence corresponding to the image through the pre-trained language model;
a first multi-modal interaction network unit configured to obtain, through the first multi-modal interaction network module, from the visually enhanced text representations, the semantically oriented object representations, the question representation and a previous decoding vector, text representation enhancement features corresponding to the visually enhanced text representations and a current predictive decoding vector corresponding to the previous decoding vector;
an answer prediction unit configured to obtain, through the answer prediction module, a predicted answer corresponding to the question sentence from the text representation enhancement features and the current predictive decoding vector.
8. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor,
wherein the processor is configured to execute the instructions to implement the training method of a text visual question-answering model according to claim 1 or the text visual question-answering method according to any one of claims 2 to 5.
9. A computer-readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the training method of a text visual question-answering model according to claim 1 or the text visual question-answering method according to any one of claims 2 to 5.
10. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the training method of a text visual question-answering model according to claim 1 or the text visual question-answering method according to any one of claims 2 to 5.
CN202111186856.9A 2021-10-12 2021-10-12 Text visual question-answering method and device Active CN113849623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111186856.9A CN113849623B (en) 2021-10-12 2021-10-12 Text visual question-answering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111186856.9A CN113849623B (en) 2021-10-12 2021-10-12 Text visual question-answering method and device

Publications (2)

Publication Number Publication Date
CN113849623A true CN113849623A (en) 2021-12-28
CN113849623B CN113849623B (en) 2024-04-09

Family

ID=78978144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111186856.9A Active CN113849623B (en) 2021-10-12 2021-10-12 Text visual question-answering method and device

Country Status (1)

Country Link
CN (1) CN113849623B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008293A (en) * 2018-10-06 2020-04-14 Shanghai Jiao Tong University Visual question-answering method based on structured semantic representation
CN109902166A (en) * 2019-03-12 2019-06-18 Beijing Baidu Netcom Science and Technology Co., Ltd. Vision Question-Answering Model, electronic equipment and storage medium
EP3709207A1 (en) * 2019-03-12 2020-09-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Visual question answering model, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
秦淑婧 (QIN Shujing); 杨关 (YANG Guan): "Research on visual question answering tasks with enhanced visual features", Journal of Zhongyuan University of Technology, no. 01, 29 February 2020 (2020-02-29), pages 56-61 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972944A (en) * 2022-06-16 2022-08-30 China Telecom Corporation Limited Training method and device of visual question-answering model, question-answering method, medium and equipment
CN114972944B (en) * 2022-06-16 2023-10-27 China Telecom Corporation Limited Training method and device for visual question-answering model, question-answering method, medium and equipment
CN117573839A (en) * 2024-01-12 2024-02-20 Alibaba Cloud Computing Co., Ltd. Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 Alibaba Cloud Computing Co., Ltd. Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117909535B (en) * 2024-03-15 2024-05-31 University of Science and Technology of China Combined understanding method, system, equipment and medium based on visual language model

Also Published As

Publication number Publication date
CN113849623B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US11120801B2 (en) Generating dialogue responses utilizing an independent context-dependent additive recurrent neural network
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
US11113175B1 (en) System for discovering semantic relationships in computer programs
US9805718B2 (en) Clarifying natural language input using targeted questions
CN113849623B (en) Text visual question-answering method and device
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN110738059B (en) Text similarity calculation method and system
US20230186033A1 (en) Guided text generation for task-oriented dialogue
US20220230061A1 (en) Modality adaptive information retrieval
CN110633456B (en) Language identification method, language identification device, server and storage medium
CN115394321A (en) Audio emotion recognition method, device, equipment, storage medium and product
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
CN111753062A (en) Method, device, equipment and medium for determining session response scheme
Dilawari et al. Neural attention model for abstractive text summarization using linguistic feature space
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium
CN114510561A (en) Answer selection method, device, equipment and storage medium
CN115408500A (en) Question-answer consistency evaluation method and device, electronic equipment and medium
CN113421551B (en) Speech recognition method, speech recognition device, computer readable medium and electronic equipment
CN113705232B (en) Text processing method and device
CN110598040B (en) Album recall method, device, equipment and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117009532B (en) Semantic type recognition method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant