CN111858882B

CN111858882B - Text visual question-answering system and method based on concept interaction and associated semantics

Info

Publication number: CN111858882B
Application number: CN202010584144.1A
Authority: CN
Inventors: 高联丽; 李向鹏; 宋井宽
Original assignee: University of Electronic Science and Technology of China; Guizhou University
Current assignee: University of Electronic Science and Technology of China; Guizhou University
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2022-08-09
Anticipated expiration: 2040-06-24
Also published as: CN111858882A

Abstract

The invention provides a text visual question-answering system and method based on concept interaction and associated semantics. The method and the device utilize the position relation of the object and the text information in the image to model, then model the text information and the object information through an OCR-object graph convolution network, learn abundant and directive characteristics through a door mechanism to the coding of the relation, and finally accurately pay attention to the object and the text in the image through a converter network, so that more accurate answers are obtained.

Description

Text visual question-answering system and method based on concept interaction and associated semantics

Technical Field

The invention belongs to the technical field of visual question answering, and particularly relates to a text visual question answering system and method based on concept interaction and associated semantics.

Background

With the development of the visual and language interaction field, the text visual question-answering algorithm has made great progress in recent years. As a branch of visual question-answering, the text visual question-answering algorithm places the gravity center of the algorithm on how to mine the relation between text and object in the picture and is used to support question answering, and it is also widely used in real applications, such as visual assistant for disabled people, education assistant for younger children, and so on. Compared with the traditional visual question-answering algorithm, the text visual question-answering algorithm aims to require a model to simultaneously understand visual information and text information in an image, carry out reasoning on the premise of important clues and finally answer a question about text in the image. If these requirements are to be fulfilled, this task will involve knowledge in several areas, such as object detection, text recognition and multi-modal reasoning. Because of the complexity of the system algorithm, the method is an urgent problem to be solved in the aspect of machine intelligence.

Most current visual question-answering algorithms focus their attention primarily on attention mechanisms and graph-based approaches. The method comprises the following steps of calculating the weight of an object or a text in an image, and calculating the weight of the object or the text in the image according to the weight of the object or the text in the image. The graph-based method can successfully construct a graph network based on the relationship for the elements in the image, and the algorithm can carry out reasoning according to the relationship of the elements in the graph so as to obtain an answer. These methods, while performing well on visual question-and-answer datasets, are very ineffective once they encounter text-related questions because these algorithms do not have the ability to read the text in the picture. Besides, most of the current methods are not deep enough for mining the relationship between text information and object information in pictures, most models only simply take the information as simple input to be fed into the models, but the application finds that the relationship information is very important in the actual reasoning process.

Disclosure of Invention

Aiming at the defects in the prior art, the text visual question-answering system and method based on concept interaction and associated semantics solve the problem that the object and visual relation is ignored in the prior art.

In order to achieve the above purpose, the invention adopts the technical scheme that:

the scheme provides a text visual question-answering system based on concept interaction and associated semantics, which comprises an object position extraction module, a first full connection layer connected with the object position extraction module, a text information extraction module, a second full connection layer connected with the text information extraction module, an OCR-object graph convolution network respectively connected with the first full connection layer and the second full connection layer, a gate step mechanism graph convolution network connected with the OCR-object graph convolution network, and a converter network connected with the gate step mechanism graph convolution network, wherein the converter network is connected with a bidirectional converter characterization encoder BERT;

the object position extraction module is used for extracting visual features in the image and corresponding position information by utilizing a pre-trained fast-RCNN fast region object detector model;

the first full-connection layer is used for fusing the visual features and the corresponding position information thereof to obtain the position information of the object;

the text information extraction module is used for detecting text information in the image and corresponding position information thereof by using an OCR recognition model;

the second full-connection layer is used for fusing the text information and the corresponding position information thereof to obtain the position information of the text;

the OCR-object graph convolution network is used for extracting the overlapping relation between the position information of the object and the position information of the text according to the position information of the object and the position information of the text;

the multi-gate step mechanism graph convolution network is used for coding the overlapping relation;

the bidirectional converter represents a coder BERT and is used for extracting problem information of a problem text;

and the converter network is used for obtaining the text visual answer according to the question information and the coding information.

The invention has the beneficial effects that: the method can effectively utilize the incidence relation between the text and the object in the aspect of text visual question answering, obtains better accuracy, and has better effect than the current method on the data set.

Further, the OCR-object graph convolution network comprises an OCR-OCR sub-graph, an object-object sub-graph and an OCR-object sub-graph;

and the OCR-OCR subgraph is used for modeling the relation between the texts detected in the image and establishing the relation between any two texts.

The object-object sub-graph is used for modeling the relationship between the objects detected in the image and establishing the relationship between any two objects.

And the OCR-object subgraph is used for modeling the relation between the text and the object to obtain the overlapping relation between the position information of the object and the position information of the text.

The beneficial effects of the further scheme are as follows: through feature enhancement of the relationship, the generated feature has stronger directivity, which is equivalent to using surrounding objects or texts as modifiers, such as "mobile phone with apple brand", "license plate number of automobile EA 286", and the like. The features thus generated are more conducive to reasoning about complex relationships.

Based on the system, the invention also provides a text visual question-answering method based on concept interaction and associated semantics, which comprises the following steps:

s1, extracting visual features and corresponding position information in the image by using a pre-trained fast-RCNN fast region object detector model, and fusing the visual features and the corresponding position information by using a first full-connection layer to obtain position information of an object;

s2, detecting text information in the image and corresponding position information thereof by using an OCR recognition model, and fusing the text information and the corresponding position information thereof by using a second full-connection layer to obtain the position information of the text;

s3, extracting problem information of the problem text by using a bidirectional converter representation encoder BERT;

s4, obtaining the overlapping relation between the position information of the object and the position information of the text by using an OCR-object graph convolution network according to the position information of the object and the position information of the text;

s5, encoding the overlapping relation by using a multi-gate step mechanism graph convolution network;

and S6, inputting the question information and the coding information into a converter network to obtain a text visual answer, and completing the text visual question and answer based on concept interaction and associated semantics.

Further, the expression fused in the step S1 is as follows:

wherein, F ^v Representing the fused feature matrix, LayerNorm the regularization layer, W _v And

respectively representing parameters needing to be updated and trained in the neural network, V representing the visual characteristics of the detected object, B ^v Indicating the position information of the detected outer frame of the object.

The beneficial effects of the further scheme are as follows: and fusing various attribute information corresponding to vision, and increasing the diversity of information contained in the visual features.

Still further, the step S2 includes the steps of:

s201, extracting text information of letter level in each detection image by using a FastText rapid text model;

s202, extracting appearance information of texts in each detection image by using a fast-RCNN rapid region object detector;

s203, detecting text information at letter level and position information corresponding to appearance information in the text by using an OCR model;

and S204, fusing the text information at the letter level, the appearance information in the text and the position information by using a second full-connection layer to obtain the position information of the text.

The beneficial effects of the further scheme are as follows: there are diverse features obtained with extracted text: text features, visual features, and corresponding location features.

Still further, the expression of the position information of the text in step S204 is as follows:

wherein, F ^o Indicating the location information of the text, LayerNorm indicates the regularization layer,

and

respectively representing trainable parameters of the model, O ^ft Text information indicating the detected text, O ^fr Visual information representing the detected text, B ^o Position information indicating the detected text outer border.

The beneficial effects of the above further scheme are: and fusing various attribute information corresponding to the detected text, and increasing the diversity of information contained in the visual features.

Still further, the expression of the question information of the question text extracted in step S3 is as follows:

Q＝{q ₁ ,...,q _K }＝BERT(w ₁ ,...,w _K )

wherein Q represents problem information, Q _K Representing semantic information of the K-th word, BERT representing a bidirectional transducer characterizing the encoder, w _K Representing a one-hot code represented by the K-th word.

The beneficial effects of the above further scheme are: and coding the question information and extracting important semantic information in the question.

Still further, the expression of the OCR-object graph convolutional network in step S4 is as follows:

wherein A represents an OCR-object correlation diagram, A ^v Representing object-object subgraph, A ^o Represented as an OCR-OCR sub-graph, A ^r Represented as an OCR-object sub-graph, A ^rT Is represented by A ^r The conjugate transpose of (c).

The beneficial effects of the further scheme are as follows: and constructing an OCR-object association graph according to the position information and the information of the semantic level.

Still further, the multi-gate step graph convolution network in step S5 is formed by executing step J with a single-gate step graph convolution network;

the expression of the graph convolution network of the single-step door mechanism is as follows:

X _t+1 ＝GatedGCN(A,X _t )

wherein, X _t Input to the gated graph convolution network, X, representing step t _t+1 The output of the gated graph convolution network of the t step and the input of the gated graph convolution network of the t +1 step are shown, A represents an OCR-object correlation graph, and the gated graph convolution network of a GatedGCN single step.

The beneficial effects of the further scheme are as follows: and coding the relation information of the detection text and the detection object into the characteristics, so that the generated characteristics have directivity.

Still further, the expression of the text visual answer in S6 is as follows:

wherein, y _t A visual response to the text is represented,

representing a probability distribution of answers selected from the answer vocabulary,

representing a probability distribution of answers to pick from the test text,

and

respectively representing trainable parameters in the network, O ^joint An output representing corresponding location coordinates in the output of the multi-gate step map convolutional network,

representing the last hidden unit in the output of the multi-gate step map convolutional network, Concat represents the merge join operation.

The beneficial effects of the further scheme are as follows: through plagiarism mechanism, the final answer is determined to be selected from the answer table or selected from the detected text, and the flexibility of the question-answer model is increased.

Drawings

FIG. 1 is a system block diagram of the present invention.

FIG. 2 is a flow chart of the method of the present invention.

Fig. 3 is an algorithm framework diagram in the present embodiment.

Detailed Description

The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the invention using the inventive concept are protected.

Example 1

As shown in fig. 1, the present invention provides a text visual question-answering system based on concept interaction and associated semantics, which includes an object position extraction module, a first full connection layer connected to the object position extraction module, a text information extraction module, a second full connection layer connected to the text information extraction module, an OCR-object graph convolution network connected to the first full connection layer and the second full connection layer, a gate-step mechanism graph convolution network connected to the OCR-object graph convolution network, and a converter network connected to the gate-step mechanism graph convolution network, wherein the converter network is connected to a bidirectional converter characterization encoder BERT; the object position extraction module is used for extracting visual features in the image and corresponding position information by utilizing a pre-trained fast region object detector fast-RCNN model; the first full-connection layer is used for fusing the visual features and the corresponding position information thereof to obtain the position information of the object; the text information extraction module is used for detecting text information in the image and corresponding position information by using the OCR recognition model; the second full-connection layer is used for fusing the text information and the corresponding position information thereof to obtain the position information of the text; the OCR-object graph convolution network is used for extracting the overlapping relation between the position information of the object and the position information of the text according to the position information of the object and the position information of the text; the multi-gate step mechanism graph convolution network is used for coding the overlapping relation; the bidirectional converter characterizes the encoder BERT and is used for extracting problem information of a problem text; and the converter network is used for obtaining the text visual answer according to the question information and the coding information. The OCR-object graph convolution network comprises an OCR-OCR sub-graph, an object-object sub-graph and an OCR-object sub-graph; the OCR-OCR subgraph is mainly used for modeling the relation between texts detected in an image and establishing a relation between any two texts. The object-object sub-graph is mainly used for modeling the relationship between objects detected in the image and establishing a relationship between any two objects. The OCR-object subgraph is mainly used for modeling the relationship between text and an object detected in an image, and if the text region and the object region are overlapped in space, the relationship is supposed to exist between the text region and the object region, and if the text region and the object region are not overlapped, the relationship is supposed to not exist between the text region and the object region.

In this embodiment, the present invention mainly provides a Text Visual Question-Answering algorithm (Concept Interaction: expanding the Relational Semantics for Text-based Visual Question Answering) capable of Concept Interaction and associated semantic exploration for a Text Visual Question-Answering problem, and the present invention performs spatial modeling on the relationship between an object and a Visual object that have been neglected in previous work, and if the Text and the object overlap spatially, the present application believes that there is a relationship between the two, such as the name of a book and the license plate number of an automobile. The name of the detected book must be overlapped with the book, the license plate number must be overlapped with the automobile, an OCR-object graph convolution network is built based on the observation, graph convolution operation based on a door mechanism is carried out according to the relation, the output characteristics of the network can ensure that the node of the network can be described by surrounding adjacent nodes, the final characteristics can have directivity, and the directivity is required by a visual question-answering algorithm. The features and the question information after convolution are jointly input into a converter network according to the question information, and the converter network can carry out concept interaction on objects and texts needing attention according to the question information and explore final answers of the questions. Meanwhile, the generated answer length can be ensured to be variable, and the method is more suitable for the condition of answering questions in reality. The method verifies the system on the current mainstream data set, obtains the best current result, and proves that the method obtains better effect and more accurate predicted answer than the current mainstream method. The method of the application obtains 41.14% accuracy on the validation set of the TextVQA data set and 40.77% accuracy on the test set. On the ST-VQA data set, the accuracy of the method also reaches 41.1%, the average normalized Lewenstein similarity also reaches 0.500, and the best current result is obtained.

Example 2

Based on the system, the invention also provides a text visual question-answering method based on concept interaction and associated semantics, the basic idea is to utilize the position relationship between an object and text information in an image to model the relationship, then model the text information and the object information through an OCR-object graph convolution network, learn richer and more directional characteristics for the coding of the relationship based on a door mechanism, and then accurately pay attention to the object and the text in the image through a converter network according to question information, thereby obtaining more accurate answers. As shown in fig. 2-3, the implementation method is as follows:

s1, extracting visual features and corresponding position information in the image by using a pre-trained fast region object detector, namely a fast-RCNN model, and fusing the visual features and the corresponding position information by using a first full-connection layer to obtain the position information of the object;

in this embodiment, for a given picture, the application first uses a pre-trained fast-RCNN model to extract visual features. The fast-RCNN is a training on Visual Genome dataset that captures the main objects appearing in the picture. For each picture, the fast-RCNN model is used for extracting the first 100 local object features according to the confidence degree, the local features correspond to 100 different regions in the picture, each local feature is represented by a 2048-dimensional vector, the position of each region is also represented by a four-dimensional vector, and the normalized coordinate positions of two points, namely the lower left point and the upper right point of a region frame are represented, so that a 100 x 2048 feature matrix can be obtained as a visual representation and the corresponding 100 x 4 position information. In order to effectively fuse the two kinds of information, a full connection layer is used to fuse the two kinds of information, and the formula of fusion is as follows:

representing parameters of the neural network that require updated training, V representing visual characteristics of the detected object, B ^v Indicating the position information of the detected outer frame of the object.

S2, detecting text information in the image and corresponding position information thereof by using an OCR recognition model, and fusing the text information and the corresponding position information thereof by using a second full-connection layer to obtain the position information of the text, wherein the implementation method comprises the following steps:

In the embodiment, for a given picture, the application uses a trained OCR recognition model, Rosetta (Rosetta), to detect text information and position information appearing in an image. Considering that text information also comprises text information at letter level and appearance information corresponding to the text information, the text information extraction method extracts semantic information at letter level through fast text model FastText and extracts appearance information of the text through fast region object detector Faster-RCNN aiming at each detected text, and similarly, the text information, appearance representation of the text information and corresponding position information of character sets are fused by using a full connection layer, and a fusion formula is as follows:

and

respectively representing trainable parameters of the model, O ^ft Text information indicating the detected text, O ^fr Visual information representing the detected text, B ^o Position information of the detected text outer border of the representation.

in this embodiment, for a given problem, a Bidirectional converter Representation Encoder (BERT) is used to extract features of a problem text, the Bidirectional converter Representation Encoder is pre-trained, the problem text is finely tuned in a training process of the present application, and the extraction process of the problem features is as follows:

Q＝{q ₁ ,...,q _K }＝BERT(w ₁ ,...,w _K )

in this embodiment, in the real-world example, if the text in the picture and the object overlap, there is a relationship between the two, so an OCR-object graph convolution network is defined according to the overlap relationship between the position of the object in the picture and the position of the text, and the OCR-object graph convolution network includes three sub-graphs: an OCR-OCR map, an object-object map, and an OCR-object map. Considering the relationship between the OCR inner parts, the relationship between the objects is at the semantic level, and it is assumed that there is a semantic relationship between the two, so the application defines the OCR-OCR graph and the object-object graph as an identity matrix. The OCR-object graph can be defined according to an overlapping relation, if the ith text and the jth object have an overlapping relation, the corresponding A is carried out _i,j 1, if there is no overlapping relationship between the two, it corresponds to A _i,j Is 0. The final OCR-object graph convolution network A is thus defined as follows:

wherein A represents an OCR-object correlation diagram, A ^v Representing object-object diagram, A ^o Represented as an OCR-OCR graph, A ^r Represented as an OCR-object graph, A ^rT Is represented by A ^r The conjugate transpose of (c).

in the embodiment, in order to encode the incidence relation between the text and the object into the features, a multi-step door machine is used for constructing the graph convolution network, the single-step graph convolution network can only learn the semantic information of the adjacent nodes, but the logic in real life is possibly complex, and the single-step reasoning cannot meet the display requirement, so that the method is expanded into the multi-step graph convolution network. Node feature X of OCR-object graph ₀ Expressed as:

X ₀ ＝[F ^v ,F ^o ]

wherein, F ^v Characteristic of the detected object, F ^o Representing features of the detected text.

The calculation process of the single-step gate mechanism graph convolution network element is as follows:

Y＝AW _g X _t

U＝σ(W ₁ Y+W ₂ X _t )

R＝σ(W ₃ Y+W ₄ X _t )

wherein A represents an OCR-object correlation diagram, W _g 、W ₁ 、W ₂ 、W ₃ 、W ₄ 、W ₅ And W ₆ Respectively representing parameters which can be learned in the model, Y representing the primary characteristics which are fused by the relational graph, and sigma (DEG) representing a Sigmoid functionTanh (-) denotes a hyperbolic tangent function, U denotes an update gate, R denotes a reset gate,

representing a characteristic signal, X, passing through a reset gate _t Shown is the input to the gated graph convolution network of step t, X _t+1 The output is shown, while it is also the input for step t + 1.

The application simplifies the single-step door graph convolution network into:

X _t+1 ＝GatedGCN(A,X _t )

wherein, X _t Shown is the input to the gated graph convolution network of step t, X _t+1 Showing its output and its input at step t +1, A showing the OCR-object correlation diagram, and finally performing step J to obtain the fusion feature F ^joint 。

In order to expand the correlation length of the graph, the gate graph convolution neural network is executed with J steps to ensure that the correlation length is J, and therefore the final output characteristic is X _J 。

In this embodiment, because the questions needed in the text visual question-answering indicate which information in the pictures the model needs to pay attention to, a converter network related to the questions is designed, the converter network is composed of multiple converter layers, the four concepts of visual image information, text information, question information and previous output are jointly input into the concept interaction converter network, and the specific calculation process is as shown in the formula:

F _t ^joint ＝[X _J ,Q,P _t-1 ]

F _t ^final ＝Transformer(F _t ^joint )

wherein the fusion characteristic obtained after the step J is finally executed is X _J Q denotes the problem feature after coding, P _t-1 Represents the first t-1Output of the first t-1 step of the answer, F _t ^final Representing the characteristics of the t-th step after being coded by the Transformer, the Transformer represents a multi-layer Transformer network, F _t ^joint Representing the characteristics of the t step before being coded by the converter.

In this example, the final feature expression F is taken _t ^final Last hidden layer unit of

As decoded hidden layer state and uses this hidden layer state to predict the final answer:

wherein the content of the first and second substances,

representing the probability distribution when an answer is selected from the list of answer words,

representing trainable parameters in the network.

However, considering that the answer to the question is very likely to select the text in a picture as the answer, a plagiarism idea is introduced to help the model select the final answer directly from the text, and it is assumed that the OCR feature is intercepted from the final coding feature to be O ^joint And the final selected result answer calculation process is as follows:

wherein the content of the first and second substances,

representing the probability distribution of the answer when it picks the answer from the test text,

and

respectively representing trainable parameters in the network, O ^joint Representing an output representing OCR corresponding location coordinates in the output of the gated graph convolutional network,

output F representing a gated graph convolutional network ^joint The last hidden unit in (1).

Thus, two answer feature distributions are obtained, the two answer feature distributions are connected together, and finally, the answer with the highest score is selected as the generated answer in the step t:

wherein, y _t Showing the distribution of the final answer,

representing the probability distribution when selecting an answer from the list of answer words,

the probability distribution of the answer when it is selected from the test text is shown. In addition, in the training phase of the model, the present embodiment uses the cross entropy of binarization as the final loss function, and in the inference phase, when the answer generates "END", the model stops continuing the operation, and the final answer is the answer to the question.

Claims

1. A text visual question-answering system based on concept interaction and associated semantics is characterized by comprising an object position extraction module, a first full connection layer connected with the object position extraction module, a text information extraction module, a second full connection layer connected with the text information extraction module, an OCR-object graph convolution network respectively connected with the first full connection layer and the second full connection layer, a multi-gate-step mechanism graph convolution network connected with the OCR-object graph convolution network, and a converter network connected with the multi-gate-step mechanism graph convolution network, wherein the converter network is connected with a bidirectional converter representation encoder BERT;

the bidirectional converter represents an encoder BERT and is used for extracting problem information of a problem text;

the converter network is used for obtaining text visual answers according to the question information and the coding information;

the concept interaction is to input the four concepts of visual image information, text information, problem information and previous output into a concept interaction converter network, and the specific calculation process is shown as a formula:

F _t ^joint ＝[X _J ,Q,P _t-1 ]

F _t ^final ＝Transformer(F _t ^joint )

wherein the fusion characteristic obtained after the step J is finally executed is X _J Q denotes the problem feature after coding, P _t-1 Output of the first t-1 step representing the first t-1 answers, F _t ^final Representing the characteristics of the t-th step after being coded by the Transformer, the Transformer represents a multi-layer Transformer network, F _t ^joint Representing the characteristics of the t step before the coding of the converter;

the multi-door step mechanism graph convolution network is formed by executing J steps by a single-door mechanism graph convolution network;

X _t+1 ＝GatedGCN(A,X _t )

2. The concept-interaction and associated-semantic-based text visual question-answering system of claim 1, wherein the OCR-object graph volume network comprises an OCR-OCR subgraph, an object-object subgraph and an OCR-object subgraph;

the OCR-OCR sub-image is used for modeling the relation between texts detected in the image and establishing the relation between any two texts;

the object-object subgraph is used for modeling the relationship between the objects detected in the image and establishing the relationship between any two objects;

3. A text visual question-answering method based on concept interaction and associated semantics is characterized by comprising the following steps:

s4, obtaining the overlapping relation between the position information of the object and the position information of the text by utilizing an OCR-object graph convolution network according to the position information of the object and the position information of the text;

s6, inputting the question information and the coding information into a converter network to obtain a text visual answer, and completing the text visual question and answer based on concept interaction and associated semantics; wherein the content of the first and second substances,

the concept interaction is that the four concepts of visual image information, text information, problem information and previous output are input into a concept interaction converter network together, and the specific calculation process is shown as a formula:

F _t ^joint ＝[X _J ,Q,P _t-1 ]

F _t ^final ＝Transformer(F _t ^joint )

the multi-step graph convolution network in the step S5 is formed by executing J steps by the single-step graph convolution network;

X _t+1 ＝GatedGCN(A,X _t )

4. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression fused in the step S1 is as follows:

F ^v ＝LayerNorm(W _v V+W _b ^f B ^v )

wherein, F ^v Representing the fused feature matrix, LayerNorm the regularization layer, W _v And W _b ^f Respectively representing parameters needing to be updated and trained in the neural network, V representing the visual characteristics of the detected object, B ^v Indicating the position information of the detected outer frame of the object.

5. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein said step S2 comprises the steps of:

6. The text visual question-answering method based on concept interaction and associated semantics of claim 5, wherein the expression of the position information of the text in the step S204 is as follows:

and

7. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the question information of the question text extracted in the step S3 is as follows:

Q＝{q ₁ ,...,q _K }＝BERT(w ₁ ,...,w _K )

wherein Q represents problem information, Q _K Representing semantic information of the Kth word, BERT representing a bidirectional converter characterizing the encoder, w _K Representing a one-hot code represented by the K-th word.

8. The text visual question answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the OCR-object graph convolutional network in the step S4 is as follows:

wherein A represents an OCR-object correlation diagram，A ^v Representing object-object subgraph, A ^o Represented as an OCR-OCR sub-graph, A ^r Represented as an OCR-object sub-graph, A ^rT Is shown as A ^r The conjugate transpose of (c).

9. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the text visual answer in the S6 is as follows:

wherein, y _t A visual response to the text is represented,

representing a probability distribution of answers to pick from the test text,

and