CN111858882B - Text visual question-answering system and method based on concept interaction and associated semantics - Google Patents

Text visual question-answering system and method based on concept interaction and associated semantics

Info

Publication number
CN111858882B
Authority
CN
China
Prior art keywords
text
information
position information
ocr
representing
Prior art date
Legal status
Active
Application number
CN202010584144.1A
Other languages
Chinese (zh)
Other versions
CN111858882A (en)
Inventor
高联丽
李向鹏
宋井宽
Current Assignee
University of Electronic Science and Technology of China
Guizhou University
Original Assignee
University of Electronic Science and Technology of China
Guizhou University
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, Guizhou University filed Critical University of Electronic Science and Technology of China
Priority to CN202010584144.1A priority Critical patent/CN111858882B/en
Publication of CN111858882A publication Critical patent/CN111858882A/en
Application granted granted Critical
Publication of CN111858882B publication Critical patent/CN111858882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Abstract

The invention provides a text visual question-answering system and method based on concept interaction and associated semantics. The method models the spatial relationship between the objects and the text information in the image, encodes the text information and the object information through an OCR-object graph convolution network, learns rich and directional features by applying a gate mechanism to the encoding of this relationship, and finally attends accurately to the objects and the text in the image through a transformer network, thereby obtaining more accurate answers.

Description

Text visual question-answering system and method based on concept interaction and associated semantics
Technical Field
The invention belongs to the technical field of visual question answering, and particularly relates to a text visual question answering system and method based on concept interaction and associated semantics.
Background
With the development of the field of vision-and-language interaction, text visual question-answering algorithms have made great progress in recent years. As a branch of visual question answering, the text visual question-answering algorithm focuses on mining the relationship between the text and the objects in a picture and using it to support answering questions; it is also widely used in real applications, such as visual assistants for people with disabilities and educational assistants for young children. Compared with traditional visual question-answering algorithms, a text visual question-answering algorithm requires the model to understand both the visual information and the textual information in an image, reason over these important clues, and finally answer a question about the text in the image. Meeting these requirements involves knowledge from several areas, such as object detection, text recognition and multi-modal reasoning. Because of this complexity, the task remains an urgent problem to be solved in machine intelligence.
Most current visual question-answering algorithms focus primarily on attention mechanisms and graph-based approaches. Attention-based methods calculate weights for the objects or the text in the image according to the question and use the weighted features to answer it. Graph-based methods construct a relation graph over the elements in the image, and the algorithm reasons over the relationships between these elements to obtain an answer. Although these methods perform well on visual question-answering datasets, they are very ineffective once they encounter text-related questions, because these algorithms do not have the ability to read the text in the picture. In addition, most current methods do not mine the relationship between the textual information and the object information in the picture deeply enough: most models simply feed this information into the model as plain input, whereas the present application finds that this relational information is very important in the actual reasoning process.
Disclosure of Invention
Aiming at the above defects in the prior art, the text visual question-answering system and method based on concept interaction and associated semantics provided by the invention solve the problem that the relationship between the text and the objects in the image is ignored in the prior art.
In order to achieve the above purpose, the invention adopts the technical scheme that:
the scheme provides a text visual question-answering system based on concept interaction and associated semantics, which comprises an object position extraction module, a first full connection layer connected with the object position extraction module, a text information extraction module, a second full connection layer connected with the text information extraction module, an OCR-object graph convolution network respectively connected with the first full connection layer and the second full connection layer, a multi-step gated graph convolution network connected with the OCR-object graph convolution network, and a transformer network connected with the multi-step gated graph convolution network, wherein the transformer network is connected with a bidirectional transformer representation encoder BERT;
the object position extraction module is used for extracting visual features in the image and their corresponding position information by using a pre-trained Faster-RCNN region object detector model;
the first full-connection layer is used for fusing the visual features and the corresponding position information thereof to obtain the position information of the object;
the text information extraction module is used for detecting text information in the image and corresponding position information thereof by using an OCR recognition model;
the second full-connection layer is used for fusing the text information and the corresponding position information thereof to obtain the position information of the text;
the OCR-object graph convolution network is used for extracting the overlapping relation between the position information of the object and the position information of the text according to the position information of the object and the position information of the text;
the multi-step gated graph convolution network is used for encoding the overlapping relation;
the bidirectional transformer representation encoder BERT is used for extracting question information from the question text;
and the transformer network is used for obtaining the text visual answer according to the question information and the encoded information.
The invention has the beneficial effects that the method can effectively utilize the association between the text and the objects in text visual question answering, achieves better accuracy, and performs better than current methods on the benchmark datasets.
Further, the OCR-object graph convolution network comprises an OCR-OCR sub-graph, an object-object sub-graph and an OCR-object sub-graph;
and the OCR-OCR subgraph is used for modeling the relation between the texts detected in the image and establishing the relation between any two texts.
The object-object sub-graph is used for modeling the relationship between the objects detected in the image and establishing the relationship between any two objects.
And the OCR-object subgraph is used for modeling the relation between the text and the object to obtain the overlapping relation between the position information of the object and the position information of the text.
The beneficial effects of the further scheme are as follows: through feature enhancement of the relationship, the generated feature has stronger directivity, which is equivalent to using surrounding objects or texts as modifiers, such as "mobile phone with apple brand", "license plate number of automobile EA 286", and the like. The features thus generated are more conducive to reasoning about complex relationships.
Based on the system, the invention also provides a text visual question-answering method based on concept interaction and associated semantics, which comprises the following steps:
s1, extracting visual features and corresponding position information in the image by using a pre-trained Faster-RCNN region object detector model, and fusing the visual features and the corresponding position information by using a first full-connection layer to obtain position information of an object;
s2, detecting text information in the image and corresponding position information thereof by using an OCR recognition model, and fusing the text information and the corresponding position information thereof by using a second full-connection layer to obtain the position information of the text;
s3, extracting question information of the question text by using a bidirectional transformer representation encoder BERT;
s4, obtaining the overlapping relation between the position information of the object and the position information of the text by using an OCR-object graph convolution network according to the position information of the object and the position information of the text;
s5, encoding the overlapping relation by using a multi-step gated graph convolution network;
and S6, inputting the question information and the encoded information into a transformer network to obtain a text visual answer, and completing the text visual question answering based on concept interaction and associated semantics.
The invention has the beneficial effects that the method can effectively utilize the association between the text and the objects in text visual question answering, achieves better accuracy, and performs better than current methods on the benchmark datasets.
Further, the expression fused in the step S1 is as follows:

F_v = LayerNorm(W_v V + W_b^f B_v)

wherein F_v represents the fused feature matrix, LayerNorm represents the layer normalization operation, W_v and W_b^f respectively represent parameters of the neural network that need to be updated during training, V represents the visual features of the detected objects, and B_v represents the position information of the detected object bounding boxes.
The beneficial effects of the further scheme are as follows: the various attribute information corresponding to the vision is fused, which increases the diversity of the information contained in the visual features.
Still further, the step S2 includes the steps of:
s201, extracting text information of letter level in each detection image by using a FastText rapid text model;
s202, extracting appearance information of the text in each detection image by using a Faster-RCNN region object detector;
s203, detecting text information at letter level and position information corresponding to appearance information in the text by using an OCR model;
and S204, fusing the text information at the letter level, the appearance information in the text and the position information by using a second full-connection layer to obtain the position information of the text.
The beneficial effects of the further scheme are as follows: diverse features are obtained for the extracted text: text features, visual features, and the corresponding location features.
Still further, the expression of the position information of the text in step S204 is as follows:

F_o = LayerNorm(W_o^ft O_ft + W_o^fr O_fr + W_b^o B_o)

wherein F_o represents the position information of the text, LayerNorm represents the layer normalization operation, W_o^ft, W_o^fr and W_b^o respectively represent trainable parameters of the model, O_ft represents the textual information of the detected text, O_fr represents the visual information of the detected text, and B_o represents the position information of the detected text bounding boxes.
The beneficial effects of the above further scheme are: the various attribute information corresponding to the detected text is fused, which increases the diversity of the information contained in the text features.
Still further, the expression of the question information of the question text extracted in step S3 is as follows:

Q = {q_1, ..., q_K} = BERT(w_1, ..., w_K)

wherein Q represents the question information, q_K represents the semantic information of the K-th word, BERT represents the bidirectional transformer representation encoder, and w_K represents the one-hot encoding of the K-th word.
The beneficial effects of the above further scheme are: the question information is encoded and the important semantic information in the question is extracted.
Still further, the expression of the OCR-object graph convolution network in step S4 is as follows:

A = [[A_v, A_r^T], [A_r, A_o]]

wherein A represents the OCR-object association graph (a block matrix), A_v represents the object-object sub-graph, A_o represents the OCR-OCR sub-graph, A_r represents the OCR-object sub-graph, and A_r^T represents the transpose of A_r.
The beneficial effects of the further scheme are as follows: an OCR-object association graph is constructed according to the position information and the semantic-level information.
Still further, the multi-step gated graph convolution network in step S5 is formed by executing a single-step gated graph convolution network for J steps;

the expression of the single-step gated graph convolution network is as follows:

X^{t+1} = GatedGCN(A, X^t)

wherein X^t represents the input of the gated graph convolution network at step t, X^{t+1} represents the output of the gated graph convolution network at step t and the input of the gated graph convolution network at step t+1, A represents the OCR-object association graph, and GatedGCN represents the single-step gated graph convolution network.
The beneficial effects of the further scheme are as follows: the relation information between the detected text and the detected objects is encoded into the features, so that the generated features have directivity.
Still further, the expression of the text visual answer in step S6 is as follows:

y_t^voc = W^voc h_t^dec

y_t^ocr = (W^ocr O_joint) h_t^dec

y_t = argmax(Concat(y_t^voc, y_t^ocr))

wherein y_t represents the text visual answer, y_t^voc represents the probability distribution over answers selected from the answer vocabulary, y_t^ocr represents the probability distribution over answers picked from the detected text, W^voc and W^ocr respectively represent trainable parameters in the network, O_joint represents the output at the positions corresponding to the OCR tokens in the output of the multi-step gated graph convolution network, h_t^dec represents the last hidden unit in the output of the multi-step gated graph convolution network, and Concat represents the concatenation operation.
The beneficial effects of the further scheme are as follows: through the copy mechanism, the model determines whether the final answer is selected from the answer vocabulary or from the detected text, which increases the flexibility of the question-answering model.
Drawings
FIG. 1 is a system block diagram of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
Fig. 3 is an algorithm framework diagram in the present embodiment.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments. To those skilled in the art, various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and all matters produced using the inventive concept are protected.
Example 1
As shown in fig. 1, the present invention provides a text visual question-answering system based on concept interaction and associated semantics, which includes an object position extraction module, a first full connection layer connected to the object position extraction module, a text information extraction module, a second full connection layer connected to the text information extraction module, an OCR-object graph convolution network connected to the first full connection layer and the second full connection layer, a multi-step gated graph convolution network connected to the OCR-object graph convolution network, and a transformer network connected to the multi-step gated graph convolution network, wherein the transformer network is connected to a bidirectional transformer representation encoder BERT. The object position extraction module is used for extracting visual features in the image and their corresponding position information by using a pre-trained Faster-RCNN region object detector model. The first full connection layer is used for fusing the visual features and their corresponding position information to obtain the position information of the object. The text information extraction module is used for detecting text information in the image and its corresponding position information by using an OCR recognition model. The second full connection layer is used for fusing the text information and its corresponding position information to obtain the position information of the text. The OCR-object graph convolution network is used for extracting the overlapping relation between the position information of the object and the position information of the text. The multi-step gated graph convolution network is used for encoding the overlapping relation. The bidirectional transformer representation encoder BERT is used for extracting question information from the question text. The transformer network is used for obtaining the text visual answer according to the question information and the encoded information. The OCR-object graph convolution network comprises an OCR-OCR sub-graph, an object-object sub-graph and an OCR-object sub-graph. The OCR-OCR sub-graph is mainly used for modeling the relation between the texts detected in the image and establishing a relation between any two texts. The object-object sub-graph is mainly used for modeling the relationship between the objects detected in the image and establishing a relationship between any two objects. The OCR-object sub-graph is mainly used for modeling the relationship between the text and the objects detected in the image: if a text region and an object region overlap in space, a relationship is assumed to exist between them; if they do not overlap, no relationship is assumed.
In this embodiment, the present invention mainly provides a text visual question-answering algorithm capable of concept interaction and associated-semantics exploration (Concept Interaction: expanding the Relational Semantics for Text-based Visual Question Answering). The invention performs spatial modeling on the relationship between the text and the visual objects, which has been neglected in previous work: if a text and an object overlap spatially, the present application believes that there is a relationship between the two, such as the name of a book or the license plate number of an automobile. The detected name of a book must overlap with the book, and the license plate number must overlap with the automobile. An OCR-object graph convolution network is built based on this observation, and a graph convolution operation based on a gate mechanism is performed according to this relationship, so that in the output features of the network each node can be described by its neighboring nodes. The final features therefore have directivity, which is exactly what a visual question-answering algorithm requires. The convolved features and the question information are jointly input into a transformer network which, guided by the question, can carry out concept interaction between the objects and the text that need attention and explore the final answer to the question. Meanwhile, the length of the generated answer is variable, which is better suited to answering questions in reality. The present application verifies the system on the current mainstream datasets and obtains the best results to date, which proves that the method achieves better performance and more accurate predicted answers than current mainstream methods. The method of the present application obtains 41.14% accuracy on the validation set of the TextVQA dataset and 40.77% accuracy on the test set. On the ST-VQA dataset, the accuracy of the method reaches 41.1%, and the average normalized Levenshtein similarity reaches 0.500, which is the best current result.
Example 2
Based on the above system, the invention also provides a text visual question-answering method based on concept interaction and associated semantics. The basic idea is to model the relationship using the positional relationship between the objects and the text information in the image, then model the text information and the object information through an OCR-object graph convolution network, learn richer and more directional features for the encoding of this relationship based on a gate mechanism, and then, according to the question information, accurately attend to the objects and the text in the image through a transformer network, thereby obtaining more accurate answers. As shown in fig. 2-3, the implementation method is as follows:
s1, extracting visual features and corresponding position information in the image by using a pre-trained Faster-RCNN region object detector model, and fusing the visual features and the corresponding position information by using a first full-connection layer to obtain the position information of the object;
in this embodiment, for a given picture, the application first uses a pre-trained fast-RCNN model to extract visual features. The fast-RCNN is a training on Visual Genome dataset that captures the main objects appearing in the picture. For each picture, the fast-RCNN model is used for extracting the first 100 local object features according to the confidence degree, the local features correspond to 100 different regions in the picture, each local feature is represented by a 2048-dimensional vector, the position of each region is also represented by a four-dimensional vector, and the normalized coordinate positions of two points, namely the lower left point and the upper right point of a region frame are represented, so that a 100 x 2048 feature matrix can be obtained as a visual representation and the corresponding 100 x 4 position information. In order to effectively fuse the two kinds of information, a full connection layer is used to fuse the two kinds of information, and the formula of fusion is as follows:
F_v = LayerNorm(W_v V + W_b^f B_v)

wherein F_v represents the fused feature matrix, LayerNorm represents the layer normalization operation, W_v and W_b^f represent parameters of the neural network that need to be updated during training, V represents the visual features of the detected objects, and B_v represents the position information of the detected object bounding boxes.
S2, detecting text information in the image and corresponding position information thereof by using an OCR recognition model, and fusing the text information and the corresponding position information thereof by using a second full-connection layer to obtain the position information of the text, wherein the implementation method comprises the following steps:
s201, extracting text information of letter level in each detection image by using a FastText rapid text model;
s202, extracting appearance information of the text in each detection image by using a Faster-RCNN region object detector;
s203, detecting text information at letter level and position information corresponding to appearance information in the text by using an OCR model;
and S204, fusing the text information at the letter level, the appearance information in the text and the position information by using a second full-connection layer to obtain the position information of the text.
In this embodiment, for a given picture, the application uses a trained OCR recognition model, Rosetta, to detect the text information appearing in the image and its position information. Considering that the text information comprises both letter-level text information and the corresponding appearance information, for each detected text the application extracts letter-level semantic information through the FastText fast text model and extracts the appearance information of the text through the Faster-RCNN region object detector. Similarly, a full connection layer is used to fuse the textual information, the appearance representation and the corresponding position information, and the fusion formula is as follows:
F_o = LayerNorm(W_o^ft O_ft + W_o^fr O_fr + W_b^o B_o)

wherein F_o represents the position information of the text, LayerNorm represents the layer normalization operation, W_o^ft, W_o^fr and W_b^o respectively represent trainable parameters of the model, O_ft represents the textual information of the detected text, O_fr represents the visual information of the detected text, and B_o represents the position information of the detected text bounding boxes.
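A corresponding sketch of the text-side fusion is given below, again assuming PyTorch; the 300-dimensional FastText features and the other dimensions are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class OCRFusion(nn.Module):
    """F_o = LayerNorm(W_o^ft O_ft + W_o^fr O_fr + W_b^o B_o): fuse the three OCR attributes."""
    def __init__(self, ft_dim=300, fr_dim=2048, box_dim=4, hidden_dim=768):
        super().__init__()
        self.w_ft = nn.Linear(ft_dim, hidden_dim)   # FastText letter-level features
        self.w_fr = nn.Linear(fr_dim, hidden_dim)   # Faster-RCNN appearance features
        self.w_b = nn.Linear(box_dim, hidden_dim)   # box coordinates
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, O_ft, O_fr, B_o):
        return self.norm(self.w_ft(O_ft) + self.w_fr(O_fr) + self.w_b(B_o))
```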
S3, extracting question information of the question text by using a bidirectional transformer representation encoder BERT;
in this embodiment, for a given problem, a Bidirectional converter Representation Encoder (BERT) is used to extract features of a problem text, the Bidirectional converter Representation Encoder is pre-trained, the problem text is finely tuned in a training process of the present application, and the extraction process of the problem features is as follows:
Q={q 1 ,...,q K }=BERT(w 1 ,...,w K )
wherein Q represents problem information, Q K Representing semantic information of the K-th word, BERT representing a bidirectional transducer characterizing the encoder, w K Representing a one-hot code represented by the K-th word.
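For illustration, question encoding can be sketched with the HuggingFace transformers library as follows; the checkpoint name and the use of the last hidden states are assumptions, not a prescription of the patent:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_question(question: str) -> torch.Tensor:
    """Q = {q_1, ..., q_K} = BERT(w_1, ..., w_K): one contextual vector per question token."""
    inputs = tokenizer(question, return_tensors="pt")
    outputs = bert(**inputs)              # would be fine-tuned together with the rest of the model
    return outputs.last_hidden_state      # shape (1, K, 768)

# e.g. Q = encode_question("What is the license plate number of the car?")
```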
S4, obtaining the overlapping relation between the position information of the object and the position information of the text by using an OCR-object graph convolution network according to the position information of the object and the position information of the text;
in this embodiment, in the real-world example, if the text in the picture and the object overlap, there is a relationship between the two, so an OCR-object graph convolution network is defined according to the overlap relationship between the position of the object in the picture and the position of the text, and the OCR-object graph convolution network includes three sub-graphs: an OCR-OCR map, an object-object map, and an OCR-object map. Considering the relationship between the OCR inner parts, the relationship between the objects is at the semantic level, and it is assumed that there is a semantic relationship between the two, so the application defines the OCR-OCR graph and the object-object graph as an identity matrix. The OCR-object graph can be defined according to an overlapping relation, if the ith text and the jth object have an overlapping relation, the corresponding A is carried out i,j 1, if there is no overlapping relationship between the two, it corresponds to A i,j Is 0. The final OCR-object graph convolution network A is thus defined as follows:
Figure BDA0002553991820000121
wherein A represents an OCR-object correlation diagram, A v Representing object-object diagram, A o Represented as an OCR-OCR graph, A r Represented as an OCR-object graph, A rT Is represented by A r The conjugate transpose of (c).
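A minimal sketch of how such an association graph could be assembled from the detected bounding boxes is shown below (PyTorch assumed; the function names and the exact overlap test are illustrative):

```python
import torch

def boxes_overlap(a, b):
    """True if two [x1, y1, x2, y2] boxes overlap spatially."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_association_graph(obj_boxes, ocr_boxes):
    """A = [[A_v, A_r^T], [A_r, A_o]] with node order [objects; OCR tokens]."""
    n, m = len(obj_boxes), len(ocr_boxes)
    A_v = torch.eye(n)                       # object-object sub-graph (identity)
    A_o = torch.eye(m)                       # OCR-OCR sub-graph (identity)
    A_r = torch.zeros(m, n)                  # OCR-object sub-graph
    for i, text_box in enumerate(ocr_boxes):         # i-th detected text
        for j, obj_box in enumerate(obj_boxes):      # j-th detected object
            if boxes_overlap(text_box, obj_box):
                A_r[i, j] = 1.0              # overlap => a relation exists
    top = torch.cat([A_v, A_r.t()], dim=1)
    bottom = torch.cat([A_r, A_o], dim=1)
    return torch.cat([top, bottom], dim=0)   # (n+m) x (n+m) association graph
```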
S5, encoding the overlapping relation by using a multi-step gated graph convolution network;
in the embodiment, in order to encode the incidence relation between the text and the object into the features, a multi-step door machine is used for constructing the graph convolution network, the single-step graph convolution network can only learn the semantic information of the adjacent nodes, but the logic in real life is possibly complex, and the single-step reasoning cannot meet the display requirement, so that the method is expanded into the multi-step graph convolution network. Node feature X of OCR-object graph 0 Expressed as:
X 0 =[F v ,F o ]
wherein, F v Characteristic of the detected object, F o Representing features of the detected text.
The calculation process of one step of the single-step gated graph convolution network is as follows:
Y = A W_g X^t

U = σ(W_1 Y + W_2 X^t)

R = σ(W_3 Y + W_4 X^t)

X̃ = tanh(W_5 Y + W_6 (R ⊙ X^t))

X^{t+1} = U ⊙ X̃ + (1 − U) ⊙ X^t

wherein A represents the OCR-object association graph; W_g, W_1, W_2, W_3, W_4, W_5 and W_6 respectively represent learnable parameters of the model; Y represents the preliminary features fused through the relation graph; σ(·) denotes the Sigmoid function and tanh(·) denotes the hyperbolic tangent function; U denotes the update gate and R denotes the reset gate; X̃ represents the feature signal passing through the reset gate; ⊙ denotes the element-wise product; X^t represents the input of the gated graph convolution network at step t, and X^{t+1} represents its output, which is also the input at step t+1.
The application abbreviates the single-step gated graph convolution network as:

X^{t+1} = GatedGCN(A, X^t)

wherein X^t represents the input of the gated graph convolution network at step t, X^{t+1} represents its output and the input at step t+1, and A represents the OCR-object association graph. In order to extend the association length of the graph, the gated graph convolution network is executed for J steps so that the association length is J; the final output feature is therefore X^J, and the fused feature F_joint is obtained after step J.
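The gated update can be sketched as a GRU-style graph layer; the following PyTorch code is an illustrative reading of the above formulas, with hypothetical dimension choices:

```python
import torch
import torch.nn as nn

class GatedGCN(nn.Module):
    """One step of the gated graph convolution over the association graph A."""
    def __init__(self, dim=768):
        super().__init__()
        self.w_g = nn.Linear(dim, dim, bias=False)            # W_g
        self.w1, self.w2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.w3, self.w4 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.w5, self.w6 = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, A, X):
        Y = A @ self.w_g(X)                                    # neighbourhood fusion Y = A W_g X^t
        U = torch.sigmoid(self.w1(Y) + self.w2(X))             # update gate
        R = torch.sigmoid(self.w3(Y) + self.w4(X))             # reset gate
        X_tilde = torch.tanh(self.w5(Y) + self.w6(R * X))      # candidate features
        return U * X_tilde + (1.0 - U) * X                     # X^{t+1}

def run_gated_gcn(layer, A, X0, J=3):
    """Execute the single-step network J times so that relations up to J hops are encoded."""
    X = X0
    for _ in range(J):
        X = layer(A, X)
    return X                                                   # X^J
```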
And S6, inputting the question information and the encoded information into a transformer network to obtain the text visual answer, and completing the text visual question answering based on concept interaction and associated semantics.
In this embodiment, because the question in text visual question answering indicates which information in the picture the model needs to pay attention to, a question-conditioned transformer network is designed. The transformer network is composed of multiple transformer layers, and the four concepts of visual image information, text information, question information and previous outputs are jointly input into the concept interaction transformer network. The specific calculation process is shown in the following formulas:

F_t^joint = [X^J, Q, P_{t-1}]

F_t^final = Transformer(F_t^joint)

wherein X^J is the fused feature obtained after finally executing step J, Q represents the encoded question features, P_{t-1} represents the outputs of the first t−1 steps, i.e. the first t−1 answer tokens, F_t^joint represents the features at step t before transformer encoding, F_t^final represents the features at step t after transformer encoding, and Transformer represents the multi-layer transformer network.
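As an illustration only, the question-conditioned concept interaction can be sketched with a standard transformer encoder; the layer count, head count and batch-first layout below are assumptions:

```python
import torch
import torch.nn as nn

class ConceptInteractionTransformer(nn.Module):
    """Jointly encodes graph features X^J, question features Q and previous outputs P_{t-1}."""
    def __init__(self, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, X_J, Q, P_prev):
        # F_t^joint = [X^J, Q, P_{t-1}]: concatenate along the sequence dimension
        F_joint = torch.cat([X_J, Q, P_prev], dim=1)
        # F_t^final = Transformer(F_t^joint)
        return self.encoder(F_joint)
```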
In this embodiment, the last hidden unit of the final feature F_t^final is taken as the decoded hidden state h_t^dec, and this hidden state is used to predict the final answer:

y_t^voc = W^voc h_t^dec

wherein y_t^voc represents the probability distribution when an answer is selected from the answer vocabulary, and W^voc represents trainable parameters in the network.
However, considering that the answer to the question is very likely to be a piece of text in the picture, a copy mechanism is introduced to help the model select the final answer directly from the detected text. Let O_joint denote the OCR part intercepted from the final encoded features; the calculation process of the answer selected in this way is as follows:

y_t^ocr = (W^ocr O_joint) h_t^dec

wherein y_t^ocr represents the probability distribution when the answer is picked from the detected text, W^ocr and W^voc respectively represent trainable parameters in the network, O_joint represents the output at the positions corresponding to the OCR tokens in the output of the gated graph convolution network, and h_t^dec represents the decoded hidden state described above.
Thus, two answer score distributions are obtained. The two distributions are concatenated, and the answer with the highest score is finally selected as the answer generated at step t:

y_t = argmax(Concat(y_t^voc, y_t^ocr))

wherein y_t represents the final answer distribution, y_t^voc represents the probability distribution when an answer is selected from the answer vocabulary, and y_t^ocr represents the probability distribution when the answer is selected from the detected text. In addition, in the training phase of the model, this embodiment uses the binary cross-entropy as the final loss function; in the inference phase, when the answer generates "END", the model stops decoding, and the sequence generated so far is the final answer to the question.

Claims (9)

1. A text visual question-answering system based on concept interaction and associated semantics is characterized by comprising an object position extraction module, a first full connection layer connected with the object position extraction module, a text information extraction module, a second full connection layer connected with the text information extraction module, an OCR-object graph convolution network respectively connected with the first full connection layer and the second full connection layer, a multi-step gated graph convolution network connected with the OCR-object graph convolution network, and a transformer network connected with the multi-step gated graph convolution network, wherein the transformer network is connected with a bidirectional transformer representation encoder BERT;
the object position extraction module is used for extracting visual features in the image and their corresponding position information by using a pre-trained Faster-RCNN region object detector model;
the first full-connection layer is used for fusing the visual features and the corresponding position information thereof to obtain the position information of the object;
the text information extraction module is used for detecting text information in the image and corresponding position information thereof by using an OCR recognition model;
the second full-connection layer is used for fusing the text information and the corresponding position information thereof to obtain the position information of the text;
the OCR-object graph convolution network is used for extracting the overlapping relation between the position information of the object and the position information of the text according to the position information of the object and the position information of the text;
the multi-step gated graph convolution network is used for encoding the overlapping relation;
the bidirectional transformer representation encoder BERT is used for extracting question information from the question text;
the transformer network is used for obtaining the text visual answer according to the question information and the encoded information;
the concept interaction is to jointly input the four concepts of visual image information, text information, question information and previous outputs into a concept interaction transformer network, and the specific calculation process is shown in the following formulas:

F_t^joint = [X^J, Q, P_{t-1}]

F_t^final = Transformer(F_t^joint)

wherein X^J is the fused feature obtained after finally executing step J, Q represents the encoded question features, P_{t-1} represents the outputs of the first t−1 steps, i.e. the first t−1 answer tokens, F_t^final represents the features at step t after transformer encoding, Transformer represents the multi-layer transformer network, and F_t^joint represents the features at step t before transformer encoding;
the multi-step gated graph convolution network is formed by executing a single-step gated graph convolution network for J steps;
the expression of the single-step gated graph convolution network is as follows:

X^{t+1} = GatedGCN(A, X^t)

wherein X^t represents the input of the gated graph convolution network at step t, X^{t+1} represents the output of the gated graph convolution network at step t and the input of the gated graph convolution network at step t+1, A represents the OCR-object association graph, and GatedGCN represents the single-step gated graph convolution network.
2. The text visual question-answering system based on concept interaction and associated semantics of claim 1, wherein the OCR-object graph convolution network comprises an OCR-OCR subgraph, an object-object subgraph and an OCR-object subgraph;
the OCR-OCR sub-image is used for modeling the relation between texts detected in the image and establishing the relation between any two texts;
the object-object subgraph is used for modeling the relationship between the objects detected in the image and establishing the relationship between any two objects;
and the OCR-object subgraph is used for modeling the relation between the text and the object to obtain the overlapping relation between the position information of the object and the position information of the text.
3. A text visual question-answering method based on concept interaction and associated semantics is characterized by comprising the following steps:
s1, extracting visual features and corresponding position information in the image by using a pre-trained Faster-RCNN region object detector model, and fusing the visual features and the corresponding position information by using a first full-connection layer to obtain position information of an object;
s2, detecting text information in the image and corresponding position information thereof by using an OCR recognition model, and fusing the text information and the corresponding position information thereof by using a second full-connection layer to obtain the position information of the text;
s3, extracting question information of the question text by using a bidirectional transformer representation encoder BERT;
s4, obtaining the overlapping relation between the position information of the object and the position information of the text by utilizing an OCR-object graph convolution network according to the position information of the object and the position information of the text;
s5, encoding the overlapping relation by using a multi-step gated graph convolution network;
s6, inputting the question information and the encoded information into a transformer network to obtain a text visual answer, and completing the text visual question answering based on concept interaction and associated semantics; wherein,
the concept interaction is that the four concepts of visual image information, text information, question information and previous outputs are jointly input into a concept interaction transformer network, and the specific calculation process is shown in the following formulas:

F_t^joint = [X^J, Q, P_{t-1}]

F_t^final = Transformer(F_t^joint)

wherein X^J is the fused feature obtained after finally executing step J, Q represents the encoded question features, P_{t-1} represents the outputs of the first t−1 steps, i.e. the first t−1 answer tokens, F_t^final represents the features at step t after transformer encoding, Transformer represents the multi-layer transformer network, and F_t^joint represents the features at step t before transformer encoding;
the multi-step gated graph convolution network in the step S5 is formed by executing a single-step gated graph convolution network for J steps;
the expression of the single-step gated graph convolution network is as follows:

X^{t+1} = GatedGCN(A, X^t)

wherein X^t represents the input of the gated graph convolution network at step t, X^{t+1} represents the output of the gated graph convolution network at step t and the input of the gated graph convolution network at step t+1, A represents the OCR-object association graph, and GatedGCN represents the single-step gated graph convolution network.
4. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression fused in the step S1 is as follows:
F_v = LayerNorm(W_v V + W_b^f B_v)

wherein F_v represents the fused feature matrix, LayerNorm represents the layer normalization operation, W_v and W_b^f respectively represent parameters of the neural network that need to be updated during training, V represents the visual features of the detected objects, and B_v represents the position information of the detected object bounding boxes.
5. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein said step S2 comprises the steps of:
s201, extracting text information of letter level in each detection image by using a FastText rapid text model;
s202, extracting appearance information of the text in each detection image by using a Faster-RCNN region object detector;
s203, detecting text information at letter level and position information corresponding to appearance information in the text by using an OCR model;
and S204, fusing the text information at the letter level, the appearance information in the text and the position information by using a second full-connection layer to obtain the position information of the text.
6. The text visual question-answering method based on concept interaction and associated semantics of claim 5, wherein the expression of the position information of the text in the step S204 is as follows:
F_o = LayerNorm(W_o^ft O_ft + W_o^fr O_fr + W_b^o B_o)

wherein F_o represents the position information of the text, LayerNorm represents the layer normalization operation, W_o^ft, W_o^fr and W_b^o respectively represent trainable parameters of the model, O_ft represents the textual information of the detected text, O_fr represents the visual information of the detected text, and B_o represents the position information of the detected text bounding boxes.
7. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the question information of the question text extracted in the step S3 is as follows:
Q = {q_1, ..., q_K} = BERT(w_1, ..., w_K)

wherein Q represents the question information, q_K represents the semantic information of the K-th word, BERT represents the bidirectional transformer representation encoder, and w_K represents the one-hot encoding of the K-th word.
8. The text visual question answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the OCR-object graph convolutional network in the step S4 is as follows:
A = [[A_v, A_r^T], [A_r, A_o]]

wherein A represents the OCR-object association graph (a block matrix), A_v represents the object-object sub-graph, A_o represents the OCR-OCR sub-graph, A_r represents the OCR-object sub-graph, and A_r^T represents the transpose of A_r.
9. The text visual question-answering method based on concept interaction and associated semantics of claim 3, wherein the expression of the text visual answer in the S6 is as follows:
y_t^voc = W^voc h_t^dec

y_t^ocr = (W^ocr O_joint) h_t^dec

y_t = argmax(Concat(y_t^voc, y_t^ocr))

wherein y_t represents the text visual answer, y_t^voc represents the probability distribution over answers selected from the answer vocabulary, y_t^ocr represents the probability distribution over answers picked from the detected text, W^voc and W^ocr respectively represent trainable parameters in the network, O_joint represents the output at the positions corresponding to the OCR tokens in the output of the multi-step gated graph convolution network, h_t^dec represents the last hidden unit in the output of the multi-step gated graph convolution network, and Concat represents the concatenation operation.
CN202010584144.1A 2020-06-24 2020-06-24 Text visual question-answering system and method based on concept interaction and associated semantics Active CN111858882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010584144.1A CN111858882B (en) 2020-06-24 2020-06-24 Text visual question-answering system and method based on concept interaction and associated semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010584144.1A CN111858882B (en) 2020-06-24 2020-06-24 Text visual question-answering system and method based on concept interaction and associated semantics

Publications (2)

Publication Number Publication Date
CN111858882A CN111858882A (en) 2020-10-30
CN111858882B true CN111858882B (en) 2022-08-09

Family

ID=72988448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010584144.1A Active CN111858882B (en) 2020-06-24 2020-06-24 Text visual question-answering system and method based on concept interaction and associated semantics

Country Status (1)

Country Link
CN (1) CN111858882B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220171980A1 (en) * 2020-12-02 2022-06-02 Sap Se Detecting The Same Type of Objects in Images Using Machine Learning Models

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308080B (en) * 2020-11-05 2023-05-30 南强智视(厦门)科技有限公司 Directional visual understanding and segmentation-oriented image description prediction method
CN112507727A (en) * 2020-11-18 2021-03-16 北京科技大学 Text visual question-answering system and method based on text
CN112800191B (en) * 2020-12-31 2023-01-17 科大讯飞股份有限公司 Question and answer method and device based on picture and computer readable storage medium
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113360699B (en) * 2021-06-30 2023-09-26 北京百度网讯科技有限公司 Model training method and device, and image question-answering method and device
CN113609355B (en) * 2021-07-15 2022-06-03 哈尔滨理工大学 Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning
CN113807218B (en) * 2021-09-03 2024-02-20 科大讯飞股份有限公司 Layout analysis method, device, computer equipment and storage medium
CN114707017B (en) * 2022-04-20 2023-05-23 北京百度网讯科技有限公司 Visual question-answering method, visual question-answering device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
CN109255359A (en) * 2018-09-27 2019-01-22 南京邮电大学 A kind of vision question and answer problem-solving approach based on Complex Networks Analysis method
CN109671125A (en) * 2018-12-17 2019-04-23 电子科技大学 A kind of GAN network model that height merges and the method for realizing text generation image
CN110019749A (en) * 2018-09-28 2019-07-16 北京百度网讯科技有限公司 Generate method, apparatus, equipment and the computer-readable medium of VQA training data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
US20200134398A1 (en) * 2018-10-29 2020-04-30 Sri International Determining intent from multimodal content embedded in a common geometric space

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
CN109255359A (en) * 2018-09-27 2019-01-22 南京邮电大学 A kind of vision question and answer problem-solving approach based on Complex Networks Analysis method
CN110019749A (en) * 2018-09-28 2019-07-16 北京百度网讯科技有限公司 Generate method, apparatus, equipment and the computer-readable medium of VQA training data
CN109671125A (en) * 2018-12-17 2019-04-23 电子科技大学 A kind of GAN network model that height merges and the method for realizing text generation image

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering; Xiangpeng Li; 《The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)》; 20190731; pp. 8658-8665 *
Hierarchical LSTMs with Adaptive Attention for Visual Captioning; Lianli Gao; 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》; 20200531; Vol. 42 (No. 5); pp. 1112-1131 *
Matching Article Pairs with Graphical Decomposition and Convolutions; Bang Liu; 《Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics》; 20190101; full text *
Self-Supervised Video Hashing with Hierarchical Binary Auto-encoder; Jingkuan Song; 《TRANSACTIONS ON IMAGE PROCESSING》; 20180730; pp. 1-12 *
Stacked Attention Networks for Image Question Answering; Zichao Yang; 《2016 CVPR》; 20161212; pp. 21-29 *
Query Concept Interaction over Time; Tobias Färdig; 《Master of Science Thesis in the Programme Computer Science: Algorithms Languages and Logic》; 2012; full text *
Research on visual question answering technology based on deep learning; 白林亭 et al.; 《Aeronautical Computing Technique》; 20180925 (No. 05); full text *
Research on e-Learning deep learning from the perspective of learning science; 段金菊; 《Journal of Distance Education》; 20130801; pp. 42-51 *
Research on visual question answering technology; 俞俊 et al.; 《Journal of Computer Research and Development》; 20180915 (No. 09); full text *
A hybrid neural network model for intent classification in human-machine dialogue; 周俊佐; 《Journal of Software》; 20190404; Vol. 30 (No. 11); pp. 3313-3325 *


Also Published As

Publication number Publication date
CN111858882A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111858882B (en) Text visual question-answering system and method based on concept interaction and associated semantics
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN110717431A (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
Das et al. Handwritten arabic numeral recognition using a multi layer perceptron
CN112036276B (en) Artificial intelligent video question-answering method
CN109783666A (en) A kind of image scene map generation method based on iteration fining
Yuan et al. 3G structure for image caption generation
CN113378676A (en) Method for detecting figure interaction in image based on multi-feature fusion
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN111598183A (en) Multi-feature fusion image description method
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN113360621A (en) Scene text visual question-answering method based on modal inference graph neural network
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
Yan et al. Multimodal feature fusion based on object relation for video captioning
Yang et al. GID-Net: Detecting human-object interaction with global and instance dependency
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
Zheng et al. Modular graph attention network for complex visual relational reasoning
CN113010712B (en) Visual question answering method based on multi-graph fusion
CN114117104B (en) Image text matching method based on multi-relation perception reasoning
CN114168769A (en) Visual question-answering method based on GAT (generic object transform) relational reasoning
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory
CN113516055A (en) Robot question-answering method for article search

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant