CN110516791B - Visual question-answering method and system based on multiple attention - Google Patents

Visual question-answering method and system based on multiple attention

Info

Publication number
CN110516791B
CN110516791B CN201910770172.XA CN201910770172A
Authority
CN
China
Prior art keywords
picture
information
region
long
question
Prior art date
Legal status
Active
Application number
CN201910770172.XA
Other languages
Chinese (zh)
Other versions
CN110516791A (en)
Inventor
Liu Wei (刘伟)
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd filed Critical Beijing Moviebook Science And Technology Co ltd
Priority to CN201910770172.XA priority Critical patent/CN110516791B/en
Publication of CN110516791A publication Critical patent/CN110516791A/en
Application granted granted Critical
Publication of CN110516791B publication Critical patent/CN110516791B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Abstract

The method first obtains the picture information and the corresponding question information of a visual question-answer to be processed and extracts the picture feature data from the picture information. The picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; the combination of question and picture is then completed by two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output. In the multiple-attention-based visual question-answering method and system provided by the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.

Description

Visual question-answering method and system based on multiple attention
Technical Field
The application relates to the field of visual question answering, in particular to a visual question answering method and system based on multiple attention.
Background
Visual question answering is a learning task that involves both computer vision and natural language processing: a computer learns from an input picture and question and outputs an answer that conforms to natural-language rules and is logically sound. Depending on the question, only some of the objects in the picture are relevant, and some questions require a degree of common-sense reasoning before they can be answered. Visual question answering therefore places higher demands on semantic understanding of the image than ordinary image captioning and faces greater challenges.
The existing models in the field of visual question answering include the deeper LSTM Q + norm I model, the VIS + LSTM model and others. Although these models answer simple, single-answer questions with relatively high accuracy, their accuracy in other respects is generally low, their structures are relatively simple, the content and form of their answers are limited, and they cannot correctly answer slightly more complex questions that require prior knowledge and simple reasoning.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to an aspect of the present application, there is provided a multi-attention-based visual question-answering method, including:
acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
extracting picture characteristic data in the picture information;
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
and communicating a second long-short-term memory network for executing semantic analysis with the first long-short-term memory network so as to carry out reasoning combination on the picture content information and the question information, and obtaining and outputting the answer of the visual question and answer to be processed.
Optionally, the extracting of the picture feature data in the picture information includes:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN;
extracting at least one region feature information in each feature region through ResNet;
and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
Optionally, before extracting the picture feature data in the picture information, the method further includes:
pre-training the R-CNN and/or ResNet based on a preset data set;
fusing each region feature of the pictures contained in the preset data set with a vector representing a real category;
and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
Optionally, the simultaneously inputting the picture feature data and the question information into a first long-short term memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture feature data by assigning attention weights includes:
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the feature data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
Optionally, the communicating the second long-term and short-term memory network for performing semantic analysis with the first long-term and short-term memory network to perform inference combination on the picture content information and the question information, so as to obtain and output an answer of the to-be-processed visual question-answer, including:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
According to another aspect of the present application, there is provided a multi-attention based visual question-answering system, comprising:
the information acquisition module is configured to acquire picture information in the visual question and answer to be processed and question information corresponding to the picture information;
a picture feature extraction module configured to extract picture feature data in the picture information;
a picture content acquisition module configured to simultaneously input the picture feature data and the question information into a first long-short time memory network based on an attention mechanism, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
and the answer output module is configured to communicate a second long-short-term memory network for performing semantic analysis with the first long-short-term memory network so as to perform reasoning combination on the picture content information and the question information, and obtain and output the answer of the to-be-processed visual question and answer.
Optionally, the picture feature extraction module is further configured to:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN;
extracting at least one region feature information in each feature region through ResNet;
and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
Optionally, the system further comprises:
a pre-training module configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
Optionally, the picture content obtaining module is further configured to:
inputting the picture feature data into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the feature data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
Optionally, the answer output module is further configured to:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
The method first obtains the picture information and the corresponding question information of a visual question-answer to be processed and extracts the picture feature data from the picture information. The picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; the combination of question and picture is then completed by the two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output.
In the multiple-attention-based visual question-answering method and system provided by the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a multi-attention based visual question answering method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a bidirectional LSTM workflow according to an embodiment of the present application;
FIG. 3 is a schematic diagram of picture information in a visual question answering according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a multi-attention-based visual question-answering system according to an embodiment of the present application.
FIG. 5 is a schematic diagram of a multi-attention based visual question-answering system in accordance with a preferred embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Fig. 1 is a flow chart of a multi-attention-based visual question answering method according to an embodiment of the present application. As shown in fig. 1, the multi-attention-based visual question answering method provided in an embodiment of the present application may include:
step S101: acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
step S102: extracting picture characteristic data in the picture information;
step S103: simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
step S104: and communicating the second long-short-term memory network for executing semantic analysis with the first long-short-term memory network to perform reasoning combination on the picture content information and the question information to obtain and output answers of the visual questions and answers to be processed.
In the method of this embodiment, the picture information and the corresponding question information of the visual question-answer to be processed are first obtained and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, the combination of question and picture is completed by two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output.
Long Short-Term Memory (LSTM) is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. LSTM-based systems can learn tasks such as language translation, robot control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, prediction of diseases, click-through rates and stock prices, and music synthesis.
In the traditional visual question-answering model deeper LSTM Q + norm I, I denotes the extracted picture features and norm I denotes the 1024-dimensional semantic vector extracted by a CNN and normalized with the L2 norm. The CNN extracts the semantic information of the image, an LSTM extracts the text semantic information contained in the question, and the data of the two networks are fused so that the model learns the meaning of the question; the result is finally fed into a multi-layer perceptron (MLP) with a Softmax output layer to generate the answer. For example, for an input image containing 2 horses and 2 people in an outdoor scene, the image is processed by a CNN without its classification layer, the question words are fed in order into an RNN to extract the question information, the two compressed representations are fused, and the fused data are passed through the MLP to produce the output (for example, when the current question is a counting question). The model encodes the question with a 2-layer LSTM and divides the image into regions using the VGGNet model, after which the image features are L2-normalized. The image and question features are then transformed into the same feature space, fused by a dot product, and fed into a three-layer MLP with Softmax as the classifier to generate the answer. During training of the model, the CNN is pre-trained and only the LSTM layers and the final classification network participate in training.
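A minimal sketch of this kind of baseline follows. It is illustrative only and not part of the patent: the layer sizes, the element-wise dot-product fusion and the class name BaselineVQA are assumptions, and a PyTorch-style implementation is used purely for concreteness.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024,
                 img_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                                     batch_first=True)          # 2-layer question encoder
        self.img_proj = nn.Linear(img_dim, hidden_dim)           # map image into the same space
        self.mlp = nn.Sequential(                                # three-layer MLP classifier
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) CNN feature, L2-normalised ("norm I")
        img_feat = F.normalize(img_feat, p=2, dim=-1)
        q = self.embed(question_tokens)                          # (B, T, embed_dim)
        _, (h, _) = self.question_lstm(q)
        q_feat = h[-1]                                           # (B, hidden_dim)
        fused = self.img_proj(img_feat) * q_feat                 # element-wise (dot-product) fusion
        return F.log_softmax(self.mlp(fused), dim=-1)            # answer distribution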
The basic structure of the VIS + LSTM model is to first extract the picture information with a CNN and then generate a prediction with an LSTM. However, because there is no mature standard for evaluating the accuracy of sentence-form answers, its authors restrict their attention to limited-domain questions that can be answered with a single word; the visual question-answer can then be treated as a multi-class classification problem, and the answers can be measured with existing accuracy metrics.
The overall accuracy of the models mentioned above is not high, and the content and form of their answers are relatively limited.
In the present embodiment, the first long-short time memory network based on an attention mechanism (hereinafter referred to as the attention LSTM) mainly trains the LSTM to learn the input sequence selectively, so that when the model produces an output it selectively focuses on the relevant information in the input. The second long-short time memory network for performing semantic analysis (hereinafter referred to as the semantic LSTM) is mainly used to mine and learn deep concepts of text, pictures and the like within the LSTM, which helps avoid the vanishing-gradient problem. Both the attention LSTM and the semantic LSTM are preferably bidirectional LSTMs; they communicate with each other and cooperate to complete the combination of picture and question in the visual question-answer, so that the answer to the question is output accurately and quickly.
Generally, in visual question answering a picture and a natural-language question are input first, the picture information is then attended to according to the question, and a natural-language answer is generated and output. Therefore, when solving the visual question-answer, step S101 above is executed first to obtain the picture information and question information of the visual question-answer.
Further, step S102 may be executed to analyse the picture information and extract the feature data. Optionally, this may include: extracting at least one feature region from the picture information with the R-CNN; extracting at least one item of region feature information in each feature region through ResNet; and then, for each feature region, screening the region feature information according to the overlap degree IoU and average-pooling the screened region feature information to obtain the feature data of each feature region.
The R-CNN is used for finding the interested characteristic region in the picture information and further extracting the regional characteristic information in the extracted characteristic region by using ResNet.
R-CNN stands for Region-CNN; it was the first algorithm to successfully apply deep learning to object detection. R-CNN implements object detection on the basis of algorithms such as convolutional neural networks (CNN), linear regression and support vector machines (SVM).
ResNet, the deep residual network, differs from an ordinary network in that it introduces skip connections, so that the information of a preceding residual block can flow unimpeded into the next residual block. This improves the flow of information and avoids the vanishing-gradient and degradation problems caused by excessively deep networks.
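A minimal sketch of such a residual block with a skip connection is given below; it is provided for illustration only and is not taken from the patent, and the channel count and layer choices are assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # skip ("jump") connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # add the block's input back in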
After the region feature information of each feature region has been extracted, the region feature information of each feature region can be screened according to the overlap degree IoU. IoU, short for Intersection over Union, is a standard for measuring the accuracy of detecting the corresponding objects in a particular data set. The feature data in a region are filtered by comparing the IoU-based value with a preset threshold, which further narrows down the picture information. Finally, the screened features are represented by average-pooled convolutional features to obtain the region feature data of each feature region; in addition, the region feature data can be concatenated to obtain a feature concatenation map of the picture information in the question-answering system, which serves as the output of the R-CNN. A sketch of this screening and pooling step follows.
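The sketch below illustrates IoU-based screening followed by average pooling under stated assumptions: the box format (x1, y1, x2, y2), the threshold value of 0.5 and the helper names iou and region_feature are hypothetical and are not specified by the patent.

import torch

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) floats."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def region_feature(region_box, candidate_boxes, candidate_feats, threshold=0.5):
    """Keep the candidate features whose IoU with the region exceeds the threshold,
    then average-pool them into a single feature vector for the region."""
    kept = [f for b, f in zip(candidate_boxes, candidate_feats)
            if iou(region_box, b) >= threshold]
    if not kept:                               # fall back to all candidates if none pass
        kept = list(candidate_feats)
    return torch.stack(kept).mean(dim=0)       # average pooling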
In an optional embodiment of the invention, before feature extraction with the R-CNN and ResNet, the R-CNN and ResNet can be pre-trained so that regions of interest are fused with their possible classes. This specifically includes: pre-training the R-CNN and/or ResNet on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing its real category; and passing the fused vector to the fully connected layer output of the R-CNN and/or ResNet for softmax classification into attribute and non-attribute classes.
In this embodiment, the parameters of the models can be optimized by pre-training the R-CNN and ResNet, so that the models can find the regions of interest. The preset data set is preferably COCO, short for Common Objects in Context, a data set that can be used for image recognition. The images in the MS COCO data set are divided into training, validation and test sets. COCO collects images by searching a search engine for 80 object classes and various scene types. The COCO data set currently has 3 annotation types: object instances, object keypoints and image captions, stored in JSON files. Compared with existing models such as the deeper LSTM Q + norm I model and the VIS + LSTM model, the present method achieves higher accuracy in question answering.
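One plausible realisation of the pre-training head described above is sketched below; it is an assumption rather than the patent's exact implementation. Each region feature is fused (here, concatenated) with an embedding of its real category and passed through a fully connected layer for softmax classification over the attribute classes plus one non-attribute class; the dimensions and the name AttributeHead are hypothetical.

import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=80, num_attributes=400):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, 256)   # vector representing the real category
        # +1 output for the "non-attribute" class
        self.fc = nn.Linear(feat_dim + 256, num_attributes + 1)

    def forward(self, region_feat, true_class_idx):
        # region_feat: (B, feat_dim); true_class_idx: (B,) ground-truth class indices
        fused = torch.cat([region_feat, self.class_embed(true_class_idx)], dim=-1)
        return torch.log_softmax(self.fc(fused), dim=-1)     # softmax over attribute / non-attribute classes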
After the feature data of the picture information have been obtained, the combination of question and picture can be completed with the two bidirectional LSTMs and the answer can be output. That is, in the method provided by this embodiment, the picture data may be pre-processed so that the corresponding regions in the picture are mapped to the corresponding categories; the attention-mechanism LSTM fuses words with specific regions in the picture and analyses the content of those regions; finally, the output of the attention-mechanism LSTM is passed to the semantic LSTM, which reasons over and combines the words to generate the answer to the question.
Referring to step S103 above, the picture feature data and the question information are input simultaneously into the attention-based LSTM, and the picture content information corresponding to the picture feature data is obtained by assigning attention weights. This may include inputting the picture feature data and the question information simultaneously into the attention-based LSTM, where, at each timestamp of the attention-based LSTM, its inputs comprise: the output of the semantic LSTM at the previous timestamp, the feature data of each region and the output of the attention-based LSTM at the previous timestamp, and its outputs comprise the attention weight assigned to each item of region feature data. The content information corresponding to each item of region feature data is then obtained from its attention weight.
At each timestamp, the outputs of the two LSTMs at the previous timestamp are combined with the extracted picture feature data, and every cell of the attention LSTM produces an output. As the timestamps pass, different attention weights are assigned to all the region features; these attention weights are parameters that need to be learned. The output at each timestamp is combined with the attention weights to produce the data that the semantic LSTM processes, as sketched below.
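A minimal sketch of one such attention-LSTM timestamp follows. It is an illustrative assumption, simplified to a unidirectional LSTM cell with additive attention; the dimensions and the name AttentionStep are hypothetical, and the patent itself prefers bidirectional LSTMs.

import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.w_feat = nn.Linear(feat_dim, hidden_dim)
        self.w_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.w_attn = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, sem_h_prev, attn_state_prev):
        # region_feats: (B, K, feat_dim); sem_h_prev: (B, hidden_dim)
        # attn_state_prev: (h, c) of the attention LSTM at the previous timestamp
        pooled = region_feats.mean(dim=1)                       # global view of the picture
        h, c = self.lstm(torch.cat([sem_h_prev, pooled], dim=-1), attn_state_prev)
        scores = self.w_attn(torch.tanh(self.w_feat(region_feats)
                                        + self.w_hidden(h).unsqueeze(1)))  # (B, K, 1)
        alpha = torch.softmax(scores, dim=1)                    # learned attention weights
        attended = (alpha * region_feats).sum(dim=1)            # weighted picture content info
        return attended, alpha, (h, c)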
Then, step S104 is executed: the semantic LSTM communicates with the attention LSTM to reason over and combine the picture content information and the question information, and the answer of the visual question-answer to be processed is obtained and output. This may include inputting the question information and the content information corresponding to each item of region feature data simultaneously into the semantic LSTM, where the inputs at each timestamp of the semantic LSTM comprise: the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp and a word vector from the question information. The answer to the question is then output on the basis of the semantic LSTM: the computation result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer of the visual question-answer to be processed.
The hidden-layer output of the attention LSTM contains the attention weights; it is the result of combining the attention weights with the feature data output by the attention LSTM at the current timestamp.
The word vectors of the question are read until a "?" is detected, which marks the end of the sentence. The question is in essence a word-vector matrix in which each word vector corresponds to the one-hot encoding of a word; the word vectors are randomly generated rather than pre-trained.
The attention LSTM and the semantic LSTM referred to in steps S103 and S104 above are bidirectional. As shown in fig. 2, the workflow of the two bidirectional LSTMs over the time series t = {1, 2, …, n} may include:
S1-1: at time t1, the region feature data are input into the attention LSTM at time t1; the output of the attention LSTM at time t1 and the first word vector in the question information are then input into the semantic LSTM at time t1;
S1-2: at time t2, the outputs of the two LSTMs at time t1 and the region feature data are input into the attention LSTM at time t2; the output of the attention LSTM at time t2, the output of the semantic LSTM at time t1 and the second word vector in the question information are then input into the semantic LSTM at time t2;
……
S1-n: at time tn, the outputs of the two LSTMs at time tn-1 and the region feature data are input into the attention LSTM at time tn; the output of the attention LSTM at time tn, the output of the semantic LSTM at time tn-1 and the nth word vector in the question information (the word vector ending the sentence with "?") are then input into the semantic LSTM at time tn.
Finally, the semantic LSTM at time tn outputs the answer to the question.
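The whole workflow above can be summarised in the following sketch, which interleaves the two LSTMs over the question words and reads the answer off a softmax layer at the last timestamp. It is a simplified, hedged illustration: the patent prefers bidirectional LSTMs, whereas this sketch uses unidirectional LSTM cells, and the class name TwoLSTMVQA, the dimensions, the additive attention form and the shared answer vocabulary are assumptions.

import torch
import torch.nn as nn

class TwoLSTMVQA(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden=512, embed=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed)              # randomly initialised, no pre-training
        self.attn_lstm = nn.LSTMCell(hidden + feat_dim, hidden)        # attention LSTM
        self.sem_lstm = nn.LSTMCell(hidden + feat_dim + embed, hidden) # semantic LSTM
        self.w_f = nn.Linear(feat_dim, hidden)
        self.w_h = nn.Linear(hidden, hidden)
        self.w_a = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, vocab_size)                # softmax answer layer

    def forward(self, region_feats, question_tokens):
        # region_feats: (B, K, feat_dim); question_tokens: (B, T) word indices ending with "?"
        B, K, _ = region_feats.shape
        hidden = self.w_h.in_features
        h_a = c_a = h_s = c_s = region_feats.new_zeros(B, hidden)
        pooled = region_feats.mean(dim=1)
        for t in range(question_tokens.size(1)):                       # t = 1 .. n
            # attention LSTM: previous semantic output + region features + its own previous state
            h_a, c_a = self.attn_lstm(torch.cat([h_s, pooled], dim=-1), (h_a, c_a))
            scores = self.w_a(torch.tanh(self.w_f(region_feats) + self.w_h(h_a).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)                       # attention weights per region
            attended = (alpha * region_feats).sum(dim=1)               # picture content information
            # semantic LSTM: attention hidden output + previous semantic state + word vector
            w_t = self.word_embed(question_tokens[:, t])
            h_s, c_s = self.sem_lstm(torch.cat([h_a, attended, w_t], dim=-1), (h_s, c_s))
        probs = torch.softmax(self.classifier(h_s), dim=-1)            # softmax layer at the last timestamp
        return probs.argmax(dim=-1)                                    # highest-probability word as the answer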
For example, suppose that the picture information in the visual question-answer is as shown in fig. 3 and the question information is "What is on the bed?". The visual question answering method provided by this embodiment may then include:
S2-1: first, the picture information and the question information in the visual question-answer are acquired;
S2-2: the feature data in the picture are extracted through the R-CNN and ResNet. Several feature regions of interest in the picture are extracted through the R-CNN, for example picture region 1 (a desk) and picture region 2 (a bed); several items of region feature information are extracted in each feature region through ResNet; the features in each feature region are then screened according to the overlap degree IoU, and the screened features of each region are average-pooled to obtain the picture feature data. The R-CNN and ResNet are pre-trained;
S2-3: the picture feature data and the question information are input into the attention LSTM at the same time, and the picture content information is obtained, for example a computer, a desk lamp, a bookshelf and books in region 1, and books and medicine in region 2. The feature data contained in each region are also assigned weights, for example a weight for the books and a weight for the medicine in region 2; the output of the attention LSTM is combined with the attention weights to output the final picture content information, for example the picture content information obtained in region 2 after assigning the different weights is "books";
S2-4: the semantic LSTM communicates bidirectionally with the attention LSTM, the picture content information and the question information are reasoned over and combined, the result of the last step of the semantic LSTM is output to a softmax layer, and the word vector with the highest probability is selected as the answer and output. The question points to region 2, and the answer "books" is output.
Based on the same inventive concept, as shown in fig. 4, an embodiment of the present application further provides a multi-attention-based visual question-answering system 400, including:
an information obtaining module 410 configured to obtain picture information in the to-be-processed visual question and question information corresponding to the picture information;
a picture feature extraction module 420 configured to extract picture feature data in the picture information;
a picture content acquiring module 430 configured to simultaneously input the picture feature data and the question information into an attention mechanism-based LSTM, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
an answer output module 440 configured to communicate the semantic LSTM with the attention LSTM to perform reasoning combination on the picture content information and the question information to obtain and output an answer of the to-be-processed visual question-answer.
In an optional embodiment of the present invention, the picture feature extraction module 420 is further configured to:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
In an alternative embodiment of the present invention, as shown in fig. 5, the system may further include:
a pre-training module 450 configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
In an optional embodiment of the present invention, the picture content acquiring module 430 is further configured to:
inputting the picture characteristic data into an LSTM based on an attention mechanism; wherein, at each timestamp of attention LSTM, its inputs include: outputting semantic LSTM of the last timestamp, feature data of each region and attention LSTM of the last timestamp; the output of which comprises: assigning an attention weight to each of the region feature data;
content information corresponding to each of the area feature data is acquired based on each of the area feature data attention weights.
In an optional embodiment of the present invention, the answer output module 440 is further configured to:
simultaneously inputting the question information and the content information corresponding to the characteristic data of each area into a semantic LSTM; wherein, the input of each timestamp of the semantic LSTM comprises: attention is paid to hidden layer output of LSTM, output of a timestamp on semantic LSTM and a word vector in question information;
and outputting answers aiming at the questions based on the semantic LSTM, outputting the calculation results to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
The embodiment of the application combines two fields of Computer Vision (CV) and Natural Language Processing (NLP) and provides a visual question-answering method and system based on multiple attention.
In the multiple-attention-based visual question-answering method and system provided by the embodiments of the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.
An embodiment of the present application also provides a computing device, referring to fig. 6, comprising a memory 620, a processor 610 and a computer program stored in the memory 620 and executable by the processor 610; the computer program is stored in a space 630 for program code in the memory 620 and, when executed by the processor 610, implements the method steps 631 according to the application.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 7, the computer readable storage medium comprises a storage unit for program code provided with a program 631' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. Which, when run on a computer, causes the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A visual question-answering method based on multiple attention comprises the following steps:
acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
extracting picture characteristic data in the picture information;
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
communicating a second long-short-term memory network for executing semantic analysis with the first long-short-term memory network so as to carry out reasoning combination on the picture content information and the question information, and obtaining and outputting an answer of the to-be-processed visual question and answer;
the extracting of the picture feature data in the picture information includes:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region;
the step of simultaneously inputting the picture characteristic data and the question information into a first long-short term memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weights includes:
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the characteristic data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data; and acquiring the content information corresponding to each item of region feature data based on its attention weight.
2. The method according to claim 1, wherein before extracting the picture feature data in the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet based on a preset data set;
fusing each region feature of the pictures contained in the preset data set with a vector representing a real category;
and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
3. The method according to claim 1, wherein the communicating the second long-term and short-term memory network for performing semantic analysis with the first long-term and short-term memory network to perform inference combination on the picture content information and the question information to obtain and output the answer of the to-be-processed visual question-answer comprises:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
4. A multi-attention based visual question-answering system comprising:
the information acquisition module is configured to acquire picture information in the visual question and answer to be processed and question information corresponding to the picture information;
a picture feature extraction module configured to extract picture feature data in the picture information;
a picture content acquisition module configured to simultaneously input the picture feature data and the question information into a first long-short time memory network based on an attention mechanism, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
the answer output module is configured to communicate a second long-short-term memory network for performing semantic analysis with the first long-short-term memory network so as to perform reasoning combination on the picture content information and the question information to obtain and output an answer of the to-be-processed visual question and answer;
the picture feature extraction module further configured to: extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region;
the picture content acquisition module is further configured to:
inputting the picture feature data into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the characteristic data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
5. The system of claim 4, further comprising:
a pre-training module configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
6. The system of claim 4, wherein the answer output module is further configured to:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
CN201910770172.XA 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention Active CN110516791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Publications (2)

Publication Number Publication Date
CN110516791A CN110516791A (en) 2019-11-29
CN110516791B 2022-04-22

Family

ID=68627077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770172.XA Active CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Country Status (1)

Country Link
CN (1) CN110516791B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032535A (en) * 2019-12-24 2021-06-25 中国移动通信集团浙江有限公司 Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN113590770B (en) * 2020-04-30 2024-03-08 北京京东乾石科技有限公司 Response method, device, equipment and storage medium based on point cloud data
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112559877A (en) * 2020-12-24 2021-03-26 齐鲁工业大学 CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113283246B (en) * 2021-06-15 2024-01-30 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113590879B (en) * 2021-08-05 2022-05-31 哈尔滨理工大学 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881427B (en) * 2023-09-05 2023-12-01 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909329B2 (en) * 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN107766447B (en) * 2017-09-25 2021-01-12 浙江大学 Method for solving video question-answer by using multilayer attention network mechanism
US10592767B2 (en) * 2017-10-27 2020-03-17 Salesforce.Com, Inc. Interpretable counting in visual question answering
CN108170816B (en) * 2017-12-31 2020-12-08 厦门大学 Intelligent visual question-answering method based on deep neural network
KR102039397B1 (en) * 2018-01-30 2019-11-01 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108829756B (en) * 2018-05-25 2021-10-22 杭州一知智能科技有限公司 Method for solving multi-turn video question and answer by using hierarchical attention context network
CN109766427B (en) * 2019-01-15 2021-04-06 重庆邮电大学 Intelligent question-answering method based on collaborative attention for virtual learning environment
CN109857909B (en) * 2019-01-22 2020-11-20 杭州一知智能科技有限公司 Method for solving video conversation task by multi-granularity convolution self-attention context network
CN109829049B (en) * 2019-01-28 2021-06-01 杭州一知智能科技有限公司 Method for solving video question-answering task by using knowledge base progressive space-time attention network
CN109902164B (en) * 2019-03-06 2020-12-18 杭州一知智能科技有限公司 Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110110043B (en) * 2019-04-11 2023-04-11 中山大学 Multi-hop visual problem reasoning model and reasoning method thereof

Also Published As

Publication number Publication date
CN110516791A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516791B (en) Visual question-answering method and system based on multiple attention
Arras et al. Evaluating recurrent neural network explanations
US11734375B2 (en) Automatic navigation of interactive web documents
Bahreini et al. Towards real-time speech emotion recognition for affective e-learning
KR102040400B1 (en) System and method for providing user-customized questions using machine learning
CN107066464A (en) Semantic Natural Language Vector Space
US20200065394A1 (en) Method and system for collecting data and detecting deception of a human using a multi-layered model
US11881010B2 (en) Machine learning for video analysis and feedback
Nasir et al. What if social robots look for productive engagement? Automated assessment of goal-centric engagement in learning applications
US11645561B2 (en) Question answering system influenced by user behavior and text metadata generation
US11294884B2 (en) Annotation assessment and adjudication
US11188517B2 (en) Annotation assessment and ground truth construction
Pugh et al. Do speech-based collaboration analytics generalize across task contexts?
Ikawati et al. Student behavior analysis to predict learning styles based felder silverman model using ensemble tree method
Verma et al. Web application implementation with machine learning
CN113627194B (en) Information extraction method and device, and communication message classification method and device
Mbunge et al. Diverging hybrid and deep learning models into predicting students’ performance in smart learning environments–a review
US10616532B1 (en) Behavioral influence system in socially collaborative tools
CN117172978B (en) Learning path information generation method, device, electronic equipment and medium
CN114398556A (en) Learning content recommendation method, device, equipment and storage medium
CN114037545A (en) Client recommendation method, device, equipment and storage medium
Mauk What do the qualitative data mean
Malahina et al. Teachable Machine: Real-Time Attendance of Students Based on Open Source System
Ahmed et al. Visual sentiment prediction with transfer learning and big data analytics for smart cities
KR102624135B1 (en) Artificial intelligence-based non-face-to-face programming training automation platform service provision method, device and system for enterprises

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Visual Q&A Method and System Based on Multiple Attention

Effective date of registration: 20230713

Granted publication date: 20220422

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278

PE01 Entry into force of the registration of the contract for pledge of patent right