CN110516791B - Visual question-answering method and system based on multiple attention - Google Patents

Visual question-answering method and system based on multiple attention

Info

Publication number
CN110516791B
CN110516791B CN201910770172.XA CN201910770172A
Authority
CN
China
Prior art keywords
picture
information
region
long
question
Prior art date
Legal status
Active
Application number
CN201910770172.XA
Other languages
Chinese (zh)
Other versions
CN110516791A (en)
Inventor
Liu Wei (刘伟)
Current Assignee
Beijing Moviebook Science And Technology Co ltd
Original Assignee
Beijing Moviebook Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Moviebook Science And Technology Co ltd filed Critical Beijing Moviebook Science And Technology Co ltd
Priority to CN201910770172.XA priority Critical patent/CN110516791B/en
Publication of CN110516791A publication Critical patent/CN110516791A/en
Application granted granted Critical
Publication of CN110516791B publication Critical patent/CN110516791B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Abstract

The method first obtains the picture information and the corresponding question information of a visual question-answer to be processed and extracts the picture feature data from the picture information. The picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; the combination of question and picture is then completed by two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output. In the multiple-attention-based visual question-answering method and system provided by the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.

Description

Visual question-answering method and system based on multiple attention
Technical Field
The application relates to the field of visual question answering, in particular to a visual question answering method and system based on multiple attention.
Background
Visual question answering is a learning task that involves both computer vision and natural language processing: a computer learns from an input picture and question and outputs an answer that conforms to natural-language rules and is logically sound. Depending on the question, only some of the objects in the picture are relevant, and some questions require a degree of common-sense reasoning before they can be answered. Visual question answering therefore places higher demands on semantic understanding of the image than ordinary image captioning and faces greater challenges.
The existing models in the field of visual question answering include the deeper LSTM Q + norm I model, the VIS + LSTM model and others. Although these models answer simple, single-answer questions with relatively high accuracy, their accuracy in other respects is generally low, their structures are relatively simple, the content and form of their answers are limited, and they cannot correctly answer slightly more complex questions that require prior knowledge and simple reasoning.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to an aspect of the present application, there is provided a multi-attention-based visual question-answering method, including:
acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
extracting picture characteristic data in the picture information;
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
and communicating a second long-short-term memory network for executing semantic analysis with the first long-short-term memory network so as to carry out reasoning combination on the picture content information and the question information, and obtaining and outputting the answer of the visual question and answer to be processed.
Optionally, the extracting of the picture feature data in the picture information includes:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN;
extracting at least one region feature information in each feature region through ResNet;
and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
Optionally, before extracting the picture feature data in the picture information, the method further includes:
pre-training the R-CNN and/or ResNet based on a preset data set;
fusing each region feature of the pictures contained in the preset data set with a vector representing a real category;
and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
Optionally, the simultaneously inputting the picture feature data and the question information into a first long-short term memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture feature data by assigning attention weights includes:
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the feature data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
Optionally, the communicating the second long-term and short-term memory network for performing semantic analysis with the first long-term and short-term memory network to perform inference combination on the picture content information and the question information, so as to obtain and output an answer of the to-be-processed visual question-answer, including:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
According to another aspect of the present application, there is provided a multi-attention based visual question-answering system, comprising:
the information acquisition module is configured to acquire picture information in the visual question and answer to be processed and question information corresponding to the picture information;
a picture feature extraction module configured to extract picture feature data in the picture information;
a picture content acquisition module configured to simultaneously input the picture feature data and the question information into a first long-short time memory network based on an attention mechanism, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
and the answer output module is configured to communicate a second long-short-term memory network for performing semantic analysis with the first long-short-term memory network so as to perform reasoning combination on the picture content information and the question information, and obtain and output the answer of the to-be-processed visual question and answer.
Optionally, the picture feature extraction module is further configured to:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN;
extracting at least one region feature information in each feature region through ResNet;
and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
Optionally, the system further comprises:
a pre-training module configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
Optionally, the picture content obtaining module is further configured to:
inputting the picture feature data into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the feature data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
Optionally, the answer output module is further configured to:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
The method first obtains the picture information and the corresponding question information of a visual question-answer to be processed and extracts the picture feature data from the picture information. The picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; the combination of question and picture is then completed by the two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output.
In the multiple-attention-based visual question-answering method and system provided by the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a multi-attention based visual question answering method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a bidirectional LSTM workflow according to an embodiment of the present application;
FIG. 3 is a schematic diagram of picture information in a visual question answering according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a multi-attention-based visual question-answering system according to an embodiment of the present application.
FIG. 5 is a schematic diagram of a multi-attention based visual question-answering system in accordance with a preferred embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Fig. 1 is a flow chart of a multi-attention-based visual question answering method according to an embodiment of the present application. As shown in fig. 1, the multi-attention-based visual question answering method provided in an embodiment of the present application may include:
step S101: acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
step S102: extracting picture characteristic data in the picture information;
step S103: simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
step S104: and communicating the second long-short-term memory network for executing semantic analysis with the first long-short-term memory network to perform reasoning combination on the picture content information and the question information to obtain and output answers of the visual questions and answers to be processed.
In the method of this embodiment, the picture information and the corresponding question information of the visual question-answer to be processed are first obtained and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input together into a first long-short time memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, the combination of question and picture is completed by two bidirectional long-short time memory networks, and the answer of the visual question-answer to be processed is output.
Long Short-Term Memory (LSTM) is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. LSTM-based systems can learn tasks such as language translation, robot control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, prediction of diseases, click-through rates and stock prices, and music synthesis.
In the traditional visual question-answering model deeper LSTM Q + norm I, I denotes the extracted picture features and norm I denotes the 1024-dimensional semantic vector extracted by a CNN and normalized with the L2 norm. The CNN extracts the semantic information of the image, an LSTM extracts the text semantic information contained in the question, and the data of the two networks are fused so that the model learns the meaning of the question; the result is finally fed into a multi-layer perceptron (MLP) with a Softmax output layer to generate the answer. For example, for an input image containing 2 horses and 2 people in an outdoor scene, the image is processed by a CNN without its classification layer, the question words are fed in order into an RNN to extract the question information, the two compressed representations are fused, and the fused data are passed through the MLP to produce the output (for example, when the current question is a counting question). The model encodes the question with a 2-layer LSTM and divides the image into regions using the VGGNet model, after which the image features are L2-normalized. The image and question features are then transformed into the same feature space, fused by a dot product, and fed into a three-layer MLP with Softmax as the classifier to generate the answer. During training of the model, the CNN is pre-trained and only the LSTM layers and the final classification network participate in training.
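A minimal sketch of this kind of baseline follows. It is illustrative only and not part of the patent: the layer sizes, the element-wise dot-product fusion and the class name BaselineVQA are assumptions, and a PyTorch-style implementation is used purely for concreteness.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BaselineVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024,
                 img_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                                     batch_first=True)          # 2-layer question encoder
        self.img_proj = nn.Linear(img_dim, hidden_dim)           # map image into the same space
        self.mlp = nn.Sequential(                                # three-layer MLP classifier
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) CNN feature, L2-normalised ("norm I")
        img_feat = F.normalize(img_feat, p=2, dim=-1)
        q = self.embed(question_tokens)                          # (B, T, embed_dim)
        _, (h, _) = self.question_lstm(q)
        q_feat = h[-1]                                           # (B, hidden_dim)
        fused = self.img_proj(img_feat) * q_feat                 # element-wise (dot-product) fusion
        return F.log_softmax(self.mlp(fused), dim=-1)            # answer distribution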
The basic structure of the VIS + LSTM model is to first extract the picture information with a CNN and then generate a prediction with an LSTM. However, because there is no mature standard for evaluating the accuracy of sentence-form answers, its authors restrict their attention to limited-domain questions that can be answered with a single word; the visual question-answer can then be treated as a multi-class classification problem, and the answers can be measured with existing accuracy metrics.
The overall accuracy of the models mentioned above is not high, and the content and form of their answers are relatively limited.
In the present embodiment, the first long-short time memory network based on an attention mechanism (hereinafter referred to as the attention LSTM) mainly trains the LSTM to learn the input sequence selectively, so that when the model produces an output it selectively focuses on the relevant information in the input. The second long-short time memory network for performing semantic analysis (hereinafter referred to as the semantic LSTM) is mainly used to mine and learn deep concepts of text, pictures and the like within the LSTM, which helps avoid the vanishing-gradient problem. Both the attention LSTM and the semantic LSTM are preferably bidirectional LSTMs; they communicate with each other and cooperate to complete the combination of picture and question in the visual question-answer, so that the answer to the question is output accurately and quickly.
Generally, in visual question answering a picture and a natural-language question are input first, the picture information is then attended to according to the question, and a natural-language answer is generated and output. Therefore, when solving the visual question-answer, step S101 above is executed first to obtain the picture information and question information of the visual question-answer.
Further, step S102 may be executed to analyse the picture information and extract the feature data. Optionally, this may include: extracting at least one feature region from the picture information with the R-CNN; extracting at least one item of region feature information in each feature region through ResNet; and then, for each feature region, screening the region feature information according to the overlap degree IoU and average-pooling the screened region feature information to obtain the feature data of each feature region.
The R-CNN is used for finding the interested characteristic region in the picture information and further extracting the regional characteristic information in the extracted characteristic region by using ResNet.
R-CNN stands for Region-CNN; it was the first algorithm to successfully apply deep learning to object detection. R-CNN implements object detection on the basis of algorithms such as convolutional neural networks (CNN), linear regression and support vector machines (SVM).
ResNet, the deep residual network, differs from an ordinary network in that it introduces skip connections, so that the information of a preceding residual block can flow unimpeded into the next residual block. This improves the flow of information and avoids the vanishing-gradient and degradation problems caused by excessively deep networks.
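A minimal sketch of such a residual block with a skip connection is given below; it is provided for illustration only and is not taken from the patent, and the channel count and layer choices are assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # skip ("jump") connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # add the block's input back in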
After the region feature information of each feature region has been extracted, the region feature information of each feature region can be screened according to the overlap degree IoU. IoU, short for Intersection over Union, is a standard for measuring the accuracy of detecting the corresponding objects in a particular data set. The feature data in a region are filtered by comparing the IoU-based value with a preset threshold, which further narrows down the picture information. Finally, the screened features are represented by average-pooled convolutional features to obtain the region feature data of each feature region; in addition, the region feature data can be concatenated to obtain a feature concatenation map of the picture information in the question-answering system, which serves as the output of the R-CNN. A sketch of this screening and pooling step follows.
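The sketch below illustrates IoU-based screening followed by average pooling under stated assumptions: the box format (x1, y1, x2, y2), the threshold value of 0.5 and the helper names iou and region_feature are hypothetical and are not specified by the patent.

import torch

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) floats."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def region_feature(region_box, candidate_boxes, candidate_feats, threshold=0.5):
    """Keep the candidate features whose IoU with the region exceeds the threshold,
    then average-pool them into a single feature vector for the region."""
    kept = [f for b, f in zip(candidate_boxes, candidate_feats)
            if iou(region_box, b) >= threshold]
    if not kept:                               # fall back to all candidates if none pass
        kept = list(candidate_feats)
    return torch.stack(kept).mean(dim=0)       # average pooling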
In an optional embodiment of the invention, before feature extraction with the R-CNN and ResNet, the R-CNN and ResNet can be pre-trained so that regions of interest are fused with their possible classes. This specifically includes: pre-training the R-CNN and/or ResNet on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing its real category; and passing the fused vector to the fully connected layer output of the R-CNN and/or ResNet for softmax classification into attribute and non-attribute classes.
In this embodiment, the parameters of the models can be optimized by pre-training the R-CNN and ResNet, so that the models can find the regions of interest. The preset data set is preferably COCO, short for Common Objects in Context, a data set that can be used for image recognition. The images in the MS COCO data set are divided into training, validation and test sets. COCO collects images by searching a search engine for 80 object classes and various scene types. The COCO data set currently has 3 annotation types: object instances, object keypoints and image captions, stored in JSON files. Compared with existing models such as the deeper LSTM Q + norm I model and the VIS + LSTM model, the present method achieves higher accuracy in question answering.
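One plausible realisation of the pre-training head described above is sketched below; it is an assumption rather than the patent's exact implementation. Each region feature is fused (here, concatenated) with an embedding of its real category and passed through a fully connected layer for softmax classification over the attribute classes plus one non-attribute class; the dimensions and the name AttributeHead are hypothetical.

import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=80, num_attributes=400):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, 256)   # vector representing the real category
        # +1 output for the "non-attribute" class
        self.fc = nn.Linear(feat_dim + 256, num_attributes + 1)

    def forward(self, region_feat, true_class_idx):
        # region_feat: (B, feat_dim); true_class_idx: (B,) ground-truth class indices
        fused = torch.cat([region_feat, self.class_embed(true_class_idx)], dim=-1)
        return torch.log_softmax(self.fc(fused), dim=-1)     # softmax over attribute / non-attribute classes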
After the feature data of the picture information have been obtained, the combination of question and picture can be completed with the two bidirectional LSTMs and the answer can be output. That is, in the method provided by this embodiment, the picture data may be pre-processed so that the corresponding regions in the picture are mapped to the corresponding categories; the attention-mechanism LSTM fuses words with specific regions in the picture and analyses the content of those regions; finally, the output of the attention-mechanism LSTM is passed to the semantic LSTM, which reasons over and combines the words to generate the answer to the question.
Referring to step S103 above, the picture feature data and the question information are input simultaneously into the attention-based LSTM, and the picture content information corresponding to the picture feature data is obtained by assigning attention weights. This may include inputting the picture feature data and the question information simultaneously into the attention-based LSTM, where, at each timestamp of the attention-based LSTM, its inputs comprise: the output of the semantic LSTM at the previous timestamp, the feature data of each region and the output of the attention-based LSTM at the previous timestamp, and its outputs comprise the attention weight assigned to each item of region feature data. The content information corresponding to each item of region feature data is then obtained from its attention weight.
At each timestamp, the outputs of the two LSTMs at the previous timestamp are combined with the extracted picture feature data, and every cell of the attention LSTM produces an output. As the timestamps pass, different attention weights are assigned to all the region features; these attention weights are parameters that need to be learned. The output at each timestamp is combined with the attention weights to produce the data that the semantic LSTM processes, as sketched below.
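A minimal sketch of one such attention-LSTM timestamp follows. It is an illustrative assumption, simplified to a unidirectional LSTM cell with additive attention; the dimensions and the name AttentionStep are hypothetical, and the patent itself prefers bidirectional LSTMs.

import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.w_feat = nn.Linear(feat_dim, hidden_dim)
        self.w_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.w_attn = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, sem_h_prev, attn_state_prev):
        # region_feats: (B, K, feat_dim); sem_h_prev: (B, hidden_dim)
        # attn_state_prev: (h, c) of the attention LSTM at the previous timestamp
        pooled = region_feats.mean(dim=1)                       # global view of the picture
        h, c = self.lstm(torch.cat([sem_h_prev, pooled], dim=-1), attn_state_prev)
        scores = self.w_attn(torch.tanh(self.w_feat(region_feats)
                                        + self.w_hidden(h).unsqueeze(1)))  # (B, K, 1)
        alpha = torch.softmax(scores, dim=1)                    # learned attention weights
        attended = (alpha * region_feats).sum(dim=1)            # weighted picture content info
        return attended, alpha, (h, c)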
Then, step S104 is executed: the semantic LSTM communicates with the attention LSTM to reason over and combine the picture content information and the question information, and the answer of the visual question-answer to be processed is obtained and output. This may include inputting the question information and the content information corresponding to each item of region feature data simultaneously into the semantic LSTM, where the inputs at each timestamp of the semantic LSTM comprise: the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp and a word vector from the question information. The answer to the question is then output on the basis of the semantic LSTM: the computation result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer of the visual question-answer to be processed.
The hidden-layer output of the attention LSTM contains the attention weights; it is the result of combining the attention weights with the feature data output by the attention LSTM at the current timestamp.
The word vectors of the question are read until a "?" is detected, which marks the end of the sentence. The question is in essence a word-vector matrix in which each word vector corresponds to the one-hot encoding of a word; the word vectors are randomly generated rather than pre-trained.
The attention LSTM and the semantic LSTM referred to in steps S103 and S104 above are bidirectional. As shown in fig. 2, the workflow of the two bidirectional LSTMs over the time series t = {1, 2, …, n} may include:
S1-1: at time t1, the region feature data are input into the attention LSTM at time t1; the output of the attention LSTM at time t1 and the first word vector in the question information are then input into the semantic LSTM at time t1;
S1-2: at time t2, the outputs of the two LSTMs at time t1 and the region feature data are input into the attention LSTM at time t2; the output of the attention LSTM at time t2, the output of the semantic LSTM at time t1 and the second word vector in the question information are then input into the semantic LSTM at time t2;
……
S1-n: at time tn, the outputs of the two LSTMs at time tn-1 and the region feature data are input into the attention LSTM at time tn; the output of the attention LSTM at time tn, the output of the semantic LSTM at time tn-1 and the nth word vector in the question information (the word vector ending the sentence with "?") are then input into the semantic LSTM at time tn.
Finally, the semantic LSTM at time tn outputs the answer to the question.
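The whole workflow above can be summarised in the following sketch, which interleaves the two LSTMs over the question words and reads the answer off a softmax layer at the last timestamp. It is a simplified, hedged illustration: the patent prefers bidirectional LSTMs, whereas this sketch uses unidirectional LSTM cells, and the class name TwoLSTMVQA, the dimensions, the additive attention form and the shared answer vocabulary are assumptions.

import torch
import torch.nn as nn

class TwoLSTMVQA(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden=512, embed=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed)              # randomly initialised, no pre-training
        self.attn_lstm = nn.LSTMCell(hidden + feat_dim, hidden)        # attention LSTM
        self.sem_lstm = nn.LSTMCell(hidden + feat_dim + embed, hidden) # semantic LSTM
        self.w_f = nn.Linear(feat_dim, hidden)
        self.w_h = nn.Linear(hidden, hidden)
        self.w_a = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, vocab_size)                # softmax answer layer

    def forward(self, region_feats, question_tokens):
        # region_feats: (B, K, feat_dim); question_tokens: (B, T) word indices ending with "?"
        B, K, _ = region_feats.shape
        hidden = self.w_h.in_features
        h_a = c_a = h_s = c_s = region_feats.new_zeros(B, hidden)
        pooled = region_feats.mean(dim=1)
        for t in range(question_tokens.size(1)):                       # t = 1 .. n
            # attention LSTM: previous semantic output + region features + its own previous state
            h_a, c_a = self.attn_lstm(torch.cat([h_s, pooled], dim=-1), (h_a, c_a))
            scores = self.w_a(torch.tanh(self.w_f(region_feats) + self.w_h(h_a).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)                       # attention weights per region
            attended = (alpha * region_feats).sum(dim=1)               # picture content information
            # semantic LSTM: attention hidden output + previous semantic state + word vector
            w_t = self.word_embed(question_tokens[:, t])
            h_s, c_s = self.sem_lstm(torch.cat([h_a, attended, w_t], dim=-1), (h_s, c_s))
        probs = torch.softmax(self.classifier(h_s), dim=-1)            # softmax layer at the last timestamp
        return probs.argmax(dim=-1)                                    # highest-probability word as the answer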
For example, suppose that the picture information in the visual question-answer is as shown in fig. 3 and the question information is "What is on the bed?". The visual question answering method provided by this embodiment may then include:
S2-1: first, the picture information and the question information in the visual question-answer are acquired;
S2-2: the feature data in the picture are extracted through the R-CNN and ResNet. Several feature regions of interest in the picture are extracted through the R-CNN, for example picture region 1 (a desk) and picture region 2 (a bed); several items of region feature information are extracted in each feature region through ResNet; the features in each feature region are then screened according to the overlap degree IoU, and the screened features of each region are average-pooled to obtain the picture feature data. The R-CNN and ResNet are pre-trained;
S2-3: the picture feature data and the question information are input into the attention LSTM at the same time, and the picture content information is obtained, for example a computer, a desk lamp, a bookshelf and books in region 1, and books and medicine in region 2. The feature data contained in each region are also assigned weights, for example a weight for the books and a weight for the medicine in region 2; the output of the attention LSTM is combined with the attention weights to output the final picture content information, for example the picture content information obtained in region 2 after assigning the different weights is "books";
S2-4: the semantic LSTM communicates bidirectionally with the attention LSTM, the picture content information and the question information are reasoned over and combined, the result of the last step of the semantic LSTM is output to a softmax layer, and the word vector with the highest probability is selected as the answer and output. The question points to region 2, and the answer "books" is output.
Based on the same inventive concept, as shown in fig. 4, an embodiment of the present application further provides a multi-attention-based visual question-answering system 400, including:
an information obtaining module 410 configured to obtain picture information in the to-be-processed visual question and question information corresponding to the picture information;
a picture feature extraction module 420 configured to extract picture feature data in the picture information;
a picture content acquiring module 430 configured to simultaneously input the picture feature data and the question information into an attention mechanism-based LSTM, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
an answer output module 440 configured to communicate the semantic LSTM with the attention LSTM to perform reasoning combination on the picture content information and the question information to obtain and output an answer of the to-be-processed visual question-answer.
In an optional embodiment of the present invention, the picture feature extraction module 420 is further configured to:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; and for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region.
In an alternative embodiment of the present invention, as shown in fig. 5, the system may further include:
a pre-training module 450 configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
In an optional embodiment of the present invention, the picture content acquiring module 430 is further configured to:
inputting the picture characteristic data into an LSTM based on an attention mechanism; wherein, at each timestamp of attention LSTM, its inputs include: outputting semantic LSTM of the last timestamp, feature data of each region and attention LSTM of the last timestamp; the output of which comprises: assigning an attention weight to each of the region feature data;
content information corresponding to each of the area feature data is acquired based on each of the area feature data attention weights.
In an optional embodiment of the present invention, the answer output module 440 is further configured to:
simultaneously inputting the question information and the content information corresponding to the characteristic data of each area into a semantic LSTM; wherein, the input of each timestamp of the semantic LSTM comprises: attention is paid to hidden layer output of LSTM, output of a timestamp on semantic LSTM and a word vector in question information;
and outputting answers aiming at the questions based on the semantic LSTM, outputting the calculation results to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
The embodiment of the application combines two fields of Computer Vision (CV) and Natural Language Processing (NLP) and provides a visual question-answering method and system based on multiple attention.
In the multiple-attention-based visual question-answering method and system provided by the embodiments of the application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers; the end-to-end memory network extracts the question information and fuses prior knowledge related to the question, which improves the overall accuracy of question answering.
An embodiment of the present application also provides a computing device, referring to fig. 6, comprising a memory 620, a processor 610 and a computer program stored in the memory 620 and executable by the processor 610; the computer program is stored in a space 630 for program code in the memory 620 and, when executed by the processor 610, implements the method steps 631 according to the application.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 7, the computer readable storage medium comprises a storage unit for program code provided with a program 631' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. Which, when run on a computer, causes the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A visual question-answering method based on multiple attention comprises the following steps:
acquiring picture information in a visual question and answer to be processed and question information corresponding to the picture information;
extracting picture characteristic data in the picture information;
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weight;
communicating a second long-short-term memory network for executing semantic analysis with the first long-short-term memory network so as to carry out reasoning combination on the picture content information and the question information, and obtaining and outputting an answer of the to-be-processed visual question and answer;
the extracting of the picture feature data in the picture information includes:
extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region;
the step of simultaneously inputting the picture characteristic data and the question information into a first long-short term memory network based on an attention mechanism, and acquiring picture content information corresponding to the picture characteristic data by distributing attention weights includes:
simultaneously inputting the picture characteristic data and the question information into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the characteristic data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data; and acquiring the content information corresponding to each item of region feature data based on its attention weight.
2. The method according to claim 1, wherein before extracting the picture feature data in the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet based on a preset data set;
fusing each region feature of the pictures contained in the preset data set with a vector representing a real category;
and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
3. The method according to claim 1, wherein the communicating the second long-term and short-term memory network for performing semantic analysis with the first long-term and short-term memory network to perform inference combination on the picture content information and the question information to obtain and output the answer of the to-be-processed visual question-answer comprises:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
4. A multi-attention based visual question-answering system comprising:
the information acquisition module is configured to acquire picture information in the visual question and answer to be processed and question information corresponding to the picture information;
a picture feature extraction module configured to extract picture feature data in the picture information;
a picture content acquisition module configured to simultaneously input the picture feature data and the question information into a first long-short time memory network based on an attention mechanism, and acquire picture content information corresponding to the picture feature data by assigning attention weights;
the answer output module is configured to communicate a second long-short-term memory network for performing semantic analysis with the first long-short-term memory network so as to perform reasoning combination on the picture content information and the question information to obtain and output an answer of the to-be-processed visual question and answer;
the picture feature extraction module further configured to: extracting at least one characteristic region in the picture information by using a regional convolutional neural network R-CNN; extracting at least one region feature information in each feature region through ResNet; for each characteristic region, performing characteristic screening on the region characteristic information in the characteristic region according to the overlapping degree IoU, and performing average pooling on the screened region characteristic information to further obtain region characteristic data of each characteristic region;
the picture content acquisition module is further configured to:
inputting the picture feature data into a first long-short time memory network based on an attention mechanism; wherein, at each timestamp of the first long-short time memory network, the inputs comprise: the output of the second long-short time memory network at the previous timestamp, the characteristic data of each region and the output of the first long-short time memory network at the previous timestamp, and the outputs comprise: an attention weight assigned to each item of region feature data;
acquiring the content information corresponding to each item of region feature data based on its attention weight.
5. The system of claim 4, further comprising:
a pre-training module configured to pre-train the R-CNN and/or ResNet based on a preset data set; fusing each region feature of the pictures contained in the preset data set with a vector representing a real category; and transmitting the fused vector to the full connection layer output of the R-CNN and/or ResNet for softmax classification of attribute classes and non-attribute classes.
6. The system of claim 4, wherein the answer output module is further configured to:
simultaneously inputting the question information and content information corresponding to the regional characteristic data into a second long-term and short-term memory network for performing semantic analysis; wherein the input of each timestamp of the second long and short term memory network comprises: a hidden layer output of the first long-short time memory network, the output of the second long-short time memory network at the previous timestamp and a word vector in the question information;
and outputting answers aiming at the questions based on the second long-short time memory network, outputting a calculation result to a softmax layer, and selecting the word vector with the maximum probability as the answer of the to-be-processed visual question-answer for outputting.
CN201910770172.XA 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention Active CN110516791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Publications (2)

Publication Number Publication Date
CN110516791A CN110516791A (en) 2019-11-29
CN110516791B 2022-04-22

Family

ID=68627077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770172.XA Active CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Country Status (1)

Country Link
CN (1) CN110516791B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032535A (en) * 2019-12-24 2021-06-25 中国移动通信集团浙江有限公司 Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN113590770B (en) * 2020-04-30 2024-03-08 北京京东乾石科技有限公司 Response method, device, equipment and storage medium based on point cloud data
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112559877A (en) * 2020-12-24 2021-03-26 齐鲁工业大学 CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113283246B (en) * 2021-06-15 2024-01-30 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113590879B (en) * 2021-08-05 2022-05-31 哈尔滨理工大学 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881427B (en) * 2023-09-05 2023-12-01 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909329B2 (en) * 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN107766447B (en) * 2017-09-25 2021-01-12 浙江大学 Method for solving video question-answer by using multilayer attention network mechanism
US10592767B2 (en) * 2017-10-27 2020-03-17 Salesforce.Com, Inc. Interpretable counting in visual question answering
CN108170816B (en) * 2017-12-31 2020-12-08 厦门大学 Intelligent visual question-answering method based on deep neural network
KR102039397B1 (en) * 2018-01-30 2019-11-01 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108829756B (en) * 2018-05-25 2021-10-22 杭州一知智能科技有限公司 Method for solving multi-turn video question and answer by using hierarchical attention context network
CN109766427B (en) * 2019-01-15 2021-04-06 重庆邮电大学 Intelligent question-answering method based on collaborative attention for virtual learning environment
CN109857909B (en) * 2019-01-22 2020-11-20 杭州一知智能科技有限公司 Method for solving video conversation task by multi-granularity convolution self-attention context network
CN109829049B (en) * 2019-01-28 2021-06-01 杭州一知智能科技有限公司 Method for solving video question-answering task by using knowledge base progressive space-time attention network
CN109902164B (en) * 2019-03-06 2020-12-18 杭州一知智能科技有限公司 Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110110043B (en) * 2019-04-11 2023-04-11 中山大学 Multi-hop visual problem reasoning model and reasoning method thereof

Also Published As

Publication number Publication date
CN110516791A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516791B (en) Visual question-answering method and system based on multiple attention
Arras et al. Evaluating recurrent neural network explanations
US11734375B2 (en) Automatic navigation of interactive web documents
Bahreini et al. Towards real-time speech emotion recognition for affective e-learning
KR102040400B1 (en) System and method for providing user-customized questions using machine learning
CN107066464A (en) Semantic Natural Language Vector Space
US20200065394A1 (en) Method and system for collecting data and detecting deception of a human using a multi-layered model
US11881010B2 (en) Machine learning for video analysis and feedback
Nasir et al. What if social robots look for productive engagement? Automated assessment of goal-centric engagement in learning applications
US11645561B2 (en) Question answering system influenced by user behavior and text metadata generation
US11294884B2 (en) Annotation assessment and adjudication
US11188517B2 (en) Annotation assessment and ground truth construction
Pugh et al. Do speech-based collaboration analytics generalize across task contexts?
Ikawati et al. Student behavior analysis to predict learning styles based felder silverman model using ensemble tree method
Verma et al. Web application implementation with machine learning
CN113627194B (en) Information extraction method and device, and communication message classification method and device
Mbunge et al. Diverging hybrid and deep learning models into predicting students’ performance in smart learning environments–a review
US10616532B1 (en) Behavioral influence system in socially collaborative tools
CN117172978B (en) Learning path information generation method, device, electronic equipment and medium
CN114398556A (en) Learning content recommendation method, device, equipment and storage medium
CN114037545A (en) Client recommendation method, device, equipment and storage medium
Mauk What do the qualitative data mean
Malahina et al. Teachable Machine: Real-Time Attendance of Students Based on Open Source System
Ahmed et al. Visual sentiment prediction with transfer learning and big data analytics for smart cities
KR102624135B1 (en) Artificial intelligence-based non-face-to-face programming training automation platform service provision method, device and system for enterprises

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Visual Q&A Method and System Based on Multiple Attention

Effective date of registration: 20230713

Granted publication date: 20220422

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278

PE01 Entry into force of the registration of the contract for pledge of patent right