CN110263912B - Image question-answering method based on multi-target association depth reasoning - Google Patents

Image question-answering method based on multi-target association depth reasoning

Info

Publication number
CN110263912B
CN110263912B CN201910398140.1A CN201910398140A
Authority
CN
China
Prior art keywords
image
question
feature
vector
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910398140.1A
Other languages
Chinese (zh)
Other versions
CN110263912A (en)
Inventor
余宙
俞俊
汪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910398140.1A priority Critical patent/CN110263912B/en
Publication of CN110263912A publication Critical patent/CN110263912A/en
Application granted granted Critical
Publication of CN110263912B publication Critical patent/CN110263912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image question-answering method based on multi-target association depth reasoning. The method comprises the following steps: 1 and 2, preprocessing the image and the natural-language question text that accompanies it, and reordering the attention over the targets with an adaptive attention module (AAM) enhanced by the geometric features of the candidate boxes; 3, constructing a neural network structure based on the AAM model; 4, model training, in which the neural network parameters are trained with the back-propagation algorithm. The invention provides a deep neural network for image question answering, in particular a method that models image-question text data in a unified way, reasons over the features of every target in the image, and reorders the attention assigned to each target so that questions are answered more accurately, achieving better results in the field of image question answering.

Description

Image question-answering method based on multi-target association depth reasoning
Technical Field
The invention relates to a deep neural network structure for the image question answering (Visual Question Answering) task, and in particular to a method that models image-question data in a unified way, explores the interactions between the entity features in an image and the geometric features of their spatial positions, and adaptively adjusts the attention weights by modeling the positional relations between them.
Background
Image question answering is an emerging task at the intersection of computer vision and natural language processing. The task is to let a machine automatically answer a question posed about a given image. Image question answering is undoubtedly more complex than image description, another task that crosses computer vision and natural language processing, because it requires the machine to understand both the image and the question and to reason its way to the correct answer. A question such as "What color are her glasses?" contains rich semantic information: to answer it, the machine needs to locate the region of the woman's eyes in the image and then answer according to the keyword "color". For another question such as "What is the beard made of?", the machine cannot directly find the beard; it has to estimate the region where the beard should be from the position of the face, attend to that region, and then answer according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep Convolutional Neural Networks (CNN) or deep Recurrent Neural Networks (RNN) has become the mainstream research direction in computer vision and natural language processing. In image question answering research, introducing the end-to-end modeling idea, modeling the image end to end with a suitable network structure, and letting the computer answer automatically from the input question and image is therefore a research problem worth exploring in depth.
It has long been recognized in computer vision that contextual information, or the associations between objects, helps improve models. However, most methods that exploit such information predate the popularity of deep learning. In the current deep learning era, little progress has been made in exploiting relational information between objects, particularly in image question answering, and most methods still attend to each entity independently. Because objects in an image vary in two-dimensional position, scale and aspect ratio, an image question-answering model needs to reason about the question by relying on the interrelations between entities. The positional information of objects, that is, their geometric features in a general sense, therefore plays a complex and important role in image question-answering models.
In terms of practical applications, image question-answering algorithms have broad application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and augmented reality technology, automatic question answering about image content based on visual perception may become an important mode of human-computer interaction in the near future. The technology can help people, especially the visually impaired, to better perceive and understand the world.
In conclusion, image question answering based on end-to-end modeling is a direction worthy of intensive research. This work approaches the task from several of its key difficulties, addresses problems in current methods, and finally forms a complete image question-answering system.
Because image content in natural scenes is complex and its subjects are diverse, and because natural language descriptions have a high degree of freedom, describing image content faces huge challenges. Specifically, there are two main difficulties:
(1) Feature extraction. This is a classic and fundamental problem in cross-media representation research. Commonly used methods include image processing feature extractors such as the Histogram of Oriented Gradients (HOG), the Local Binary Pattern (LBP) and Haar features. In addition, features extracted by deep learning models such as ResNet, GoogLeNet and Faster R-CNN have achieved excellent results in many fields, such as fine-grained image classification, natural language processing and recommendation systems. Selecting a proper strategy for cross-media feature extraction, and improving the expressiveness of the features while keeping the computation efficient, is therefore a direction worth studying in depth.
(2) How to reason about the question by relying on the interrelations between the entities in the image. The input to an image question-answering algorithm is an image, which may contain multiple target entities, together with a question. The algorithm must not only extract the features of each target entity and understand each target correctly, but also infer the relations between the targets from their geometric and visual features. How to let the algorithm automatically learn the relations among the targets in an image and form a more accurate cross-media representation is therefore a difficult problem in image question answering and a crucial link that affects the performance of the algorithm.
Disclosure of Invention
The invention provides an image question-answering method based on multi-target association depth reasoning. The invention is a deep neural network architecture for the image question answering (Visual Question Answering) task and mainly contains two points: 1. adopting image features with stronger expressiveness together with geometric information; 2. reasoning about the relations between the targets in the image from the target features.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1), data preprocessing, and feature extraction of image and text data
Firstly, preprocessing an image:
target entities contained in the images are detected using a fast-RCNN deep neural network structure. And extracting the visual features V and the geometric features G containing the target size and coordinate information in the image.
Preprocessing the text data:
counting sentence length of a given question text sets the maximum length of the question text according to the statistical information. And constructing a problem text vocabulary dictionary, replacing the words of the problem with index values in the description vocabulary dictionary, and then passing through the LSTM, thereby converting the problem text into a vector q.
Step (2): attention module enhanced with the candidate-box geometric features
The structure is shown in Fig. 1. The inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m.
First the attention weight vector m is encoded by rank: the targets are ordered by weight, the ranks are converted into vectors, mapped to a high dimension and added to the visual features V mapped to the same dimension, and the output is processed by Layer Normalization to obtain V_A.
Then the geometric features G are mapped through a linear layer followed by a ReLU activation to obtain G_R. V_A and G_R are fed into a candidate-box Relation Module for reasoning, yielding O_relation. O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention vector m̂.
Step (3): constructing the deep neural network
The structure is shown in Fig. 2. First the question text is converted into a vector of index values according to the vocabulary dictionary. This vector is mapped to a high dimension and fed into a Long Short-Term Memory network (LSTM); the output vector q is fused with the visual features V obtained from Faster R-CNN by a Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weight m, the visual features V and the geometric features G are fed into the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and reorders the attention weights to obtain the new attention vector m̂. The attention vector m̂ is fused with the visual features V by an element-wise product followed by a weighted average, giving the new visual feature v̂. The visual feature v̂ is fused with the question text vector q through a Hadamard product, and a softmax function generates the probabilities, which are output as the predicted values of the network.
Step (4), model training
The model parameters of the neural network of step (3) are trained with the back-propagation algorithm, according to the difference between the generated predictions and the ground-truth answers for the image, until the whole network model converges.
The step (1) is specifically realized as follows:
1-1. The features of an image i are extracted with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]. The visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, with g_i ∈ R^4, where x, y, w and h are the position parameters of the geometric feature and represent, respectively, the abscissa, the ordinate, the width and the height of the candidate box in which the entity is located in the image;
1-2. For a given question text, the distinct words appearing in the question texts of the data set are first counted and recorded in a dictionary. The words of a question are converted into index values according to this word dictionary, so that the question text is converted into an index vector of fixed length. The specific formula is as follows:
Q = {w_1^idx, w_2^idx, ..., w_l^idx}  (formula 1)
where w_k^idx is the index value of the word w_k in the dictionary, and l represents the length of the question text.
The deep reasoning network of the adaptive attention module enhanced with the candidate-box geometric features in step (2) is specified as follows:
2-1. The input attention weight vector m is processed first. The attention weights m = {m_1, m_2, ..., m_k} of the targets are sorted by value, and the rank pos of each target is encoded into a vector PE_pos ∈ R^d. The specific formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (formula 2)
where d is the encoding dimension, i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k]. This yields the matrix PE ∈ R^(k×d) based on the attention weights m.
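As an illustration of step 2-1, the following is a minimal sketch in Python, assuming PyTorch; the function name rank_positional_encoding and the default dimension d = 128 are illustrative assumptions rather than values fixed by the patent text.

    import torch

    def rank_positional_encoding(m, d=128):
        # m: attention weights of the k targets, shape (k,)
        k = m.shape[0]
        # Rank the targets by attention value (descending); ranks start at 1.
        order = torch.argsort(m, descending=True)
        pos = torch.empty(k, dtype=torch.float)
        pos[order] = torch.arange(1, k + 1, dtype=torch.float)
        # Sinusoidal encoding of the rank, following formula (2).
        even = torch.arange(0, d, 2, dtype=torch.float)        # the indices 2i
        div = torch.pow(10000.0, even / d)                     # 10000^(2i/d)
        pe = torch.zeros(k, d)
        pe[:, 0::2] = torch.sin(pos.unsqueeze(1) / div)
        pe[:, 1::2] = torch.cos(pos.unsqueeze(1) / div)
        return pe                                              # PE, shape (k, d)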
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added, and the output is processed by layer normalization to obtain V_A. The specific formula is as follows:
V_A = LayerNorm(W_PE · PE^T + W_V · V^T)  (formula 3)
where W_PE and W_V are the learnable parameters of the two linear layers.
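A minimal sketch of formula (3) in step 2-2, again assuming PyTorch; the projection dimension 128 follows the detailed description below, while the class and parameter names are illustrative.

    import torch.nn as nn

    class AttendedVisualEncoder(nn.Module):
        # Fuses the rank encoding PE with the visual features V (formula (3)).
        def __init__(self, d_pe=128, d_v=2048, d_a=128):
            super().__init__()
            self.w_pe = nn.Linear(d_pe, d_a)   # W_PE
            self.w_v = nn.Linear(d_v, d_a)     # W_V
            self.norm = nn.LayerNorm(d_a)

        def forward(self, pe, v):
            # pe: (k, d_pe) rank encoding; v: (k, d_v) visual features
            return self.norm(self.w_pe(pe) + self.w_v(v))   # V_A, shape (k, d_a)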
2-3. A correlation computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R. The specific formulas are as follows:
G_R = W_G · Ω(G)^T  (formula 4)
Ω(G)_{m,n} = ε_G(g_m, g_n)  (formula 5)
where m, n ∈ [1, 2, ..., k], ε_G encodes the relative geometric relation between candidate boxes m and n with the sinusoidal encoding of formula (2), so that Ω(G) ∈ R^(k×k×d_g), and W_G maps the last dimension to a single value, giving G_R ∈ R^(k×k).
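The following sketch of step 2-3 assumes PyTorch. Formula (5) only states that the pairwise box geometry is encoded with the sinusoid of formula (2); the concrete log-ratio terms used here are a common box-relation formulation and are therefore an assumption, and the dimension d_g = 64 follows the 100x100x64 matrix mentioned in the detailed description below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pairwise_box_geometry(g, eps=1e-3):
        # g: (k, 4) candidate boxes given as {x, y, w, h}.
        x, y, w, h = g[:, 0], g[:, 1], g[:, 2], g[:, 3]
        dx = torch.log((x[:, None] - x[None, :]).abs().clamp(min=eps) / w[:, None])
        dy = torch.log((y[:, None] - y[None, :]).abs().clamp(min=eps) / h[:, None])
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        return torch.stack([dx, dy, dw, dh], dim=-1)           # (k, k, 4)

    def sinusoid_embed(x, d_g=64):
        # Embeds each of the 4 relative-geometry terms with d_g/8 sine and
        # d_g/8 cosine frequencies, in the spirit of formula (2).
        n, c = x.shape
        d = d_g // (2 * c)
        freq = torch.pow(1000.0, torch.arange(d, dtype=torch.float) / d)
        angles = x.unsqueeze(-1) / freq                        # (n, 4, d)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).view(n, -1)

    class GeometricRelation(nn.Module):
        # Maps the encoded pairwise geometry to one value per box pair and
        # applies ReLU, giving G_R of shape (k, k) (formulas (4)-(5)).
        def __init__(self, d_g=64):
            super().__init__()
            self.d_g = d_g
            self.w_g = nn.Linear(d_g, 1)                       # W_G

        def forward(self, g):
            rel = pairwise_box_geometry(g)                     # (k, k, 4)
            k = rel.shape[0]
            enc = sinusoid_embed(rel.view(k * k, 4), self.d_g) # (k*k, d_g)
            return F.relu(self.w_g(enc)).view(k, k)            # G_R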
2-4. V_A and G_R are fed into the relation module for reasoning to obtain O_relation. The specific formulas are as follows:
V_R = (W_K V_A)^T · (W_Q V_A)  (formula 6)
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where W_K, W_Q and W_O are learnable linear mappings, b_O is a bias term, and V_R ∈ R^(k×k) contains the pairwise similarities of the target features.
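A sketch of the relation reasoning of step 2-4 (formulas (6) and (7)), assuming PyTorch; the key/query projections and the small constant added before the logarithm are implementation assumptions.

    import torch
    import torch.nn as nn

    class RelationModule(nn.Module):
        # Combines appearance similarity V_R with the geometric relations G_R
        # and aggregates the fused target features V_A (formulas (6)-(7)).
        def __init__(self, d_a=128, d_r=128, eps=1e-6):
            super().__init__()
            self.w_q = nn.Linear(d_a, d_r)    # W_Q
            self.w_k = nn.Linear(d_a, d_r)    # W_K
            self.w_o = nn.Linear(d_a, d_a)    # W_O (with bias b_O)
            self.eps = eps

        def forward(self, v_a, g_r):
            # v_a: (k, d_a) fused target features; g_r: (k, k) geometric relations
            v_r = self.w_q(v_a) @ self.w_k(v_a).t()            # formula (6), (k, k)
            w = torch.softmax(torch.log(g_r + self.eps) + v_r, dim=-1)
            return w @ self.w_o(v_a)                           # O_relation, (k, d_a)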
2-5. O_relation is passed through a fully connected layer and a sigmoid function and then multiplied with the original attention weights m to obtain the new attention vector m̂. The specific formula is as follows:
m̂ = m ⊙ σ(W_F · O_relation + b_F)  (formula 8)
where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_F and b_F are the parameters of the fully connected layer.
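A sketch of the reweighting of step 2-5 (formula (8)), assuming PyTorch; the fully connected layer producing one scalar per target is an assumption consistent with the 100-dimensional output described in the detailed embodiment.

    import torch
    import torch.nn as nn

    class AttentionReweighting(nn.Module):
        # Gates the original attention weights m with a score derived from
        # O_relation, producing the reordered attention m_hat (formula (8)).
        def __init__(self, d_a=128):
            super().__init__()
            self.fc = nn.Linear(d_a, 1)        # W_F, b_F

        def forward(self, m, o_relation):
            # m: (k,) original attention weights; o_relation: (k, d_a)
            gate = torch.sigmoid(self.fc(o_relation)).squeeze(-1)
            return m * gate                    # m_hat, shape (k,)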
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. The question text vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and fused with a Hadamard product. F_fusion denotes the fused feature in the common space; W_r and W_q denote the fully connected layer parameters that linearly transform the visual feature V and the current state information q, respectively; the symbol ⊙ denotes the Hadamard product of the two matrices; W_m denotes the fully connected layer parameters that reduce the dimension of the fused feature and produce the attention weight distribution, m ∈ R^k being the initial attention weight vector; j denotes the index of the region whose attention weight is currently computed. The specific formulas are as follows:
F_fusion^j = (W_r v_j) ⊙ (W_q q)  (formula 9)
m = softmax(W_m F_fusion + b_m)  (formula 10)
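A sketch of formulas (9) and (10) in step 3-1, assuming PyTorch; the common-space dimension 1024 follows the detailed description, while the class and layer names are illustrative.

    import torch
    import torch.nn as nn

    class QuestionGuidedAttention(nn.Module):
        # Fuses the question vector q with every region feature v_j by a Hadamard
        # product in a common space (formula (9)) and turns the fused features
        # into an attention distribution over the k regions (formula (10)).
        def __init__(self, d_v=2048, d_q=1024, d_c=1024):
            super().__init__()
            self.w_r = nn.Linear(d_v, d_c)     # W_r
            self.w_q = nn.Linear(d_q, d_c)     # W_q
            self.w_m = nn.Linear(d_c, 1)       # W_m, b_m

        def forward(self, v, q):
            # v: (k, d_v) region features; q: (d_q,) question vector
            f_fusion = self.w_r(v) * self.w_q(q)                         # (k, d_c)
            return torch.softmax(self.w_m(f_fusion).squeeze(-1), dim=0)  # m, (k,)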
3-2. According to step (2), m, V and G are fed into the adaptive attention module enhanced with the candidate-box geometric features, which reasons over the features V and G and reorders m to obtain the new attention feature m̂.
3-3. The new visual feature v̂ is obtained as the weighted average of the features V under the attention weights m̂. The specific formula is as follows:
v̂ = Σ_{j=1}^{k} m̂_j · v_j  (formula 11)
the training model in the step (4) is as follows:
the question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so that the same question may have different correct answers. Previous image question-answering models treated the highest ticket number as the only correct answer and one-hot encoding (one-hot encoding) it. Because the correct answers have a plurality of elements, all answers to the same question are voted, and the weight of the correct answer in all correct answers is determined according to the number of votes. And using a Kullback-Leibler divergence loss function if N represents the length of the answer vocabulary. Presect represents the predicted value distribution, and GT represents the true value. Then the definition is as shown:
Figure GDA0002820731840000073
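A sketch of the soft-label Kullback-Leibler divergence loss of formula (12), assuming PyTorch; building the target distribution GT by normalizing the vote counts is an assumption consistent with the description above.

    import torch

    def soft_answer_kld(predict_logits, answer_votes, eps=1e-9):
        # predict_logits: (N,) raw scores over the answer vocabulary
        # answer_votes:   (N,) votes each answer received for this question
        predict = torch.softmax(predict_logits, dim=-1)
        gt = answer_votes / answer_votes.sum().clamp(min=1.0)   # soft target GT
        # formula (12): sum_i GT_i * log(GT_i / Predict_i); terms with GT_i = 0 vanish.
        mask = gt > 0
        return torch.sum(gt[mask] * torch.log(gt[mask] / (predict[mask] + eps)))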
the invention has the following beneficial effects:
the invention relates to a method for uniformly modeling image-description data, reasoning on characteristics of each target in an image, and reordering attention mechanisms of each target so as to more accurately describe the image. The invention introduces the implicit geometric characteristics in the image for the first time and structures the image, so that the image and the solid characteristics in the image are subjected to cooperative reasoning, and the accuracy of the visual question-answering model can be effectively improved after the existing visual question-answering technology is combined.
The invention has a small number of parameters and is lightweight and efficient, which facilitates more efficient distributed training and deployment on dedicated hardware with limited memory.
Drawings
FIG. 1: the adaptive attention module enhanced with the candidate-box geometric features;
FIG. 2: the image question-answering neural network architecture built around the adaptive attention module enhanced with the candidate-box geometric features.
Detailed Description
The detailed parameters of the present invention are described below.
The invention provides a deep neural network framework for the image question answering (Visual Question Answering) task.
The data preprocessing and the feature extraction of the image and the text in the step (1) are specifically as follows:
1-1. For the feature extraction of the image data, we use the MS-COCO dataset as the training and test data and extract its visual features with the existing Faster R-CNN model. Specifically, the image data is input into the Faster R-CNN network, which detects the 10 to 100 targets in the image and frames each of them; a 2048-dimensional visual feature is extracted from the image region of each target, and the coordinates and size {x, y, w, h} of each target's box are recorded as its geometric feature, so that V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100].
1-2. For the question texts, the distinct words appearing in the question texts of the data set are first counted, and the 9847 words whose frequency is higher than 5 are recorded in the dictionary.
1-3. Only the first 16 words of each question sentence are kept; if a question has fewer than 16 words, it is padded with null characters. Each word is then replaced by its index value in the word dictionary built in 1-2, converting the string into numeric values, so that each question is translated into a 16-dimensional vector of word indices.
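A minimal sketch of the question preprocessing of steps 1-2 and 1-3, in plain Python; the use of index 0 for the null/padding character and the simple whitespace tokenization are illustrative assumptions.

    from collections import Counter

    def build_dictionary(questions, min_freq=5):
        # Keep the words whose frequency is higher than min_freq (9847 words here);
        # index 0 is reserved for the null/padding character.
        counts = Counter(w for q in questions for w in q.lower().split())
        kept = [w for w, c in counts.items() if c > min_freq]
        return {w: i + 1 for i, w in enumerate(kept)}

    def question_to_indices(question, dictionary, max_len=16):
        # Keep only the first 16 words and pad shorter questions with nulls.
        idx = [dictionary.get(w, 0) for w in question.lower().split()][:max_len]
        return idx + [0] * (max_len - len(idx))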
In step (2), the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features learns the associations between the target features V and the geometric features G of the image so as to reorder the input original attention information m, specifically as follows:
2-1. The input attention weight vector m is processed first: the attention values {m_1, m_2, ..., m_k} of the targets in m are sorted, the rank pos of each target is encoded, and the matrix PE based on the attention information m is obtained.
2-2. PE is mapped to 128 dimensions and added to V mapped to 128 dimensions, and the output is processed by layer normalization to obtain the matrix V_A of size 100x128.
2-3. The correlation computation on the features G first encodes them with formula (2) to obtain a matrix of size 100x100x64; the last dimension of this matrix is mapped to a single value and passed through a ReLU activation, giving the 100x100 matrix G_R.
2-4. V_A and G_R are fed into the relation module for reasoning: each target feature in V_A is first mapped to 128 dimensions, and the mapped target features are dot-multiplied with one another to obtain the 100x100 matrix V_R. The weights computed jointly from V_R and G_R form a 100x100 matrix, and the weighted average of the targets in V_A yields the 100x128 matrix O_relation.
2-5. O_relation is passed through a fully connected layer and a sigmoid, and the result is multiplied with the original m to obtain the new 100-dimensional attention vector m̂.
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. For the question text features, the input is the 16-dimensional index vector generated in step (1). A word embedding technique converts each word index into a corresponding word vector; the word vector size used is 1024, so each question text becomes a matrix of size 16x1024. The word vector of each time step is then fed into an LSTM, which is a recurrent neural network structure, and the LSTM output is set to a 1024-dimensional vector q. The input visual features are zero-padded into a matrix of 100x2048 and mapped through a linear layer into a matrix of 100x1024.
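A sketch of the question encoder of step 3-1, assuming PyTorch; the embedding and hidden sizes of 1024 and the vocabulary size of 9847 follow the text, while batching details are simplified.

    import torch
    import torch.nn as nn

    class QuestionEncoder(nn.Module):
        # Embeds the 16 word indices into 1024-d word vectors and encodes them
        # with an LSTM whose final hidden state is the 1024-d question vector q.
        def __init__(self, vocab_size=9847, d_emb=1024, d_hid=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size + 1, d_emb, padding_idx=0)
            self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True)

        def forward(self, word_idx):
            # word_idx: (batch, 16) integer word indices
            emb = self.embed(word_idx)          # (batch, 16, 1024)
            _, (h, _) = self.lstm(emb)          # h: (1, batch, 1024)
            return h.squeeze(0)                 # q: (batch, 1024)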
3-2. The LSTM output vector q is fed, together with the visual features, into the attention module to obtain a preliminary 100-dimensional attention feature m, which completes the extraction of the image attention information (Attention).
3-3. According to step (2), m, V and G are fed into the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features, which reasons over the features V and G and reorders m to obtain the new 100-dimensional attention feature m̂.
At this point, the operations of reasoning about the associations between the targets in the image and reordering the attention (Attention) are complete.
3-4. The 100-dimensional vector m̂ is used to take a weighted average of the 100x1024 features V, giving the 1024-dimensional attended visual feature v̂.
3-5. The reordered visual feature v̂, which carries the attention information, is fused with the LSTM output vector q, and an FC layer (a fully connected neural network layer) and a softmax operation are applied in turn, finally outputting a 9487-dimensional prediction vector in which each element represents the probability that the answer indexed by that element is the answer to the given question.
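A sketch of the fusion and prediction head of steps 3-4 and 3-5, assuming PyTorch; the 9487-dimensional answer vocabulary follows the text, and the single fully connected output layer is an assumption.

    import torch
    import torch.nn as nn

    class AnswerHead(nn.Module):
        # Fuses the attended visual feature v_hat with the question vector q by
        # a Hadamard product and predicts a distribution over the answers.
        def __init__(self, d=1024, n_answers=9487):
            super().__init__()
            self.fc = nn.Linear(d, n_answers)

        def forward(self, m_hat, v, q):
            # m_hat: (k,) reordered attention; v: (k, d) projected visual features
            v_hat = torch.sum(m_hat.unsqueeze(-1) * v, dim=0)  # weighted average, formula (11)
            fused = v_hat * q                                  # Hadamard fusion with q, shape (d,)
            return torch.softmax(self.fc(fused), dim=-1)       # 9487-d probabilities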
The training model in the step (4) is as follows:
and (3) comparing the predicted 9487 dimensional vector generated in the step (3) with a correct answer of the question, calculating the difference between a predicted value and an actual correct value through a loss function defined by the user to form a loss value, and adjusting the parameter value of the whole network by using a BP algorithm according to the loss value so as to gradually reduce the difference between the predicted value and the actual value generated by the network until the network converges.

Claims (5)

1. An image question-answering method based on multi-target association depth reasoning is characterized by comprising the following steps:
step (1), data preprocessing and feature extraction for the image and text data
image preprocessing:
detecting the target entities contained in the image with a Faster R-CNN deep neural network structure; extracting the visual features V and the geometric features G containing the size and coordinate information of each target in the image;
text preprocessing:
counting the sentence lengths of the given question texts, and setting the maximum length of the question text according to these statistics; constructing a question text vocabulary dictionary, replacing the words of a question with their index values in the vocabulary dictionary, and then converting the question text into a vector q through an LSTM;
step (2), attention module enhanced with the candidate-box geometric features
the inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m;
first encoding the attention weight vector m by rank: ordering the targets by weight, converting the ranks into vectors, mapping them to a high dimension and adding the visual features V mapped to the same dimension, and processing the output by layer normalization to obtain V_A;
then mapping the geometric features G through a linear layer followed by a ReLU activation to obtain G_R; feeding V_A and G_R into a candidate-box relation component for reasoning to obtain O_relation; passing O_relation through a linear layer and a sigmoid function and multiplying it with the original attention weight vector m to obtain a new attention weight vector m̂;
Step (3) constructing a deep neural network
Firstly, converting a problem text into an index value vector according to a vocabulary dictionary; then the vector is transmitted into a Long Short Term Memory network (LSTM) through high-dimensional mapping, the output vector q and the visual feature V obtained by using fast R-CNN are fused in a Hadamard product (Hadamard product) mode, and an attention weight vector m of each entity feature is obtained through an attention module; inputting the attention weight vector m, the visual feature V and the geometric feature G into an adaptive attention module based on the geometric feature enhancement of the candidate frame, reasoning by using the visual feature and the geometric feature of the position of the candidate frame, reordering the attention weight vector to obtain a new attention weight vector
Figure FDA0002820731830000021
Attention weight vector
Figure FDA0002820731830000022
Fusing the product with the visual feature V and then carrying out weighted average to obtain new visual features
Figure FDA0002820731830000023
Characterizing visual features
Figure FDA0002820731830000024
Generating probability through a softmax function by fusing the problem text vector q with a Hadamard product, and outputting the probability as an output predicted value of the network;
step (4), model training
training the model parameters of the neural network of step (3) with a back-propagation algorithm, according to the difference between the generated predictions and the ground-truth answers for the image, until the whole network model converges.
2. The image question-answering method based on multi-target association depth reasoning according to claim 1, characterized in that the step (1) is implemented as follows:
1-1. extracting the features of an image i with the existing deep neural network Faster R-CNN, the extracted features comprising the visual features V and the geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]; the visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, with g_i ∈ R^4, where x, y, w and h are the position parameters of the geometric feature and represent, respectively, the abscissa, the ordinate, the width and the height of the candidate box in which the entity is located in the image;
1-2. for a given question text, first counting the distinct words appearing in the question texts of the data set and recording them in a dictionary; converting the words of a question into index values according to this word dictionary, so that the question text is converted into an index vector of fixed length, the specific formula being as follows:
Q = {w_1^idx, w_2^idx, ..., w_l^idx}  (formula 1)
where w_k^idx is the index value of the word w_k in the dictionary, and l represents the length of the question text.
3. The image question-answering method based on multi-objective association depth reasoning according to claim 2, wherein the adaptive attention module depth reasoning network based on candidate box geometric feature enhancement in step (2) is specifically as follows:
2-1. first processing the input attention weight vector m; sorting the attention weights m = {m_1, m_2, ..., m_k} of the targets by value and encoding the rank pos of each target into a vector PE_pos ∈ R^d, the specific formula being as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (formula 2)
where d is the encoding dimension, i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k], yielding the matrix PE ∈ R^(k×d) based on the attention weight vector m;
2-2. passing the matrix PE and the visual features V each through a different linear layer, adding them, and processing the output by layer normalization to obtain V_A, the specific formula being as follows:
V_A = LayerNorm(W_PE · PE^T + W_V · V^T)  (formula 3)
where W_PE and W_V are the learnable parameters of the two linear layers;
2-3. performing a correlation computation on the geometric features G and passing them through a linear layer to obtain G_R, the specific formulas being as follows:
G_R = W_G · Ω(G)^T  (formula 4)
Ω(G)_{m,n} = ε_G(g_m, g_n)  (formula 5)
where m, n ∈ [1, 2, ..., k], ε_G encodes the relative geometric relation between candidate boxes m and n with the sinusoidal encoding of formula (2), so that Ω(G) ∈ R^(k×k×d_g), and W_G maps the last dimension to a single value, giving G_R ∈ R^(k×k);
2-4. feeding V_A and G_R into the relation module for reasoning to obtain O_relation, the specific formulas being as follows:
V_R = (W_K V_A)^T · (W_Q V_A)  (formula 6)
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where W_K, W_Q and W_O are learnable linear mappings, b_O is a bias term, and V_R ∈ R^(k×k) contains the pairwise similarities of the target features;
2-5. passing O_relation through a fully connected layer and a sigmoid function and multiplying the result with the original attention weight vector m to obtain a new attention weight vector m̂, the specific formula being as follows:
m̂ = m ⊙ σ(W_F · O_relation + b_F)  (formula 8)
where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_F and b_F are the parameters of the fully connected layer.
4. The image question-answering method based on multi-objective association depth reasoning according to claim 3, wherein the deep neural network is constructed in the step (3), and specifically comprises the following steps:
3-1. mapping the question text vector q and the visual features V to a common space by the linear transformations of fully connected layers and fusing them with a Hadamard product, F_fusion denoting the fused feature in the common space; W_r and W_q denoting the fully connected layer parameters that linearly transform the visual feature V and the current state information q, respectively; the symbol ⊙ denoting the Hadamard product of the two matrices; W_m denoting the fully connected layer parameters that reduce the dimension of the fused feature and produce the attention weight vector distribution, m ∈ R^k being the initial attention weight vector; j denoting the index of the region whose attention weight is currently computed; the specific formulas being as follows:
F_fusion^j = (W_r v_j) ⊙ (W_q q)  (formula 9)
m = softmax(W_m F_fusion + b_m)  (formula 10)
3-2. according to step (2), feeding m, V and G into the adaptive attention module enhanced with the candidate-box geometric features, reasoning over the features V and G, and reordering m to obtain a new attention feature m̂;
3-3. obtaining the new visual feature v̂ as the weighted average of the features V under the attention weights m̂, the specific formula being as follows:
v̂ = Σ_{j=1}^{k} m̂_j · v_j  (formula 11)
5. the image question-answering method based on multi-target association depth reasoning according to claim 4, wherein the model training in the step (4) is as follows:
the question-answer pairs in the VQA-v2.0 dataset are answered by multiple annotators, so the same question may have several different correct answers; previous image question-answering models treated the answer with the most votes as the only correct answer and applied one-hot encoding to it; because the correct answers are diverse, all answers to the same question are counted as votes, and the weight of each correct answer among all correct answers is determined by its number of votes; a Kullback-Leibler divergence loss function is used, with N denoting the length of the answer vocabulary, Predict denoting the predicted distribution, and GT denoting the ground-truth distribution; the loss is then defined as:
Loss = Σ_{i=1}^{N} GT_i · log(GT_i / Predict_i)  (formula 12)
CN201910398140.1A 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning Active CN110263912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910398140.1A CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910398140.1A CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Publications (2)

Publication Number Publication Date
CN110263912A CN110263912A (en) 2019-09-20
CN110263912B true CN110263912B (en) 2021-02-26

Family

ID=67914695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910398140.1A Active CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Country Status (1)

Country Link
CN (1) CN110263912B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879844B (en) * 2019-10-25 2022-10-14 北京大学 Cross-media reasoning method and system based on heterogeneous interactive learning
CN110889505B (en) * 2019-11-18 2023-05-02 北京大学 Cross-media comprehensive reasoning method and system for image-text sequence matching
CN111598118B (en) * 2019-12-10 2023-07-07 中山大学 Visual question-answering task implementation method and system
CN111553372B (en) * 2020-04-24 2023-08-08 北京搜狗科技发展有限公司 Training image recognition network, image recognition searching method and related device
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN111611367B (en) * 2020-05-21 2023-04-28 拾音智能科技有限公司 Visual question-answering method introducing external knowledge
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111897939B (en) * 2020-08-12 2024-02-02 腾讯科技(深圳)有限公司 Visual dialogue method, training method, device and equipment for visual dialogue model
CN112016493A (en) * 2020-09-03 2020-12-01 科大讯飞股份有限公司 Image description method and device, electronic equipment and storage medium
CN112309528B (en) * 2020-10-27 2023-04-07 上海交通大学 Medical image report generation method based on visual question-answering method
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113094484A (en) * 2021-04-07 2021-07-09 西北工业大学 Text visual question-answering implementation method based on heterogeneous graph neural network
CN113326933B (en) * 2021-05-08 2022-08-09 清华大学 Attention mechanism-based object operation instruction following learning method and device
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113220859A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Image-based question and answer method and device, computer equipment and storage medium
CN113392253B (en) * 2021-06-28 2023-09-29 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113515615A (en) * 2021-07-09 2021-10-19 天津大学 Visual question-answering method based on capsule self-guide cooperative attention mechanism
CN113792703B (en) * 2021-09-29 2024-02-02 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention depth modular network
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114564958B (en) * 2022-01-11 2023-08-04 平安科技(深圳)有限公司 Text recognition method, device, equipment and medium
CN117274616B (en) * 2023-09-26 2024-03-29 南京信息工程大学 Multi-feature fusion deep learning service QoS prediction system and prediction method


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366166B2 (en) * 2017-09-07 2019-07-30 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN109829049B (en) * 2019-01-28 2021-06-01 杭州一知智能科技有限公司 Method for solving video question-answering task by using knowledge base progressive space-time attention network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN109712108A (en) * 2018-11-05 2019-05-03 杭州电子科技大学 It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Kan Chen, et al., "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering", arXiv:1511.05960v2, 2016-04-03, full text *
Ashish Vaswani, et al., "Attention Is All You Need", arXiv:1706.03762v5, 2017-12-06, full text *
Li Qing, "Research on Image Question Answering Based on Deep Neural Networks and Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology, No. 1, 2019-01-15, full text *
Yu Jun, et al., "Research on Visual Question Answering Techniques", Journal of Computer Research and Development, Vol. 55, No. 9, 2018-12-31, full text *

Also Published As

Publication number Publication date
CN110263912A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN110334705B (en) Language identification method of scene text image combining global and local information
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112464004A (en) Multi-view depth generation image clustering method
CN113204633B (en) Semantic matching distillation method and device
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN115222998B (en) Image classification method
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN116580440A (en) Lightweight lip language identification method based on visual transducer
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN114781503A (en) Click rate estimation method based on depth feature fusion
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
Jiang et al. Cross-level reinforced attention network for person re-identification
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
Miao et al. Chinese font migration combining local and global features learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant