CN110263912A - Image question-answering method based on multi-target association deep reasoning - Google Patents
- Publication number
- CN110263912A CN110263912A CN201910398140.1A CN201910398140A CN110263912A CN 110263912 A CN110263912 A CN 110263912A CN 201910398140 A CN201910398140 A CN 201910398140A CN 110263912 A CN110263912 A CN 110263912A
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- feature
- attention
- weight vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention discloses an image question-answering method based on multi-target association deep reasoning. The method comprises the following steps: 1. preprocess the image and the natural-language text describing it; 2. build an adaptive attention module enhanced with candidate-box geometric features, which re-orders the attention weights of the individual targets; 3. build a neural network structure around the AAM model; 4. train the neural network parameters with the back-propagation algorithm. The invention proposes a deep neural network for image question answering; in particular, it models image-question data jointly, reasons over the features of each target in the image, and re-orders the attention over the targets so that questions are answered more accurately, achieving better results in the image question-answering field.
Description
Technical field
The present invention relates to a deep neural network architecture for the image question-answering (Visual Question Answering) task, and in particular to a method that jointly models image-question data, discovers the interactions between the features of each entity in the image and the geometric features of their corresponding spatial positions, and adaptively adjusts the attention weights by modeling the positional relationships between them.
Background technique
Image question answering is an emerging task at the intersection of computer vision and natural language processing. Given a question about an image, the task is to let a machine answer it automatically. Compared with image captioning, another task at this intersection, image question answering is undoubtedly more complex, because the machine must obtain the correct result by understanding and reasoning over both the image and the question. A sentence such as "What color are her glasses?" carries rich semantic information: to answer it, the machine must first locate the region around the woman's eyes in the image, and then answer according to the keyword "color". For a question such as "What is the beard made of?", the machine may not be able to find the beard directly, but it can estimate from the position of the face which region the beard should occupy, focus on that region, and then answer according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep convolutional neural networks (Convolutional Neural Networks, CNN) or deep recurrent neural networks (Recurrent Neural Networks, RNN) has become the mainstream research direction in computer vision and natural language processing. Introducing the idea of end-to-end modeling into image question-answering research, and choosing an appropriate network structure so that a computer can automatically answer a question about an input image, is a problem well worth studying in depth.
It has long been recognized in computer vision that associations between contextual information, or between objects, help strengthen a model. However, most methods exploiting this information predate the spread of deep learning. In the current deep-learning era, little substantial progress has been made in exploiting the relational information between objects, especially in the image question-answering domain: most methods still attend to each entity separately. Objects in an image vary in two-dimensional spatial position, scale and aspect ratio, and an image question-answering model must rely on the correlations between entities to reason about the question. The positional information of objects, that is, their geometric features in the general sense, therefore plays a complex and important role in image question-answering models.
In terms of practical application, image question-answering algorithms have a wide range of application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft's HoloLens) and augmented reality, automatic question-answering systems over visual content may in the near future become an important mode of human-computer interaction. This technology can help people, especially the visually impaired, to better perceive and understand the world.
In conclusion, image question answering based on end-to-end modeling is a direction worth studying in depth. This project cuts in at several key difficulties of the task, solves the problems of current methods, and ultimately forms a complete image question-answering system.
Because image content under natural scenes is complex, with diverse subjects, and natural-language descriptions have high degrees of freedom, describing image content faces great challenges. Specifically, there are two main difficulties:
(1) How to extract effective features from the cross-media image-question data. Feature extraction is a classic, fundamental problem in cross-media representation research; common methods include image-processing features such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) and Haar features. In addition, features extracted with models built on deep-learning theory, such as ResNet, GoogLeNet and Faster R-CNN, have achieved excellent results in many fields, such as fine-grained image classification, natural language processing and recommender systems. Choosing a strategy that guarantees computational efficiency while improving the expressive power of the extracted cross-media features is therefore a direction worth in-depth study.
(2) How to reason about the question by relying on the correlations between entities in the image. The input of an image question-answering algorithm is an image and a question, and the image may contain multiple target entities. The algorithm should extract the feature of each target entity, correctly understand each target, and use the geometric and visual features of the targets to reason about the connections between them. How to let the algorithm automatically learn the connections between the targets of the image and form a more precise cross-media representation is a difficulty of image question-answering algorithms, and also a link vital to the performance of the result.
Summary of the invention
The present invention provides an image question-answering method based on multi-target association deep reasoning: a deep neural network architecture for the image question-answering (Visual Question Answering) task. The invention mainly comprises two points: 1. adopting image features with stronger expressive power that carry geometric information; 2. using the target features in the image to reason about the relationships between the targets.
The technical solution adopted by the present invention to solve its technical problem comprises the following steps:
Step (1): data preprocessing; extract features from the image and text data.
Image preprocessing: use the Faster R-CNN deep neural network structure to detect the target entities contained in the image; extract from the image the visual features V and the geometric features G containing the size and coordinate information of each target.
Text preprocessing: count the sentence lengths of the given question texts and set the maximum question length according to these statistics; build a question-text vocabulary dictionary, replace the words of the question by their index values in the dictionary, and then pass them through an LSTM, converting the question text into a vector q.
Step (2): attention module enhanced with candidate-box geometric features.
Its structure is shown in Fig. 2. The inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m.
First, m is sequentially encoded: it is converted into a vector according to the order of the weight sizes, mapped to a high dimension and added to the visual features V (likewise mapped to a high dimension); the output is passed through layer normalization (LayerNormalization) to obtain V_A.
Then the geometric features G are mapped through a linear layer and passed through the activation function ReLU to obtain G_R. V_A and G_R are fed into the candidate-box relation module (Relation Module), which reasons over them to obtain O_relation, as shown in Fig. 1. O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention vector m̂.
Step (3): building the deep neural network.
Its structure is shown in Fig. 3. First the question text is converted into an index-value vector according to the vocabulary dictionary. The vector is then mapped to a high dimension and passed into a Long Short-Term Memory network (LSTM); its output vector q is fused with the visual features V obtained with Faster R-CNN by means of a Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weight m, the visual features V and the geometric features G are fed into the Adaptive Attention Module (AAM) enhanced with candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and re-orders the attention weights, yielding the new attention vector m̂. The attention vector m̂ is multiplied with the visual features V and averaged to obtain the new visual feature v̂. The visual feature v̂ is fused with the question vector q by a Hadamard product and passed through a softmax function to generate probabilities, and this probability output is the predicted output of the network.
Step (4): model training.
According to the difference between the generated prediction and the actual answer for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Feature extraction is performed on an image i with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]. The visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w, h are the location parameters of the geometric feature: respectively the abscissa, ordinate, width and height of the candidate box around the entity in the image.
1-2. For a given question text, the distinct words appearing in the question texts of the dataset are first recorded in a dictionary. The words of the word list are then converted into index values according to the word dictionary, so that the question text is converted into a fixed-length index vector:
Q = [I(w_1), I(w_2), ..., I(w_l)]  (formula 1)
where I(w_k) is the index value of word w_k in the dictionary and l is the length of the question text.
The adaptive attention-module deep reasoning network enhanced with candidate-box geometric features described in step (2) is as follows:
2-1. The input attention weight vector m is processed first. The weights {m_1, m_2, ..., m_k} of the targets in m are sorted by value, and the resulting serial number pos of each target is encoded as
PE(pos, 2i) = sin(pos / 10000^{2i/d}),  PE(pos, 2i+1) = cos(pos / 10000^{2i/d})  (formula 2)
where i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k], yielding the matrix PE based on the attention weights m.
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added; the output is layer-normalized to obtain V_A:
V_A = LayerNorm(W_PE PE^T + W_V V^T)  (formula 3)
where W_PE and W_V are the linear-layer weights.
2-3. Relational computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R:
G_R = W_G Ω(G)^T  (formula 4)
where Ω(G) collects the relative geometric features GE_{mn} between candidate boxes m, n ∈ [1, 2, ..., k], and GE is encoded using formula (2).
2-4. V_A and G_R are fed into the relation module, which reasons to obtain O_relation:
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where V_R is the matrix of pairwise dot products between the projected target features of V_A.
2-5. O_relation is passed through a fully connected layer and a sigmoid function and multiplied with the original attention weights m to obtain the new attention vector m̂:
m̂ = m ⊙ σ(W' O_relation + b')  (formula 8)
where W' and b' are the fully-connected-layer parameters.
The deep neural network of step (3) is built as follows:
3-1. The question vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and then fused with a Hadamard product:
F_fusion = (W_r V^T) ⊙ (W_q q)  (formula 9)
where F_fusion is the fusion feature in the common space; W_r and W_q are the fully-connected-layer parameters that linearly transform the visual features V and the current state information q, and ⊙ denotes the Hadamard product of two matrices; W_m is the fully-connected-layer parameter that reduces the dimension of the fusion feature and generates the attention weight distribution; j indexes the attention weight of the j-th region currently being computed. The initial attention weight vector m is
m = softmax(W_m F_fusion + b_m)  (formula 10)
3-2. m, V and G are fed into the adaptive attention module enhanced with candidate-box geometric features according to step (2), which reasons over the features of V and G and re-orders m, yielding the new attention feature m̂.
3-3. The attended visual feature vector v̂ is obtained by taking the weighted average of the features of V with m̂:
v̂ = Σ_j m̂_j v_j  (formula 11)
The model of step (4) is trained as follows:
The question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so the same question may have different correct answers. Previous image question-answering models treat the answer with the most votes as the unique correct answer and one-hot encode it (one-hot encoding). Because the correct answers are diverse, all answers to the same question are instead treated as votes, and the weight of each correct answer among all correct answers is determined by its vote count. The Kullback-Leibler divergence is used as the loss function: let N be the length of the answer vocabulary, Predict the predicted distribution and GT the ground-truth distribution; then the loss is defined as
L = Σ_{i=1}^{N} GT_i log(GT_i / Predict_i)
The beneficial effects of the present invention are as follows:
The present invention jointly models image-description data, reasons over the features of each target in the image, and re-orders the attention over the targets so that images are described more accurately. The invention introduces, for the first time, the geometric features implicit in an image, structures them, and makes them reason cooperatively with the entity features in the image; combined with existing visual question-answering techniques, this effectively improves the accuracy of visual question-answering models.
The parameter count of the present invention is small; it is lightweight and efficient, which favors more efficient distributed training and deployment on memory-limited specific hardware.
Detailed description of the invention
Fig. 1: candidate-box relation module (Relation Module)
Fig. 2: adaptive attention module enhanced with candidate-box geometric features
Fig. 3: image question-answering neural network architecture with the adaptive attention module enhanced with candidate-box geometric features
Specific embodiment
The detailed parameters of the invention are further elaborated below.
As shown in Fig. 1, the present invention provides a deep neural network architecture for image question answering (Visual Question Answering).
The data preprocessing and feature extraction for images and text described in step (1) are as follows:
1-1. For image data we use the MS-COCO dataset as training and test data and extract its visual features with the existing Faster R-CNN model. Specifically, the image data is fed into the Faster R-CNN network, which detects and outlines 10-100 targets in the image; for each target a 2048-dimensional visual feature V is extracted from the image, and the coordinates and size {x, y, w, h} of each target's box are recorded as its geometric feature G, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100].
1-2. For the question texts, the distinct words appearing in the question texts of the dataset are counted first, and the 9847 words whose frequency is higher than 5 are recorded in a dictionary.
1-3. Only the first 16 words of each question sentence are kept; if a sentence has fewer than 16 words it is padded with null characters. Each word is replaced by its index value in the word dictionary generated in 1-2, completing the conversion from strings to numbers, so that each question is converted into a vector of 16 word indices.
The Adaptive Attention Module (AAM) model of step (2), enhanced with candidate-box geometric features, learns the associations between the target features V and the geometric features G of the image and re-orders the original attention information m of the input, as follows:
2-1. The input attention weight vector m is processed first: the attention values {m_1, m_2, ..., m_k} of the targets in m are sorted, their serial numbers pos are encoded, and the matrix PE based on the attention information m is obtained.
2-2. PE is mapped to 128 dimensions and added to V, likewise mapped to 128 dimensions; the output is layer-normalized, yielding a matrix V_A of size 100x128.
2-3. The relational computation on the features G is encoded by formula (2) into a matrix of dimension 100x100x64; the last dimension of this matrix is mapped to a single value and passed through the activation function ReLU, yielding a matrix G_R of dimension 100x100.
2-4. V_A and G_R are fed into the relation (Relation) module for reasoning: each target feature in V_A is first mapped to 128 dimensions, then the mutual dot products of the target features give a 100x100 matrix V_R. The joint computation of V_R with G_R yields a 100x100 matrix, which takes a weighted average over the targets in V_A to give the 100x128 matrix O_relation.
2-5. O_relation is passed through a fully connected layer and a sigmoid and multiplied with the original m, yielding the new 100-dimensional attention vector m̂.
The deep neural network of step (3) is built as follows:
3-1. For the question-text features, the text input here is the 16-dimensional index-value vector generated in step (1). Each word index is converted into its corresponding word vector using word-embedding technology; the word-vector size used here is 1024, so each question text becomes a matrix of size 16x1024. The input visual features are zero-padded into a 100x2048 matrix and mapped by a linear layer into a 100x1024 matrix. The word vector of each time step is then used as the input of an LSTM, a recurrent neural network structure, whose output we set to a 1024-dimensional vector q.
3-2. The output vector q of the LSTM is fed into the attention module, which produces a preliminary 100-dimensional attention feature m; at this point the image attention-information extraction (Attention) operation is complete.
3-3. m, V and G are fed into the Adaptive Attention Module (AAM) model enhanced with candidate-box geometric features according to step (2), which reasons over the features of V and G and re-orders m, yielding the new 100-dimensional attention feature m̂; at this point the relational reasoning between the targets in the image and the re-ordering of the attention are complete.
3-4. The weighted average of the 100-dimensional vector m̂ with the features V of dimension 100x1024 yields the 1024-dimensional attended visual feature v̂.
3-5. The re-ordered, attention-bearing visual feature v̂ generated above is fused with the LSTM output vector q and passed successively through an FC layer (a fully connected neural-network operation) and a softmax, finally outputting a 9487-dimensional prediction vector over the answer words, in which each element represents the probability that the answer indexed by that element is the answer to the given question.
The model of step (4) is trained as follows:
The 9487-dimensional prediction vector generated in step (3) is compared with the correct answer to the question: the loss function we define computes the difference between the predicted and actual correct values to form a loss, and according to this loss the parameters of the whole network are adjusted with the BP algorithm, so that the gap between the network's predictions and the actual values gradually shrinks until the network converges.
Claims (5)
1. An image question-answering method based on multi-target association deep reasoning, characterized by comprising the following steps:
Step (1): data preprocessing; extract features from the image and text data.
Image preprocessing: use the Faster R-CNN deep neural network structure to detect the target entities contained in the image; extract from the image the visual features V and the geometric features G containing the size and coordinate information of each target.
Text preprocessing: count the sentence lengths of the given question texts and set the maximum question length according to these statistics; build a question-text vocabulary dictionary, replace the words of the question by their index values in the dictionary, and then pass them through an LSTM, converting the question text into a vector q.
Step (2): attention module enhanced with candidate-box geometric features.
The inputs are three features: the geometric features G of the candidate-box positions, the visual features V and the attention weight vector m. First m is sequentially encoded: it is converted into a vector according to the order of the weight sizes, mapped to a high dimension and added to the visual features V (likewise mapped to a high dimension); the output is layer-normalized to obtain V_A. Then the geometric features G are mapped through a linear layer and passed through the activation function ReLU to obtain G_R. V_A and G_R are fed into the candidate-box relation module, which reasons to obtain O_relation; O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention weight vector m̂.
Step (3): building the deep neural network.
First the question text is converted into an index-value vector according to the vocabulary dictionary; the vector is then mapped to a high dimension and passed into a Long Short-Term Memory network (LSTM). Its output vector q is fused with the visual features V obtained with Faster R-CNN by means of a Hadamard product, and an attention module produces the attention weight vector m of each entity feature. The attention weight vector m, the visual features V and the geometric features G are fed into the adaptive attention module enhanced with candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and re-orders the attention weights, yielding the new attention weight vector m̂. The attention weight vector m̂ is multiplied with the visual features V and averaged to obtain the new visual feature v̂; v̂ is fused with the question vector q by a Hadamard product and passed through a softmax function to generate probabilities, and this probability output is the predicted output of the network.
Step (4): model training.
According to the difference between the generated prediction and the actual answer for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image question-answering method based on multi-target association deep reasoning according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Feature extraction is performed on an image i with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]; the visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w, h are the location parameters of the geometric feature: respectively the abscissa, ordinate, width and height of the candidate box around the entity in the image.
1-2. For a given question text, the distinct words appearing in the question texts of the dataset are first recorded in a dictionary; the words of the word list are converted into index values according to the word dictionary, so that the question text is converted into a fixed-length index vector:
Q = [I(w_1), I(w_2), ..., I(w_l)]  (formula 1)
where I(w_k) is the index value of word w_k in the dictionary and l is the length of the question text.
3. The image question-answering method based on multi-target association deep reasoning according to claim 2, characterized in that the adaptive attention-module deep reasoning network enhanced with candidate-box geometric features described in step (2) is as follows:
2-1. The input attention weight vector m is processed first: the weights {m_1, m_2, ..., m_k} of the targets in m are sorted by value, and the resulting serial number pos of each target is encoded as
PE(pos, 2i) = sin(pos / 10000^{2i/d}),  PE(pos, 2i+1) = cos(pos / 10000^{2i/d})  (formula 2)
where i ∈ [0, 1, ..., d/2] and pos ∈ [1, 2, ..., k], yielding the matrix PE based on the attention weight vector m.
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added; the output is layer-normalized to obtain V_A:
V_A = LayerNorm(W_PE PE^T + W_V V^T)  (formula 3)
2-3. Relational computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R:
G_R = W_G Ω(G)^T  (formula 4)
where Ω(G) collects the relative geometric features GE_{mn} between candidate boxes m, n ∈ [1, 2, ..., k], and GE is encoded using formula (2).
2-4. V_A and G_R are fed into the relation module, which reasons to obtain O_relation:
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where V_R is the matrix of pairwise dot products between the projected target features of V_A.
2-5. O_relation is passed through a fully connected layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention weight vector m̂:
m̂ = m ⊙ σ(W' O_relation + b')  (formula 8)
4. The image question-answering method based on multi-target association depth reasoning according to claim 3, characterized in that the deep neural network of step (3) is built as follows:
3-1. Map the question text vector q and the visual feature V into a common space via fully connected linear transformations, then fuse them with the Hadamard product. F_fusion denotes the fused feature in the common space; W_r and W_q denote the fully connected layer parameters of the linear transformations applied to the visual feature V and the current state information q, respectively; the symbol ⊙ denotes the Hadamard product of two matrices; W_m denotes the fully connected layer parameters that reduce the dimension of the fused feature and generate the attention weight distribution; m is the initial attention weight vector, and j indexes the attention weight of the j-th region currently being computed. The specific formula is as follows:
m = softmax(W_m · F_fusion + b_m) (formula 10)
3-2. Feed m, V, and G into the candidate-box-geometry-enhanced adaptive attention module of step (2), which reasons over the features of V and G and reorders m, obtaining the new attention feature m̃.
3-3. Multiply m̃ with the features of V and take the weighted average to obtain the visual feature vector v̂:
v̂ = Σ_j m̃_j · v_j
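Steps 3-1 and 3-3 amount to Hadamard-product fusion followed by softmax attention and a weighted sum. The NumPy sketch below assumes random matrices for the learned parameters W_r, W_q, and W_m and illustrative dimensions only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fused_attention(V, q, d=32, seed=0):
    """F_fusion = (W_r V) ⊙ (W_q q); m = softmax(W_m F_fusion + b_m) (formula 10)."""
    rng = np.random.default_rng(seed)
    k, dv = V.shape
    W_r = rng.normal(size=(d, dv)) * 0.1   # projects each region feature
    W_q = rng.normal(size=(d, q.shape[0])) * 0.1  # projects the question vector
    F_fusion = (V @ W_r.T) * (q @ W_q.T)   # (k, d): Hadamard product, q broadcast
    W_m, b_m = rng.normal(size=d) * 0.1, 0.0
    return softmax(F_fusion @ W_m + b_m)   # one attention weight per region

def attended_feature(m_tilde, V):
    """Step 3-3: weighted average of the region features."""
    return m_tilde @ V

k, dv, dq = 4, 8, 6
rng = np.random.default_rng(2)
V, q = rng.normal(size=(k, dv)), rng.normal(size=dq)
m = fused_attention(V, q)
v_hat = attended_feature(m, V)
print(m.shape, v_hat.shape)                # → (4,) (8,)
```

Projecting both modalities into one common space before the element-wise product is what lets a single softmax rank regions by their relevance to the question.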
5. The image question-answering method based on multi-target association depth reasoning according to claim 4, characterized in that the model of step (4) is trained as follows:
The question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so the same question may have several different correct answers. Earlier image question-answering models treat the answer with the highest vote count as the unique correct answer and apply one-hot encoding to it. Because correct answers are diverse, all answers to the same question are instead tallied, and the vote counts determine the weight of each correct answer among all correct answers; a Kullback-Leibler divergence loss function is then used. Let N denote the length of the answer vocabulary, Predict the predicted distribution, and GT the ground-truth distribution. The loss is defined as:
L_KL = Σ_{i=1}^{N} GT_i · log(GT_i / Predict_i)
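The soft-label construction and KL loss described above can be sketched as follows; the three-word answer vocabulary and the vote counts are made-up illustrative data, not from the patent:

```python
import numpy as np

def soft_answer_targets(votes, vocab):
    """Turn annotator vote counts into a weight distribution over the answer vocabulary."""
    gt = np.zeros(len(vocab))
    for answer, count in votes.items():
        gt[vocab[answer]] = count
    return gt / gt.sum()

def kl_loss(gt, pred, eps=1e-9):
    """L_KL = sum_i GT_i * log(GT_i / Predict_i) over the vocabulary of length N."""
    return float(np.sum(gt * np.log((gt + eps) / (pred + eps))))

vocab = {"red": 0, "blue": 1, "green": 2}   # hypothetical answer vocabulary
votes = {"red": 7, "blue": 3}               # 10 annotators, two correct answers
gt = soft_answer_targets(votes, vocab)      # [0.7, 0.3, 0.0] instead of one-hot
pred = np.array([0.7, 0.3, 0.0])
print(round(kl_loss(gt, pred), 6))          # prints 0.0 (perfect match)
```

Unlike one-hot cross-entropy, this target rewards the model for spreading probability mass across every answer the annotators considered correct, in proportion to the votes.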
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910398140.1A CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263912A true CN110263912A (en) | 2019-09-20 |
CN110263912B CN110263912B (en) | 2021-02-26 |
Family
ID=67914695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910398140.1A Active CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263912B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879844A (en) * | 2019-10-25 | 2020-03-13 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110889505A (en) * | 2019-11-18 | 2020-03-17 | 北京大学 | Cross-media comprehensive reasoning method and system for matching image-text sequences |
CN111553372A (en) * | 2020-04-24 | 2020-08-18 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111598118A (en) * | 2019-12-10 | 2020-08-28 | 中山大学 | Visual question-answering task implementation method and system |
CN111611367A (en) * | 2020-05-21 | 2020-09-01 | 拾音智能科技有限公司 | Visual question answering method introducing external knowledge |
CN111737458A (en) * | 2020-05-21 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Intention identification method, device and equipment based on attention mechanism and storage medium |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, and training method, device and equipment for a visual dialogue model |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
CN112309528A (en) * | 2020-10-27 | 2021-02-02 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113094484A (en) * | 2021-04-07 | 2021-07-09 | 西北工业大学 | Text visual question-answering implementation method based on heterogeneous graph neural network |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113326933A (en) * | 2021-05-08 | 2021-08-31 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113392253A (en) * | 2021-06-28 | 2021-09-14 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113515615A (en) * | 2021-07-09 | 2021-10-19 | 天津大学 | Visual question-answering method based on capsule self-guide cooperative attention mechanism |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114564958A (en) * | 2022-01-11 | 2022-05-31 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN117274616A (en) * | 2023-09-26 | 2023-12-22 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160342895A1 (en) * | 2015-05-21 | 2016-11-24 | Baidu Usa Llc | Multilingual image question answering |
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Image caption generation method and system fusing visual attention and semantic attention |
CN107766447A (en) * | 2017-09-25 | 2018-03-06 | 浙江大学 | Method for video question answering using a multi-layer attention network mechanism |
CN108108771A (en) * | 2018-01-03 | 2018-06-01 | 华南理工大学 | Image question-answering method based on multi-scale deep learning |
CN108228703A (en) * | 2017-10-31 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image question-answering method, apparatus, system and storage medium |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | Automatic image caption generation method based on multi-modal attention |
US20190073353A1 (en) * | 2017-09-07 | 2019-03-07 | Baidu Usa Llc | Deep compositional frameworks for human-like language acquisition in virtual environments |
CN109472024A (en) * | 2018-10-25 | 2019-03-15 | 安徽工业大学 | Text classification method based on a bidirectional recurrent attention neural network |
CN109712108A (en) * | 2018-11-05 | 2019-05-03 | 杭州电子科技大学 | Visual grounding method based on a diverse and discriminative candidate-box generation network |
CN109829049A (en) * | 2019-01-28 | 2019-05-31 | 杭州一知智能科技有限公司 | Method for solving video question-answering tasks using a knowledge-base progressive spatio-temporal attention network |
Non-Patent Citations (4)
Title |
---|
ASHISH VASWANI,ET AL: "《Attention Is All You Need》", 《ARXIV:1706.03762V5》 * |
KAN CHEN,ET AL: "《ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering》", 《ARXIV:1511.05960V2》 * |
俞俊, ET AL: "Research on Visual Question Answering Techniques", Journal of Computer Research and Development (《计算机研究与发展》) * |
李庆: "Research on Image Question Answering Based on Deep Neural Networks and Attention Mechanisms", China Masters' Theses Full-text Database, Information Science and Technology * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879844B (en) * | 2019-10-25 | 2022-10-14 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110879844A (en) * | 2019-10-25 | 2020-03-13 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110889505A (en) * | 2019-11-18 | 2020-03-17 | 北京大学 | Cross-media comprehensive reasoning method and system for matching image-text sequences |
CN110889505B (en) * | 2019-11-18 | 2023-05-02 | 北京大学 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
CN111598118A (en) * | 2019-12-10 | 2020-08-28 | 中山大学 | Visual question-answering task implementation method and system |
CN111598118B (en) * | 2019-12-10 | 2023-07-07 | 中山大学 | Visual question-answering task implementation method and system |
CN111553372A (en) * | 2020-04-24 | 2020-08-18 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111553372B (en) * | 2020-04-24 | 2023-08-08 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111611367A (en) * | 2020-05-21 | 2020-09-01 | 拾音智能科技有限公司 | Visual question answering method introducing external knowledge |
CN111737458A (en) * | 2020-05-21 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Intention identification method, device and equipment based on attention mechanism and storage medium |
CN111737458B (en) * | 2020-05-21 | 2024-05-21 | 深圳赛安特技术服务有限公司 | Attention mechanism-based intention recognition method, device, equipment and storage medium |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN111897939B (en) * | 2020-08-12 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training method, device and equipment for visual dialogue model |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training device and training equipment of visual dialogue model |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
CN112309528A (en) * | 2020-10-27 | 2021-02-02 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN112309528B (en) * | 2020-10-27 | 2023-04-07 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113010712B (en) * | 2021-03-04 | 2022-12-02 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113094484A (en) * | 2021-04-07 | 2021-07-09 | 西北工业大学 | Text visual question-answering implementation method based on heterogeneous graph neural network |
CN113326933B (en) * | 2021-05-08 | 2022-08-09 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113326933A (en) * | 2021-05-08 | 2021-08-31 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113220859B (en) * | 2021-06-01 | 2024-05-10 | 平安科技(深圳)有限公司 | Question answering method and device based on image, computer equipment and storage medium |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113392253A (en) * | 2021-06-28 | 2021-09-14 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113392253B (en) * | 2021-06-28 | 2023-09-29 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113515615A (en) * | 2021-07-09 | 2021-10-19 | 天津大学 | Visual question-answering method based on capsule self-guide cooperative attention mechanism |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113792703B (en) * | 2021-09-29 | 2024-02-02 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention depth modular network |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114564958B (en) * | 2022-01-11 | 2023-08-04 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN114564958A (en) * | 2022-01-11 | 2022-05-31 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN117274616A (en) * | 2023-09-26 | 2023-12-22 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
CN117274616B (en) * | 2023-09-26 | 2024-03-29 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
Also Published As
Publication number | Publication date |
---|---|
CN110263912B (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263912A (en) | Image question-answering method based on multi-target association depth reasoning | |
CN111047548B (en) | Attitude transformation data processing method and device, computer equipment and storage medium | |
CN112766158B (en) | Multi-task cascading type face shielding expression recognition method | |
CN109299701A (en) | Face age estimation method based on GAN-expanded multi-ethnicity feature collaborative selection | |
CN106778506A (en) | Expression recognition method fusing depth images and multi-channel features | |
CN111681178B (en) | Knowledge distillation-based image defogging method | |
CN112949647B (en) | Three-dimensional scene description method and device, electronic equipment and storage medium | |
CN111161200A (en) | Human body posture migration method based on attention mechanism | |
CN110046550A (en) | Pedestrian's Attribute Recognition system and method based on multilayer feature study | |
CN109712108A (en) | Visual grounding method based on a diverse and discriminative candidate-box generation network | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
Gupta et al. | Rv-gan: Recurrent gan for unconditional video generation | |
CN114581502A (en) | Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium | |
CN117011883A (en) | Pedestrian re-recognition method based on pyramid convolution and transducer double branches | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN114638408A (en) | Pedestrian trajectory prediction method based on spatiotemporal information | |
CN116386102A (en) | Face emotion recognition method based on improved residual convolution network acceptance block structure | |
CN114723784A (en) | Pedestrian motion trajectory prediction method based on domain adaptation technology | |
CN110222568A (en) | Cross-view gait recognition method based on spatiotemporal graphs | |
CN116185182B (en) | Controllable image description generation system and method for fusing eye movement attention | |
CN116311472A (en) | Micro-expression recognition method and device based on multi-level graph convolution network | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
CN117351382A (en) | Video object positioning method and device, storage medium and program product thereof | |
Pu et al. | Differential residual learning for facial expression recognition | |
CN118093840B (en) | Visual question-answering method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||