CN110263912A - Image question answering method based on multi-object relational deep reasoning - Google Patents

Publication number
CN110263912A
Authority
CN
China
Prior art keywords
image
vector
feature
attention
weight vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910398140.1A
Other languages
Chinese (zh)
Other versions
CN110263912B (en)
Inventor
余宙
俞俊
汪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201910398140.1A
Publication of CN110263912A
Application granted
Publication of CN110263912B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image question answering method based on multi-object relational deep reasoning. The method comprises the following steps: (1) preprocessing the image and the natural-language question text associated with it; (2) reordering the attention over the detected objects with an adaptive attention module (AAM) enhanced by candidate-box geometric features; (3) building the neural network structure based on the AAM; (4) training the model, with the network parameters learned by the back-propagation algorithm. The invention proposes a deep neural network for image question answering. In particular, it models image and question-text data jointly, reasons over the features of the objects in the image, and reorders the attention assigned to each object so that questions are answered more accurately. The method achieves good results in the field of image question answering.

Description

Image question answering method based on multi-object relational deep reasoning
Technical field
The present invention relates to a deep neural network structure for the image question answering (Visual Question Answering) task, and in particular to a method that models image and question-answer data jointly, explores the interaction between each entity feature in the image and the geometric feature of its spatial position, and adaptively adjusts the attention weights by modeling the positional relationships among the entities.
Background art
Image question answering is an emerging task at the intersection of computer vision and natural language processing. Given a question about an image, the machine is expected to produce the answer automatically. Compared with image captioning, another task that crosses computer vision and natural language processing, image question answering requires the machine to understand both the image and the question and to reason its way to the correct result, so it is considerably more complex. For example, a sentence such as "What color are her glasses?" carries rich semantic information. To answer it, the machine must first locate the region around the woman's eyes in the image and then answer according to the keyword "color". As another example, for the question "What is the mustache made of?", the machine may be unable to find the mustache directly, but it can estimate from the position of the face the region where the mustache should be and attend to that region. It then answers according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep convolutional neural networks (Convolutional Neural Networks, CNN) or deep recurrent neural networks (Recurrent Neural Networks, RNN) has become the mainstream research direction in computer vision and natural language processing. Introducing end-to-end modeling into image question answering, with suitable network structures, so that a computer answers automatically from the input question and image, is a research problem worth studying in depth.
For many years it has been well recognized in computer vision that contextual information, or the relationships between objects, helps improve models. Most methods that exploit this information, however, predate the widespread adoption of deep learning. In the current deep learning era, little substantial progress has been made in using relation information between objects, especially in image question answering, where most methods still attend to each entity separately. Objects in an image vary in two-dimensional spatial position and in scale and aspect ratio, and an image question answering model must rely on the relationships between entities to reason about the question. The position information of objects, that is, their geometric features in the general sense, therefore plays a complex and important role in image question answering models.
In terms of practical application, image question answering algorithms have a wide range of application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and of augmented reality technology, automatic question answering systems for image content based on visual perception may become an important mode of human-computer interaction in the near future. This technology can help people, especially the visually impaired, to better perceive and understand the world.
In summary, image question answering based on end-to-end modeling is a direction worth studying in depth. This work starts from several key difficulties of the task, addresses the problems of current methods, and ultimately forms a complete image question answering system.
Image content in natural scenes is complex and its subjects are diverse, and natural-language descriptions have a high degree of freedom, so describing image content faces great challenges. Specifically, there are two main difficulties:
(1) How to extract features effectively from cross-media image-question data. Feature extraction is a classic and fundamental problem in cross-media representation research. Common methods include image feature extractors such as the histogram of oriented gradients (Histogram of Oriented Gradient, HOG), local binary patterns (Local Binary Pattern, LBP), and Haar features. In addition, features extracted by deep learning models such as ResNet, GoogLeNet, and Faster-RCNN have performed well in many fields, including fine-grained image classification, natural language processing, and recommender systems. Choosing a suitable strategy for cross-media feature extraction that improves the expressive power of the features while keeping computation efficient is therefore a direction worth studying in depth.
(2) How to rely on the relationships between entities in the image to reason about the question. The input of an image question answering algorithm is an image and a question, and the image may contain multiple target entities. The algorithm should extract the features of each target entity so that each target is understood correctly, and should also use the geometric and visual features of the targets to infer the connections between them. How to let the algorithm learn the connections between the targets in an image automatically and form a more accurate cross-media representation is therefore a difficulty in image question answering, and a crucial factor affecting the performance of the algorithm.
Summary of the invention
The present invention provides an image question answering method based on multi-object relational deep reasoning, in the form of a deep neural network architecture for the image question answering (Visual Question Answering) task. The invention has two main points: 1. it uses image features that are more expressive and carry geometric information; 2. it uses the object features in the image to reason about the relationships between the objects.
The technical solution adopted by the present invention to solve its technical problem includes the following steps:
Step (1), data preprocessing: extract features from the image and text data
The image is preprocessed first:
The target entities contained in the image are detected with the Faster-RCNN deep neural network. From the image, the visual features V and the geometric features G, which contain the size and coordinate information of each target, are extracted.
The text data is then preprocessed:
The sentence lengths of the given question texts are counted, and the maximum length of a question text is set according to these statistics. A vocabulary dictionary of the question texts is built, each word of a question is replaced with its index in the vocabulary dictionary, and the result is passed through an LSTM, so that the question text is converted into a vector q.
Step (2), the attention module enhanced by candidate-box geometric features
Its structure is shown in Fig. 2. The three inputs are the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m.
The attention weight vector m is first rank-encoded: it is converted into a vector according to the order of the weight magnitudes, mapped to a high dimension, and added to the visual features V, which are mapped to the same high dimension. The sum is processed by layer normalization (Layer Normalization) to obtain V_A.
The geometric features G are then mapped by a linear layer and passed through the ReLU activation function to obtain G_R. V_A and G_R are input to the candidate-box relation module (Relation Module), shown in Fig. 1, which reasons over them to obtain O_relation. O_relation is passed through a linear layer and a sigmoid function and multiplied by the original attention weight vector m to obtain the new attention vector.
Step (3), building the deep neural network
Its structure is shown in Fig. 3. The question text is first converted into an index vector according to the vocabulary dictionary. The vector is then mapped to a high dimension and fed into a long short-term memory network (Long Short Term Memory, LSTM). The output vector q is fused with the visual features V obtained with Faster R-CNN by the Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weights m, the visual features V, and the geometric features G are input to the adaptive attention module enhanced by candidate-box geometric features (Adaptive Attention Module, AAM), which reasons over the visual features and the geometric features of the candidate-box positions and reorders the attention weights to obtain a new attention vector. The new attention vector is multiplied with the visual features V and a weighted average is taken to obtain a new visual feature. This visual feature is fused with the question text vector q by the Hadamard product, a softmax function produces probabilities, and these probabilities are the output prediction of the network.
Step (4), model training
According to the difference between the generated prediction and the ground truth for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Features are extracted from image i with the existing deep neural network Faster-RCNN. The extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, where $V = \{v_1, v_2, \ldots, v_k\}$, $G = \{g_1, g_2, \ldots, g_k\}$, and $k \in [10, 100]$. The visual vector of a single target is $v_i$ and the geometric feature of a single target is $g_i = \{x, y, w, h\}$, where x, y, w, h are the position parameters of the geometric feature, denoting the abscissa, the ordinate, the width, and the height of the candidate box containing the entity in the image;
1-2. For the given question texts, the distinct words in the question texts of the dataset are first counted and recorded in a dictionary. The words in each word list are converted to index values according to the word dictionary, so that each question text is converted to a fixed-length index vector. The formula is as follows:
where each element is the index value of word $w_k$ in the dictionary and l is the length of the question text.
The deep reasoning network of the adaptive attention module enhanced by candidate-box geometric features, described in step (2), is as follows:
2-1. The input attention weight vector m is processed first. The rank pos of each target's attention weight in $m = \{m_1, m_2, \ldots, m_k\}$, sorted by value, is encoded. The formula is as follows:
where $i \in [0, 1, \ldots, d/2]$ and $pos \in [1, 2, \ldots, k]$, which yields the matrix PE based on the attention weights m.
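The rank encoding of step 2-1 can be sketched as follows. Formula 2 did not survive in this text, so the sinusoidal form below is an assumption borrowed from Transformer position encodings; the descending sort order, the dimension d=64, the base 10000, and the function names rank_positions and sinusoidal_encoding are likewise hypothetical.

```python
import numpy as np

def rank_positions(m):
    """Rank of each target when attention weights are sorted in
    descending order (1 = largest weight). The sort direction is an assumption."""
    order = np.argsort(-m, kind="stable")
    pos = np.empty_like(order)
    pos[order] = np.arange(1, len(m) + 1)
    return pos

def sinusoidal_encoding(pos, d=64, base=10000.0):
    """Assumed Transformer-style encoding of the ranks; returns shape (k, d)."""
    i = np.arange(d // 2)
    angles = pos[:, None] / base ** (2 * i / d)
    pe = np.empty((len(pos), d))
    pe[:, 0::2] = np.sin(angles)  # even columns
    pe[:, 1::2] = np.cos(angles)  # odd columns
    return pe

m = np.array([0.1, 0.6, 0.3])   # toy attention weights for k = 3 targets
pos = rank_positions(m)          # ranks by weight
PE = sinusoidal_encoding(pos, d=64)
```

Encoding the rank rather than the raw weight gives each target a fixed-size vector that depends only on its position in the attention ordering.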
2-2. The matrix PE and the visual features V are passed through separate linear layers and added, and the sum is layer-normalized to obtain $V_A$. The formula is as follows:
$V_A = \mathrm{LayerNorm}(W_{PE}\,PE^{T} + W_V V^{T})$ (formula 3)
where $W_{PE}$ and $W_V$ are the parameters of the two linear layers.
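A minimal NumPy sketch of formula 3 follows. It stores one target per row, so the products appear transposed relative to the formula. The random weights, the 64-dimensional PE, and the omission of a learnable gain and bias in the layer normalization are assumptions made only for illustration; the 2048 and 128 dimensions follow the embodiment.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
k, d_pe, d_v, d_out = 100, 64, 2048, 128     # d_pe is an assumed value
PE = rng.normal(size=(k, d_pe))              # stand-in for the rank encoding
V = rng.normal(size=(k, d_v))                # stand-in for Faster-RCNN features
W_PE = rng.normal(size=(d_pe, d_out)) * 0.01  # hypothetical linear-layer weights
W_V = rng.normal(size=(d_v, d_out)) * 0.01

# Project both inputs to the same dimension, add, then layer-normalize.
V_A = layer_norm(PE @ W_PE + V @ W_V)
```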
2-3. A pairwise relation computation is applied to the geometric features G, and the result is passed through a linear layer to obtain $G_R$. The formula is as follows:
$G_R = W_G\,\Omega(G)^{T}$ (formula 4)
where $m, n \in [1, 2, \ldots, k]$ and GE is encoded using formula (2).
2-4. $V_A$ and $G_R$ are input to the relation module, which reasons over them to obtain $O_{relation}$. The formula is as follows:
$O_{relation} = \mathrm{softmax}(\log(G_R) + V_R) \cdot (W_O V_A + b_O)$ (formula 7)
Wherein
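Formula 7 can be illustrated with the sketch below, which assumes that $V_R$ is the matrix of pairwise dot products between two linear projections of $V_A$, as the embodiment describes. The projection names W_Q and W_K, the scaling by the square root of the dimension, and the clipping of $G_R$ before the logarithm are assumptions that the text does not state.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relation_module(V_A, G_R, W_Q, W_K, W_O, b_O, eps=1e-6):
    """O_relation = softmax(log(G_R) + V_R) . (V_A W_O + b_O), one row per target."""
    Q = V_A @ W_Q
    K = V_A @ W_K
    V_R = Q @ K.T / np.sqrt(Q.shape[-1])       # scaling is an assumption
    # G_R comes out of a ReLU, so zeros are clipped before taking the log.
    weights = softmax(np.log(np.maximum(G_R, eps)) + V_R, axis=-1)
    return weights @ (V_A @ W_O + b_O)

rng = np.random.default_rng(0)
k, d = 100, 128
V_A = rng.normal(size=(k, d))
G_R = np.maximum(rng.normal(size=(k, k)), 0.0)   # toy non-negative geometry weights
W_Q, W_K, W_O = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
b_O = np.zeros(d)
O_relation = relation_module(V_A, G_R, W_Q, W_K, W_O, b_O)
```

Adding log(G_R) inside the softmax means a pair of targets with zero geometric weight contributes almost nothing, whatever its visual similarity.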
2-5. $O_{relation}$ is passed through a fully connected layer and a sigmoid function, and the result is multiplied by the original attention weights m to obtain the new attention vector. The formula is as follows:
Wherein
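Step 2-5 amounts to a per-target gate on the original attention. The sketch below assumes that the fully connected layer maps each 128-dimensional row of $O_{relation}$ to one scalar; the names w_fc, b_fc, and m_new are hypothetical, and the text does not say whether the result is renormalized afterwards.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reweight_attention(m, O_relation, w_fc, b_fc=0.0):
    """Gate each original attention weight by sigmoid(FC(O_relation))."""
    gate = sigmoid(O_relation @ w_fc + b_fc)   # shape (k,), values in (0, 1)
    return gate * m

rng = np.random.default_rng(0)
k, d = 100, 128
m = rng.random(k)
m /= m.sum()                                   # toy initial attention weights
O_relation = rng.normal(size=(k, d))           # stand-in relation features
w_fc = rng.normal(size=d) * 0.1                # hypothetical FC weights
m_new = reweight_attention(m, O_relation, w_fc)
```

Because the gate lies strictly between 0 and 1, it can only shrink individual weights, so the reordering comes from shrinking some targets more than others.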
The deep neural network described in step (3) is built as follows:
3-1. The question text vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and then fused with the Hadamard product. $F_{fusion}$ denotes the fused feature in the common space. $W_r$ and $W_q$ denote the fully connected layer parameters that linearly transform the visual features V and the current state information q, respectively, and the product symbol in the formula denotes the Hadamard product of two matrices. $W_m$ denotes the fully connected layer parameters that reduce the dimension of the fused feature and generate the attention weight distribution. m is the initial attention weight vector, and j indexes the region whose attention weight is currently computed. The formulas are as follows:
$m = \mathrm{softmax}(W_m F_{fusion} + b_m)$ (formula 10)
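The fusion and formula 10 can be sketched as below. The fusion formula itself is missing from this text, so the form (V W_r) multiplied element-wise by (q W_q) is an assumption based on the description of $W_r$ and $W_q$; the common-space size 512 and the random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def initial_attention(V, q, W_r, W_q, w_m, b_m=0.0):
    """Fuse V and q by a Hadamard product, then score each region with softmax."""
    # The projected question (shape (d_c,)) broadcasts over the k regions.
    F_fusion = (V @ W_r) * (q @ W_q)
    return softmax(F_fusion @ w_m + b_m)

rng = np.random.default_rng(0)
k, d_v, d_q, d_c = 100, 2048, 1024, 512        # d_c is an assumed size
V = rng.normal(size=(k, d_v))
q = rng.normal(size=d_q)
W_r = rng.normal(size=(d_v, d_c)) * 0.01
W_q = rng.normal(size=(d_q, d_c)) * 0.01
w_m = rng.normal(size=d_c) * 0.1
m = initial_attention(V, q, W_r, W_q, w_m)
```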
3-2. m, V, and G are input to the adaptive attention module enhanced by candidate-box geometric features, as in step (2). The module reasons over the features of V and G and reorders m to obtain the new attention feature.
3-3. The new attention feature is multiplied with the features of V and a weighted average is taken to obtain the attended visual feature vector. The formula is as follows:
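The weighted average of step 3-3 reduces the k target features to a single vector. In the sketch below, renormalizing the attention weights so that they sum to one is an assumption, and the names attended_feature, m_new, and v_att are hypothetical.

```python
import numpy as np

def attended_feature(m_new, V, normalize=True):
    """Weighted average of the rows of V using the reordered attention."""
    w = m_new / m_new.sum() if normalize else m_new
    return w @ V

m_new = np.array([0.2, 0.1, 0.1])                     # toy reordered attention
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # toy 2-d visual features
v_att = attended_feature(m_new, V)
```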
The model training described in step (4) is as follows:
The questions in the VQA-v2.0 dataset are answered by several people, so the same question may have different correct answers. Previous image question answering models treat the answer with the most votes as the only correct answer and one-hot encode it (one-hot encoding). Because the correct answers are diverse, all answers to the same question are instead counted by vote, and the weight of each correct answer among all correct answers is determined by its vote count. The Kullback-Leibler divergence is used as the loss function. Let N denote the length of the answer vocabulary, Predict the predicted distribution, and GT the ground truth. The loss is then defined as follows:
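The soft-label construction and the loss can be sketched as follows, using the standard definition of the Kullback-Leibler divergence. The exact formula in the patent is not preserved here, so the direction KL(GT || Predict), the eps guard, and the helper names soft_targets and kl_divergence are assumptions.

```python
import numpy as np

def soft_targets(votes, num_answers):
    """Turn vote counts (answer index -> count) into a distribution over N answers."""
    gt = np.zeros(num_answers)
    for idx, count in votes.items():
        gt[idx] = count
    return gt / gt.sum()

def kl_divergence(gt, predict, eps=1e-12):
    """KL(GT || Predict), summed over answers with non-zero ground-truth weight."""
    mask = gt > 0
    return float(np.sum(gt[mask] * (np.log(gt[mask]) - np.log(predict[mask] + eps))))

# Toy example: 7 annotators chose answer 0 and 3 chose answer 2, with N = 4.
gt = soft_targets({0: 7, 2: 3}, num_answers=4)
predict = np.array([0.7, 0.1, 0.1, 0.1])
loss = kl_divergence(gt, predict)
```

Compared with a one-hot target, the soft target still rewards probability mass placed on the minority answer instead of treating it as an error.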
The beneficial effects of the present invention are as follows:
The present invention relates to a method that models image and question data jointly, reasons over the features of the objects in the image, and reorders the attention assigned to each object so that the question is answered more accurately. The invention is the first to introduce the geometric features implicit in the image, give them structure, and let them reason cooperatively with the entity features in the image. Combined with existing visual question answering techniques, it can effectively improve the accuracy of visual question answering models.
The invention has a small number of parameters and is lightweight and efficient, which favors more efficient distributed training and deployment on specific hardware with limited memory.
Description of the drawings
Fig. 1: candidate-box relation module (Relation Module)
Fig. 2: adaptive attention module enhanced by candidate-box geometric features
Fig. 3: image question answering neural network architecture with the adaptive attention module enhanced by candidate-box geometric features
Detailed description of the embodiments
The detailed parameters of the invention are further described below.
As shown in Fig. 1, the present invention provides a deep neural network architecture for image question answering (Visual Question Answering).
The data preprocessing and feature extraction from images and text described in step (1) are as follows:
1-1. For feature extraction from image data, we use the MS-COCO dataset as training and test data and extract visual features with an existing Faster-RCNN model. Specifically, the image data is input to the Faster-RCNN network, which detects and boxes 10 to 100 targets in the image. A 2048-dimensional visual feature is extracted from the image region of each target, giving V, and the coordinates and size {x, y, w, h} of each target's box are recorded as the geometric feature of that target, giving G, where $V = \{v_1, v_2, \ldots, v_k\}$, $G = \{g_1, g_2, \ldots, g_k\}$, and $k \in [10, 100]$.
1-2. For the question texts, the distinct words in the question texts of the dataset are first counted, and the 9847 words whose frequency is higher than 5 are recorded in a dictionary.
1-3. Only the first 16 words of each question are kept, and a question with fewer than 16 words is padded with null characters. Each word is replaced by its index in the word dictionary generated in 1-2, which converts the character strings to numeric values, so that each question becomes a 16-element word index vector.
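Steps 1-2 and 1-3 can be sketched in plain Python as follows. Whitespace tokenization, lowercasing, and the reserved indices for padding and unknown words are assumptions; the frequency threshold of 5 and the length of 16 come from the text.

```python
from collections import Counter

PAD_INDEX = 0   # assumed index for the padding (null) token
UNK_INDEX = 1   # assumed index for words outside the dictionary
MAX_LEN = 16    # question length used in the embodiment

def build_vocab(questions, min_freq=5):
    """Keep the words whose frequency is strictly higher than min_freq."""
    counts = Counter(w for q in questions for w in q.lower().split())
    vocab = {"<pad>": PAD_INDEX, "<unk>": UNK_INDEX}
    for word, freq in sorted(counts.items()):
        if freq > min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(question, vocab, max_len=MAX_LEN):
    """Truncate to max_len words and right-pad with PAD_INDEX."""
    ids = [vocab.get(w, UNK_INDEX) for w in question.lower().split()[:max_len]]
    return ids + [PAD_INDEX] * (max_len - len(ids))

# Toy corpus: the first question appears six times, the second once.
questions = ["what color are her glasses"] * 6 + ["what is the mustache made of"]
vocab = build_vocab(questions)
ids = encode("what color are the glasses", vocab)
```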
Step (2): the adaptive attention module (Adaptive Attention Module, AAM) enhanced by candidate-box geometric features learns the associations between the target features V and the geometric features G of the image and reorders the input original attention information m, as follows:
2-1. The input attention weight vector m is processed first: the rank pos of each target's attention value in $m = \{m_1, m_2, \ldots, m_k\}$, sorted by value, is encoded to obtain the matrix PE based on the attention information m.
2-2. PE is mapped to 128 dimensions and added to V, which is also mapped to 128 dimensions. The sum is layer-normalized to obtain the matrix $V_A$ of size 100x128.
2-3. A pairwise relation computation is applied to the features G, and encoding with formula (2) gives a matrix of dimension 100x100x64. The last dimension of this matrix is mapped to a single value and passed through the ReLU activation function to obtain the matrix $G_R$ of dimension 100x100.
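The shapes in step 2-3 can be reproduced with the sketch below. The formulas that define the pairwise geometry are not preserved in this text, so the four log-ratio features used here follow the relative-geometry definition common in object relation modules and are an assumption, as are the sinusoidal embedding with 16 dimensions per feature (4 x 16 = 64), the base 1000, and all function names.

```python
import numpy as np

def relative_geometry(G, eps=1e-3):
    """Assumed pairwise geometry: log-scaled offsets and size ratios, shape (k, k, 4)."""
    x, y, w, h = G[:, 0], G[:, 1], G[:, 2], G[:, 3]
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def embed(omega, dim_per_feature=16, base=1000.0):
    """Assumed sinusoidal embedding of each geometry value, shape (k, k, 64)."""
    i = np.arange(dim_per_feature // 2)
    angles = omega[..., None] / base ** (2 * i / dim_per_feature)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*omega.shape[:-1], -1)

def geometry_weights(G, w_g, b_g=0.0):
    """Map the last dimension to one value and apply ReLU, shape (k, k)."""
    GE = embed(relative_geometry(G))
    return np.maximum(GE @ w_g + b_g, 0.0)

rng = np.random.default_rng(0)
k = 100
G = np.column_stack([rng.uniform(0, 500, k), rng.uniform(0, 500, k),
                     rng.uniform(10, 200, k), rng.uniform(10, 200, k)])
w_g = rng.normal(size=64) * 0.1     # hypothetical linear-layer weights
G_R = geometry_weights(G, w_g)
```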
2-4. $V_A$ and $G_R$ are input to the relation (Relation) module for reasoning. The feature of each target in $V_A$ is first mapped to 128 dimensions, and the pairwise dot products of the target features then give the 100x100 matrix $V_R$. $V_R$ and $G_R$ are combined to compute a 100x100 matrix, which is used to take a weighted average over the targets in $V_A$, giving the 100x128 matrix $O_{relation}$.
2-5. $O_{relation}$ is passed through a fully connected layer and a sigmoid, and the result is multiplied by the original m to obtain the new 100-dimensional attention vector.
The deep neural network described in step (3) is built as follows:
3-1. For the question text features, the text input is the 16-dimensional index vector generated in step (1). Each word index is converted to its word vector with word embedding; the word vector size used here is 1024, so each question text becomes a matrix of size 16x1024. The input visual features are zero-padded to a 100x2048 matrix and mapped to a 100x1024 matrix by a linear layer. The word vector at each time step is then used as the input of the LSTM, a recurrent neural network structure, whose output is set to a 1024-dimensional vector q.
3-2. The output vector q of the LSTM is input to the attention module to obtain the preliminary 100-dimensional attention feature m. This completes the extraction of the image attention information (Attention).
3-3. m, V, and G are input to the adaptive attention module (Adaptive Attention Module, AAM) model enhanced by candidate-box geometric features, as in step (2). The model reasons over the features of V and G and reorders m to obtain the new 100-dimensional attention feature. This completes the reasoning about the relationships between the targets in the image and the reordering of the attention.
3-4. A weighted average of the 100x1024 feature V with the new 100-dimensional attention vector gives the 1024-dimensional attended visual feature.
3-5. The attended visual feature obtained after reordering is fused with the output vector q of the LSTM and passed successively through an FC layer and a softmax operation, where FC is the fully connected operation of a neural network. The final output is a 9487-dimensional prediction vector, in which each element is the probability that the answer indexed by that element is the answer to the given question.
The model training described in step (4) is as follows:
The 9487-dimensional prediction vector generated in step (3) is compared with the correct answer to the question. The loss function defined above computes the difference between the predicted value and the actual correct value to form a loss value, and the parameter values of the whole network are adjusted according to this loss value with the BP algorithm, so that the gap between the network's prediction and the actual value gradually decreases until the network converges.

Claims (5)

1. An image question answering method based on multi-object relational deep reasoning, characterized by comprising the following steps:
Step (1), data preprocessing: extract features from the image and text data
The image is preprocessed first:
The target entities contained in the image are detected with the Faster-RCNN deep neural network; the visual features V and the geometric features G, which contain the size and coordinate information of each target, are extracted from the image;
The text data is then preprocessed:
The sentence lengths of the given question texts are counted, and the maximum length of a question text is set according to these statistics; a vocabulary dictionary of the question texts is built, each word of a question is replaced with its index in the vocabulary dictionary, and the result is passed through an LSTM, so that the question text is converted into a vector q;
Step (2), the attention module enhanced by candidate-box geometric features
The three inputs are the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m;
The attention weight vector m is first rank-encoded: it is converted into a vector according to the order of the weight magnitudes, mapped to a high dimension, and added to the visual features V, which are mapped to the same high dimension; the sum is layer-normalized to obtain V_A;
The geometric features G are then mapped by a linear layer and passed through the ReLU activation function to obtain G_R; V_A and G_R are input to the candidate-box relation module, which reasons over them to obtain O_relation; O_relation is passed through a linear layer and a sigmoid function and multiplied by the original attention weight vector m to obtain the new attention weight vector;
Step (3), building the deep neural network
The question text is first converted into an index vector according to the vocabulary dictionary; the vector is then mapped to a high dimension and fed into a long short-term memory network (Long Short Term Memory, LSTM); the output vector q is fused with the visual features V obtained with Faster R-CNN by the Hadamard product, and an attention module produces the attention weight vector m over the entity features; the attention weight vector m, the visual features V, and the geometric features G are input to the adaptive attention module enhanced by candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and reorders the attention weight vector to obtain a new attention weight vector; the new attention weight vector is multiplied with the visual features V and a weighted average is taken to obtain a new visual feature; this visual feature is fused with the question text vector q by the Hadamard product, a softmax function produces probabilities, and these probabilities are the output prediction of the network;
Step (4), model training
According to the difference between the generated prediction and the ground truth for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image question answering method based on multi-object relational deep reasoning according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Features are extracted from image i with the existing deep neural network Faster-RCNN; the extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, where $V = \{v_1, v_2, \ldots, v_k\}$, $G = \{g_1, g_2, \ldots, g_k\}$, and $k \in [10, 100]$; the visual vector of a single target is $v_i$ and the geometric feature of a single target is $g_i = \{x, y, w, h\}$, where x, y, w, h are the position parameters of the geometric feature, denoting the abscissa, the ordinate, the width, and the height of the candidate box containing the entity in the image;
1-2. For the given question texts, the distinct words in the question texts of the dataset are first counted and recorded in a dictionary; the words in each word list are converted to index values according to the word dictionary, so that each question text is converted to a fixed-length index vector. The formula is as follows:
where each element is the index value of word $w_k$ in the dictionary and l is the length of the question text.
3. a kind of image answering method based on multiple target association depth reasoning according to claim 2, it is characterised in that Adaptability based on the enhancing of candidate frame geometrical characteristic described in step (2) pays attention to power module depth reasoning network, specific as follows:
2-1. is first handled the attention weight vectors vector m of input;By each target attention weight vectors m in m {m1, m2..., mkValue sequence serial number pos encoded,Itself specific formula is as follows:
WhereinPos ∈ [1,2 ..., k], obtains the matrix based on attention weight vectors m
Matrix PE is added after different linear layers by 2-2. respectively with visual signature V, and layer normalized is passed through in output Obtain VA, specific formula is as follows:
VA=LayerNorm (WPEPET+WVVT) (formula 3)
Wherein
2-3. is associated calculating to geometrical characteristic G, passes it through linear layer and obtains GR, specific formula is as follows:
GR=WGΩ(G)T(formula 4)
Wherein, m, n ∈ [1,2 ..., k], GE are encoded using formula (2),
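2-4. Feed $V_A$ and $G_R$ into the relation module for reasoning to obtain $O_{relation}$. The specific formula is as follows:
$O_{relation} = \mathrm{softmax}(\log(G_R) + V_R)\cdot(W_O V_A + b_O)$ (formula 7)
where $W_O$ and $b_O$ are the weight and bias of a linear layer.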
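The structure of formula 7 can be sketched numerically. The definition of $V_R$ is not reproduced in this text, so it is simply passed in as a k × k score matrix. Clamping $G_R$ at a small `eps` before taking the logarithm, and the name `values` for the projected features $W_O V_A + b_O$, are assumptions of this sketch.

```python
import math


def softmax(row):
    peak = max(row)                       # subtract the maximum for stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation(G_R, V_R, values, eps=1e-6):
    """For each target m, weight every target n by softmax(log G_R + V_R)
    and sum the projected features values[n]."""
    out = []
    for g_row, v_row in zip(G_R, V_R):
        logits = [math.log(max(g, eps)) + v for g, v in zip(g_row, v_row)]
        weights = softmax(logits)
        dim = len(values[0])
        out.append([sum(w * val[j] for w, val in zip(weights, values))
                    for j in range(dim)])
    return out


G_R = [[1.0, 1.0], [1.0, 0.0]]     # geometric weights between two targets
V_R = [[0.0, 0.0], [0.0, 0.0]]     # visual scores (all equal here)
values = [[1.0, 0.0], [0.0, 1.0]]  # stands in for W_O V_A + b_O
O_relation = relation(G_R, V_R, values)
```

With equal visual scores, the first target attends to both targets evenly, while the zero geometric weight in the second row almost entirely suppresses the second target's contribution.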
2-5. Pass $O_{relation}$ through a fully connected layer followed by a sigmoid function, then multiply the result by the original attention weight vector m to obtain the new attention weight vector. The specific formula is as follows:
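The gating in step 2-5 might look like the sketch below. The fully connected layer is reduced to a single weight vector `w` and bias `b`, which are hypothetical parameters, and the text does not say whether the result is renormalised afterwards, so no renormalisation is applied here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def regate(O_relation, m, w, b):
    """Scale each original weight m_i by a gate in (0, 1) computed from
    target i's relation feature."""
    gates = [sigmoid(sum(wj * oj for wj, oj in zip(w, o)) + b) for o in O_relation]
    return [g * mi for g, mi in zip(gates, m)]


O_relation = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
m = [0.2, 0.5, 0.3]
m_new = regate(O_relation, m, w=[1.0, 1.0], b=0.0)
```

Because every gate lies strictly between 0 and 1, each new weight is smaller than the original one, and targets with a strong relation feature keep more of their weight.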
4. The image question-answering method based on multi-target association deep reasoning according to claim 3, characterized in that the construction of the deep neural network described in step (3) is specifically as follows:
3-1. Map the question text vector q and the visual features V into a common space through the linear transformations of fully connected layers, then fuse them with the Hadamard product. $F_{fusion}$ denotes the fused feature in the common space; $W_r$ and $W_q$ denote the fully connected layer parameters of the linear transformations applied to the visual features V and to the current state information q, respectively; the product symbol denotes the Hadamard (element-wise) product of two matrices; $W_m$ denotes the fully connected layer parameters that reduce the dimension of the fused feature and generate the attention weight distribution; m is the initial attention weight vector, and j indicates that the attention weight of the j-th region is currently being computed. The specific formula is as follows:
$m = \mathrm{softmax}(W_m F_{fusion} + b_m)$ (formula 10)
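Step 3-1 can be sketched as follows. The formula for $F_{fusion}$ is not reproduced in this text, so the form used here, the element-wise product of $W_r v_j$ and $W_q q$, is inferred from the description and should be read as an assumption. All matrices are toy values.

```python
import math


def linear(W, x):
    """Apply a bias-free linear layer to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def initial_attention(V, q, W_r, W_q, w_m, b_m):
    """Fuse each region with the question and score it (formula 10)."""
    q_proj = linear(W_q, q)
    scores = []
    for v in V:                                          # j-th region
        v_proj = linear(W_r, v)
        fused = [a * b for a, b in zip(v_proj, q_proj)]  # Hadamard product
        scores.append(sum(w * f for w, f in zip(w_m, fused)) + b_m)
    return softmax(scores)


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
m = initial_attention(V, q, W_r=identity, W_q=identity, w_m=[1.0, 1.0], b_m=0.0)
```

The three regions score 1, 2 and 3 before the softmax, so the third region receives the largest initial attention weight.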
3-2. Following step (2), feed m, V and G into the adaptive attention module enhanced by candidate-box geometric features, which reasons over the features of V and G and reorders m to obtain the new attention feature.
3-3. Multiply the new attention feature with the features of V and take the weighted average to obtain the attended visual feature vector. The specific formula is as follows:
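Step 3-3 amounts to a weighted combination of the target features. The formula is not reproduced in this text, so whether the weights are renormalised is unknown; the `normalise` flag below is a hypothetical switch covering both readings.

```python
def attend(weights, V, normalise=False):
    """Combine the rows of V using one weight per target.

    With normalise=True the weights are divided by their sum first.
    """
    total = sum(weights) if normalise else 1.0
    dim = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(dim)]


V = [[1.0, 0.0], [0.0, 1.0]]
v_attended = attend([0.25, 0.75], V)
v_renormalised = attend([0.1, 0.3], V, normalise=True)
```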
5. The image question-answering method based on multi-target association deep reasoning according to claim 4, characterized in that the model training described in step (4) is specifically as follows:
The questions in the VQA-v2.0 dataset are answered by multiple people, so the same question may have different correct answers. Previous image question-answering models treated the answer with the most votes as the only correct answer and applied one-hot encoding to it. Because correct answers are diverse, all answers to the same question are instead put to a vote, and the weight of each correct answer among all correct answers is determined by its vote count. The Kullback-Leibler divergence loss function is used: let N denote the length of the answer vocabulary, Predict denote the predicted distribution, and GT denote the ground truth; the loss is then defined as follows:
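The soft labels and loss can be sketched as below. The loss formula is not reproduced in this text, so the standard Kullback-Leibler form, the sum over the N answers of GT_i · log(GT_i / Predict_i), is assumed. The vote counts, the three-answer vocabulary and the `eps` clamp are illustrative.

```python
import math


def soft_target(votes, answer_vocab):
    """Turn vote counts into a distribution over the answer vocabulary.

    Assumes every voted answer appears in answer_vocab.
    """
    total = sum(votes.values())
    return [votes.get(answer, 0) / total for answer in answer_vocab]


def kl_divergence(gt, predict, eps=1e-12):
    """KL(GT || Predict); terms with GT_i = 0 contribute nothing."""
    return sum(g * math.log(g / max(p, eps)) for g, p in zip(gt, predict) if g > 0)


answer_vocab = ["red", "orange", "brown"]             # N = 3
GT = soft_target({"red": 7, "orange": 3}, answer_vocab)
loss_same = kl_divergence(GT, GT)
loss_uniform = kl_divergence(GT, [1 / 3, 1 / 3, 1 / 3])
```

Unlike a one-hot target, GT keeps the minority answer "orange" with weight 0.3, and the loss is zero only when the prediction matches the full vote distribution.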
CN201910398140.1A 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning Active CN110263912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910398140.1A CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Publications (2)

Publication Number Publication Date
CN110263912A (en) 2019-09-20
CN110263912B (en) 2021-02-26

Family

ID=67914695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910398140.1A Active CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Country Status (1)

Country Link
CN (1) CN110263912B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
US20190073353A1 (en) * 2017-09-07 2019-03-07 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN109712108A (en) * 2018-11-05 2019-05-03 杭州电子科技大学 It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 The method for solving video question-answering task using the progressive space-time attention network of knowledge base

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI,ET AL: "《Attention Is All You Need》", 《ARXIV:1706.03762V5》 *
KAN CHEN,ET AL: "《ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering》", 《ARXIV:1511.05960V2》 *
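俞俊 (Yu Jun), et al.: 《视觉问答技术研究》 (Research on visual question answering techniques), 《计算机研究与发展》 (Journal of Computer Research and Development) *
李庆 (Li Qing): 《基于深度神经网络和注意力机制的图像问答研究》 (Research on image question answering based on deep neural networks and attention mechanisms), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *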

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879844B (en) * 2019-10-25 2022-10-14 北京大学 Cross-media reasoning method and system based on heterogeneous interactive learning
CN110879844A (en) * 2019-10-25 2020-03-13 北京大学 Cross-media reasoning method and system based on heterogeneous interactive learning
CN110889505A (en) * 2019-11-18 2020-03-17 北京大学 Cross-media comprehensive reasoning method and system for matching image-text sequences
CN110889505B (en) * 2019-11-18 2023-05-02 北京大学 Cross-media comprehensive reasoning method and system for image-text sequence matching
CN111598118A (en) * 2019-12-10 2020-08-28 中山大学 Visual question-answering task implementation method and system
CN111598118B (en) * 2019-12-10 2023-07-07 中山大学 Visual question-answering task implementation method and system
CN111553372A (en) * 2020-04-24 2020-08-18 北京搜狗科技发展有限公司 Training image recognition network, image recognition searching method and related device
CN111553372B (en) * 2020-04-24 2023-08-08 北京搜狗科技发展有限公司 Training image recognition network, image recognition searching method and related device
CN111611367A (en) * 2020-05-21 2020-09-01 拾音智能科技有限公司 Visual question answering method introducing external knowledge
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN111737458B (en) * 2020-05-21 2024-05-21 深圳赛安特技术服务有限公司 Attention mechanism-based intention recognition method, device, equipment and storage medium
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111897939B (en) * 2020-08-12 2024-02-02 腾讯科技(深圳)有限公司 Visual dialogue method, training method, device and equipment for visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN112016493A (en) * 2020-09-03 2020-12-01 科大讯飞股份有限公司 Image description method and device, electronic equipment and storage medium
CN112309528A (en) * 2020-10-27 2021-02-02 上海交通大学 Medical image report generation method based on visual question-answering method
CN112309528B (en) * 2020-10-27 2023-04-07 上海交通大学 Medical image report generation method based on visual question-answering method
CN113010712A (en) * 2021-03-04 2021-06-22 天津大学 Visual question answering method based on multi-graph fusion
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113094484A (en) * 2021-04-07 2021-07-09 西北工业大学 Text visual question-answering implementation method based on heterogeneous graph neural network
CN113326933B (en) * 2021-05-08 2022-08-09 清华大学 Attention mechanism-based object operation instruction following learning method and device
CN113326933A (en) * 2021-05-08 2021-08-31 清华大学 Attention mechanism-based object operation instruction following learning method and device
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113220859B (en) * 2021-06-01 2024-05-10 平安科技(深圳)有限公司 Question answering method and device based on image, computer equipment and storage medium
CN113220859A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Image-based question and answer method and device, computer equipment and storage medium
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113392253B (en) * 2021-06-28 2023-09-29 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113515615A (en) * 2021-07-09 2021-10-19 天津大学 Visual question-answering method based on capsule self-guide cooperative attention mechanism
CN113792703A (en) * 2021-09-29 2021-12-14 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention deep modular network
CN113792703B (en) * 2021-09-29 2024-02-02 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention depth modular network
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114564958B (en) * 2022-01-11 2023-08-04 平安科技(深圳)有限公司 Text recognition method, device, equipment and medium
CN114564958A (en) * 2022-01-11 2022-05-31 平安科技(深圳)有限公司 Text recognition method, device, equipment and medium
CN117274616A (en) * 2023-09-26 2023-12-22 南京信息工程大学 Multi-feature fusion deep learning service QoS prediction system and prediction method
CN117274616B (en) * 2023-09-26 2024-03-29 南京信息工程大学 Multi-feature fusion deep learning service QoS prediction system and prediction method

Also Published As

Publication number Publication date
CN110263912B (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN110263912A (en) A kind of image answering method based on multiple target association depth reasoning
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN112766158B (en) Multi-task cascading type face shielding expression recognition method
CN109299701A (en) Expand the face age estimation method that more ethnic group features cooperate with selection based on GAN
CN106778506A (en) A kind of expression recognition method for merging depth image and multi-channel feature
CN111681178B (en) Knowledge distillation-based image defogging method
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN111161200A (en) Human body posture migration method based on attention mechanism
CN110046550A (en) Pedestrian's Attribute Recognition system and method based on multilayer feature study
CN109712108A (en) It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Gupta et al. Rv-gan: Recurrent gan for unconditional video generation
CN114581502A (en) Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN112906520A (en) Gesture coding-based action recognition method and device
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN116386102A (en) Face emotion recognition method based on improved residual convolution network acceptance block structure
CN114723784A (en) Pedestrian motion trajectory prediction method based on domain adaptation technology
CN110222568A (en) A kind of across visual angle gait recognition method based on space-time diagram
CN116185182B (en) Controllable image description generation system and method for fusing eye movement attention
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
Xu et al. Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network
CN117351382A (en) Video object positioning method and device, storage medium and program product thereof
Pu et al. Differential residual learning for facial expression recognition
CN118093840B (en) Visual question-answering method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant