CN110263912A - Image question-answering method based on multi-target association deep reasoning - Google Patents
- Publication number
- CN110263912A CN110263912A CN201910398140.1A CN201910398140A CN110263912A CN 110263912 A CN110263912 A CN 110263912A CN 201910398140 A CN201910398140 A CN 201910398140A CN 110263912 A CN110263912 A CN 110263912A
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- feature
- attention
- weight vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention discloses an image question-answering method based on multi-target association deep reasoning. The method comprises the following steps: 1. preprocess the image and the natural-language text describing it; 2. build an adaptive attention module enhanced with candidate-box geometric features, which re-orders the attention weights of the individual targets; 3. build a neural network structure around the AAM model; 4. train the neural network parameters with the back-propagation algorithm. The invention proposes a deep neural network for image question answering; in particular, it models image-question data jointly, reasons over the features of each target in the image, and re-orders the attention over the targets so that questions are answered more accurately, achieving better results in the image question-answering field.
Description
Technical field
The present invention relates to a deep neural network architecture for the image question-answering (Visual Question Answering) task, and in particular to a method that jointly models image-question data, discovers the interactions between the features of each entity in the image and the geometric features of their corresponding spatial positions, and adaptively adjusts the attention weights by modeling the positional relationships between them.
Background technique
Image question answering is an emerging task at the intersection of computer vision and natural language processing. Given a question about an image, the task is to let a machine answer it automatically. Compared with image captioning, another task at this intersection, image question answering is undoubtedly more complex, because the machine must obtain the correct result by understanding and reasoning over both the image and the question. A sentence such as "What color are her glasses?" carries rich semantic information: to answer it, the machine must first locate the region around the woman's eyes in the image, and then answer according to the keyword "color". For a question such as "What is the beard made of?", the machine may not be able to find the beard directly, but it can estimate from the position of the face which region the beard should occupy, focus on that region, and then answer according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep convolutional neural networks (Convolutional Neural Networks, CNN) or deep recurrent neural networks (Recurrent Neural Networks, RNN) has become the mainstream research direction in computer vision and natural language processing. Introducing the idea of end-to-end modeling into image question-answering research, and choosing an appropriate network structure so that a computer can automatically answer a question about an input image, is a problem well worth studying in depth.
It has long been recognized in computer vision that associations between contextual information, or between objects, help strengthen a model. However, most methods exploiting this information predate the spread of deep learning. In the current deep-learning era, little substantial progress has been made in exploiting the relational information between objects, especially in the image question-answering domain: most methods still attend to each entity separately. Objects in an image vary in two-dimensional spatial position, scale and aspect ratio, and an image question-answering model must rely on the correlations between entities to reason about the question. The positional information of objects, that is, their geometric features in the general sense, therefore plays a complex and important role in image question-answering models.
In terms of practical application, image question-answering algorithms have a wide range of application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft's HoloLens) and augmented reality, automatic question-answering systems over visual content may in the near future become an important mode of human-computer interaction. This technology can help people, especially the visually impaired, to better perceive and understand the world.
In conclusion, image question answering based on end-to-end modeling is a direction worth studying in depth. This project cuts in at several key difficulties of the task, solves the problems of current methods, and ultimately forms a complete image question-answering system.
Because image content under natural scenes is complex, with diverse subjects, and natural-language descriptions have high degrees of freedom, describing image content faces great challenges. Specifically, there are two main difficulties:
(1) How to extract effective features from the cross-media image-question data. Feature extraction is a classic, fundamental problem in cross-media representation research; common methods include image-processing features such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) and Haar features. In addition, features extracted with models built on deep-learning theory, such as ResNet, GoogLeNet and Faster R-CNN, have achieved excellent results in many fields, such as fine-grained image classification, natural language processing and recommender systems. Choosing a strategy that guarantees computational efficiency while improving the expressive power of the extracted cross-media features is therefore a direction worth in-depth study.
(2) How to reason about the question by relying on the correlations between entities in the image. The input of an image question-answering algorithm is an image and a question, and the image may contain multiple target entities. The algorithm should extract the feature of each target entity, correctly understand each target, and use the geometric and visual features of the targets to reason about the connections between them. How to let the algorithm automatically learn the connections between the targets of the image and form a more precise cross-media representation is a difficulty of image question-answering algorithms, and also a link vital to the performance of the result.
Summary of the invention
The present invention provides an image question-answering method based on multi-target association deep reasoning: a deep neural network architecture for the image question-answering (Visual Question Answering) task. The invention mainly comprises two points: 1. adopting image features with stronger expressive power that carry geometric information; 2. using the target features in the image to reason about the relationships between the targets.
The technical solution adopted by the present invention to solve its technical problem comprises the following steps:
Step (1): data preprocessing; extract features from the image and text data.
Image preprocessing: use the Faster R-CNN deep neural network structure to detect the target entities contained in the image; extract from the image the visual features V and the geometric features G containing the size and coordinate information of each target.
Text preprocessing: count the sentence lengths of the given question texts and set the maximum question length according to these statistics; build a question-text vocabulary dictionary, replace the words of the question by their index values in the dictionary, and then pass them through an LSTM, converting the question text into a vector q.
Step (2): attention module enhanced with candidate-box geometric features.
Its structure is shown in Fig. 2. The inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m.
First, m is sequentially encoded: it is converted into a vector according to the order of the weight sizes, mapped to a high dimension and added to the visual features V (likewise mapped to a high dimension); the output is passed through layer normalization (LayerNormalization) to obtain V_A.
Then the geometric features G are mapped through a linear layer and passed through the activation function ReLU to obtain G_R. V_A and G_R are fed into the candidate-box relation module (Relation Module), which reasons over them to obtain O_relation, as shown in Fig. 1. O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention vector m̂.
Step (3): building the deep neural network.
Its structure is shown in Fig. 3. First the question text is converted into an index-value vector according to the vocabulary dictionary. The vector is then mapped to a high dimension and passed into a Long Short-Term Memory network (LSTM); its output vector q is fused with the visual features V obtained with Faster R-CNN by means of a Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weight m, the visual features V and the geometric features G are fed into the Adaptive Attention Module (AAM) enhanced with candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and re-orders the attention weights, yielding the new attention vector m̂. The attention vector m̂ is multiplied with the visual features V and averaged to obtain the new visual feature v̂. The visual feature v̂ is fused with the question vector q by a Hadamard product and passed through a softmax function to generate probabilities, and this probability output is the predicted output of the network.
Step (4): model training.
According to the difference between the generated prediction and the actual answer for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Feature extraction is performed on an image i with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]. The visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w, h are the location parameters of the geometric feature: respectively the abscissa, ordinate, width and height of the candidate box around the entity in the image.
1-2. For a given question text, the distinct words appearing in the question texts of the dataset are first recorded in a dictionary. The words of the word list are then converted into index values according to the word dictionary, so that the question text is converted into a fixed-length index vector:
Q = [I(w_1), I(w_2), ..., I(w_l)]  (formula 1)
where I(w_k) is the index value of word w_k in the dictionary and l is the length of the question text.
The adaptive attention-module deep reasoning network enhanced with candidate-box geometric features described in step (2) is as follows:
2-1. The input attention weight vector m is processed first. The weights {m_1, m_2, ..., m_k} of the targets in m are sorted by value, and the resulting serial number pos of each target is encoded as
PE(pos, 2i) = sin(pos / 10000^{2i/d}),  PE(pos, 2i+1) = cos(pos / 10000^{2i/d})  (formula 2)
where i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k], yielding the matrix PE based on the attention weights m.
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added; the output is layer-normalized to obtain V_A:
V_A = LayerNorm(W_PE PE^T + W_V V^T)  (formula 3)
where W_PE and W_V are the linear-layer weights.
2-3. Relational computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R:
G_R = W_G Ω(G)^T  (formula 4)
where Ω(G) collects the relative geometric features GE_{mn} between candidate boxes m, n ∈ [1, 2, ..., k], and GE is encoded using formula (2).
2-4. V_A and G_R are fed into the relation module, which reasons to obtain O_relation:
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where V_R is the matrix of pairwise dot products between the projected target features of V_A.
2-5. O_relation is passed through a fully connected layer and a sigmoid function and multiplied with the original attention weights m to obtain the new attention vector m̂:
m̂ = m ⊙ σ(W' O_relation + b')  (formula 8)
where W' and b' are the fully-connected-layer parameters.
The deep neural network of step (3) is built as follows:
3-1. The question vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and then fused with a Hadamard product:
F_fusion = (W_r V^T) ⊙ (W_q q)  (formula 9)
where F_fusion is the fusion feature in the common space; W_r and W_q are the fully-connected-layer parameters that linearly transform the visual features V and the current state information q, and ⊙ denotes the Hadamard product of two matrices; W_m is the fully-connected-layer parameter that reduces the dimension of the fusion feature and generates the attention weight distribution; j indexes the attention weight of the j-th region currently being computed. The initial attention weight vector m is
m = softmax(W_m F_fusion + b_m)  (formula 10)
3-2. m, V and G are fed into the adaptive attention module enhanced with candidate-box geometric features according to step (2), which reasons over the features of V and G and re-orders m, yielding the new attention feature m̂.
3-3. The attended visual feature vector v̂ is obtained by taking the weighted average of the features of V with m̂:
v̂ = Σ_j m̂_j v_j  (formula 11)
The model of step (4) is trained as follows:
The question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so the same question may have different correct answers. Previous image question-answering models treat the answer with the most votes as the unique correct answer and one-hot encode it (one-hot encoding). Because the correct answers are diverse, all answers to the same question are instead treated as votes, and the weight of each correct answer among all correct answers is determined by its vote count. The Kullback-Leibler divergence is used as the loss function: let N be the length of the answer vocabulary, Predict the predicted distribution and GT the ground-truth distribution; then the loss is defined as
L = Σ_{i=1}^{N} GT_i log(GT_i / Predict_i)
The beneficial effects of the present invention are as follows:
The present invention jointly models image-description data, reasons over the features of each target in the image, and re-orders the attention over the targets so that images are described more accurately. The invention introduces, for the first time, the geometric features implicit in an image, structures them, and makes them reason cooperatively with the entity features in the image; combined with existing visual question-answering techniques, this effectively improves the accuracy of visual question-answering models.
The parameter count of the present invention is small; it is lightweight and efficient, which favors more efficient distributed training and deployment on memory-limited specific hardware.
Detailed description of the invention
Fig. 1: candidate-box relation module (Relation Module)
Fig. 2: adaptive attention module enhanced with candidate-box geometric features
Fig. 3: image question-answering neural network architecture with the adaptive attention module enhanced with candidate-box geometric features
Specific embodiment
The detailed parameters of the invention are further elaborated below.
As shown in Fig. 1, the present invention provides a deep neural network architecture for image question answering (Visual Question Answering).
The data preprocessing and feature extraction for images and text described in step (1) are as follows:
1-1. For image data we use the MS-COCO dataset as training and test data and extract its visual features with the existing Faster R-CNN model. Specifically, the image data is fed into the Faster R-CNN network, which detects and outlines 10-100 targets in the image; for each target a 2048-dimensional visual feature V is extracted from the image, and the coordinates and size {x, y, w, h} of each target's box are recorded as its geometric feature G, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100].
1-2. For the question texts, the distinct words appearing in the question texts of the dataset are counted first, and the 9847 words whose frequency is higher than 5 are recorded in a dictionary.
1-3. Only the first 16 words of each question sentence are kept; if a sentence has fewer than 16 words it is padded with null characters. Each word is replaced by its index value in the word dictionary generated in 1-2, completing the conversion from strings to numbers, so that each question is converted into a vector of 16 word indices.
The Adaptive Attention Module (AAM) model of step (2), enhanced with candidate-box geometric features, learns the associations between the target features V and the geometric features G of the image and re-orders the original attention information m of the input, as follows:
2-1. The input attention weight vector m is processed first: the attention values {m_1, m_2, ..., m_k} of the targets in m are sorted, their serial numbers pos are encoded, and the matrix PE based on the attention information m is obtained.
2-2. PE is mapped to 128 dimensions and added to V, likewise mapped to 128 dimensions; the output is layer-normalized, yielding a matrix V_A of size 100x128.
2-3. The relational computation on the features G is encoded by formula (2) into a matrix of dimension 100x100x64; the last dimension of this matrix is mapped to a single value and passed through the activation function ReLU, yielding a matrix G_R of dimension 100x100.
2-4. V_A and G_R are fed into the relation (Relation) module for reasoning: each target feature in V_A is first mapped to 128 dimensions, then the mutual dot products of the target features give a 100x100 matrix V_R. The joint computation of V_R with G_R yields a 100x100 matrix, which takes a weighted average over the targets in V_A to give the 100x128 matrix O_relation.
2-5. O_relation is passed through a fully connected layer and a sigmoid and multiplied with the original m, yielding the new 100-dimensional attention vector m̂.
The deep neural network of step (3) is built as follows:
3-1. For the question-text features, the text input here is the 16-dimensional index-value vector generated in step (1). Each word index is converted into its corresponding word vector using word-embedding technology; the word-vector size used here is 1024, so each question text becomes a matrix of size 16x1024. The input visual features are zero-padded into a 100x2048 matrix and mapped by a linear layer into a 100x1024 matrix. The word vector of each time step is then used as the input of an LSTM, a recurrent neural network structure, whose output we set to a 1024-dimensional vector q.
3-2. The output vector q of the LSTM is fed into the attention module, which produces a preliminary 100-dimensional attention feature m; at this point the image attention-information extraction (Attention) operation is complete.
3-3. m, V and G are fed into the Adaptive Attention Module (AAM) model enhanced with candidate-box geometric features according to step (2), which reasons over the features of V and G and re-orders m, yielding the new 100-dimensional attention feature m̂; at this point the relational reasoning between the targets in the image and the re-ordering of the attention are complete.
3-4. The weighted average of the 100-dimensional vector m̂ with the features V of dimension 100x1024 yields the 1024-dimensional attended visual feature v̂.
3-5. The re-ordered, attention-bearing visual feature v̂ generated above is fused with the LSTM output vector q and passed successively through an FC layer (a fully connected neural-network operation) and a softmax, finally outputting a 9487-dimensional prediction vector over the answer words, in which each element represents the probability that the answer indexed by that element is the answer to the given question.
The model of step (4) is trained as follows:
The 9487-dimensional prediction vector generated in step (3) is compared with the correct answer to the question: the loss function we define computes the difference between the predicted and actual correct values to form a loss, and according to this loss the parameters of the whole network are adjusted with the BP algorithm, so that the gap between the network's predictions and the actual values gradually shrinks until the network converges.
Claims (5)
1. An image question-answering method based on multi-target association deep reasoning, characterized by comprising the following steps:
Step (1): data preprocessing; extract features from the image and text data.
Image preprocessing: use the Faster R-CNN deep neural network structure to detect the target entities contained in the image; extract from the image the visual features V and the geometric features G containing the size and coordinate information of each target.
Text preprocessing: count the sentence lengths of the given question texts and set the maximum question length according to these statistics; build a question-text vocabulary dictionary, replace the words of the question by their index values in the dictionary, and then pass them through an LSTM, converting the question text into a vector q.
Step (2): attention module enhanced with candidate-box geometric features.
The inputs are three features: the geometric features G of the candidate-box positions, the visual features V and the attention weight vector m. First m is sequentially encoded: it is converted into a vector according to the order of the weight sizes, mapped to a high dimension and added to the visual features V (likewise mapped to a high dimension); the output is layer-normalized to obtain V_A. Then the geometric features G are mapped through a linear layer and passed through the activation function ReLU to obtain G_R. V_A and G_R are fed into the candidate-box relation module, which reasons to obtain O_relation; O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention weight vector m̂.
Step (3): building the deep neural network.
First the question text is converted into an index-value vector according to the vocabulary dictionary; the vector is then mapped to a high dimension and passed into a Long Short-Term Memory network (LSTM). Its output vector q is fused with the visual features V obtained with Faster R-CNN by means of a Hadamard product, and an attention module produces the attention weight vector m of each entity feature. The attention weight vector m, the visual features V and the geometric features G are fed into the adaptive attention module enhanced with candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and re-orders the attention weights, yielding the new attention weight vector m̂. The attention weight vector m̂ is multiplied with the visual features V and averaged to obtain the new visual feature v̂; v̂ is fused with the question vector q by a Hadamard product and passed through a softmax function to generate probabilities, and this probability output is the predicted output of the network.
Step (4): model training.
According to the difference between the generated prediction and the actual answer for the image, the model parameters of the neural network in step (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image question-answering method based on multi-target association deep reasoning according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Feature extraction is performed on an image i with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]; the visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, where x, y, w, h are the location parameters of the geometric feature: respectively the abscissa, ordinate, width and height of the candidate box around the entity in the image.
1-2. For a given question text, the distinct words appearing in the question texts of the dataset are first recorded in a dictionary; the words of the word list are converted into index values according to the word dictionary, so that the question text is converted into a fixed-length index vector:
Q = [I(w_1), I(w_2), ..., I(w_l)]  (formula 1)
where I(w_k) is the index value of word w_k in the dictionary and l is the length of the question text.
3. The image question-answering method based on multi-target association deep reasoning according to claim 2, characterized in that the adaptive attention-module deep reasoning network enhanced with candidate-box geometric features described in step (2) is as follows:
2-1. The input attention weight vector m is processed first: the weights {m_1, m_2, ..., m_k} of the targets in m are sorted by value, and the resulting serial number pos of each target is encoded as
PE(pos, 2i) = sin(pos / 10000^{2i/d}),  PE(pos, 2i+1) = cos(pos / 10000^{2i/d})  (formula 2)
where i ∈ [0, 1, ..., d/2] and pos ∈ [1, 2, ..., k], yielding the matrix PE based on the attention weight vector m.
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added; the output is layer-normalized to obtain V_A:
V_A = LayerNorm(W_PE PE^T + W_V V^T)  (formula 3)
2-3. Relational computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R:
G_R = W_G Ω(G)^T  (formula 4)
where Ω(G) collects the relative geometric features GE_{mn} between candidate boxes m, n ∈ [1, 2, ..., k], and GE is encoded using formula (2).
2-4. V_A and G_R are fed into the relation module, which reasons to obtain O_relation:
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where V_R is the matrix of pairwise dot products between the projected target features of V_A.
2-5. O_relation is passed through a fully connected layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention weight vector m̂:
m̂ = m ⊙ σ(W' O_relation + b')  (formula 8)
4. The image question-answering method based on multi-target association depth reasoning according to claim 3, characterized in that the deep neural network of step (3) is built as follows:
3-1. Map the question text vector q and the visual feature V into a common space via fully connected linear transformations, then fuse them with the Hadamard product. F_fusion denotes the fused feature in the common space; W_r and W_q denote the fully connected layer parameters of the linear transformations applied to the visual feature V and the current state information q, respectively; the symbol ⊙ denotes the Hadamard product of two matrices; W_m denotes the fully connected layer parameters that reduce the dimension of the fused feature and generate the attention weight distribution; m is the initial attention weight vector, and j indexes the attention weight of the j-th region currently being computed. The specific formula is as follows:
m = softmax(W_m · F_fusion + b_m) (formula 10)
3-2. Feed m, V, and G into the candidate-box-geometry-enhanced adaptive attention module of step (2), which reasons over the features of V and G and reorders m, obtaining the new attention feature m̃.
3-3. Multiply m̃ with the features of V and take the weighted average to obtain the visual feature vector v̂:
v̂ = Σ_j m̃_j · v_j
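Steps 3-1 and 3-3 amount to Hadamard-product fusion followed by softmax attention and a weighted sum. The NumPy sketch below assumes random matrices for the learned parameters W_r, W_q, and W_m and illustrative dimensions only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fused_attention(V, q, d=32, seed=0):
    """F_fusion = (W_r V) ⊙ (W_q q); m = softmax(W_m F_fusion + b_m) (formula 10)."""
    rng = np.random.default_rng(seed)
    k, dv = V.shape
    W_r = rng.normal(size=(d, dv)) * 0.1   # projects each region feature
    W_q = rng.normal(size=(d, q.shape[0])) * 0.1  # projects the question vector
    F_fusion = (V @ W_r.T) * (q @ W_q.T)   # (k, d): Hadamard product, q broadcast
    W_m, b_m = rng.normal(size=d) * 0.1, 0.0
    return softmax(F_fusion @ W_m + b_m)   # one attention weight per region

def attended_feature(m_tilde, V):
    """Step 3-3: weighted average of the region features."""
    return m_tilde @ V

k, dv, dq = 4, 8, 6
rng = np.random.default_rng(2)
V, q = rng.normal(size=(k, dv)), rng.normal(size=dq)
m = fused_attention(V, q)
v_hat = attended_feature(m, V)
print(m.shape, v_hat.shape)                # → (4,) (8,)
```

Projecting both modalities into one common space before the element-wise product is what lets a single softmax rank regions by their relevance to the question.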
5. The image question-answering method based on multi-target association depth reasoning according to claim 4, characterized in that the model of step (4) is trained as follows:
The question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so the same question may have several different correct answers. Earlier image question-answering models treat the answer with the highest vote count as the unique correct answer and apply one-hot encoding to it. Because correct answers are diverse, all answers to the same question are instead tallied, and the vote counts determine the weight of each correct answer among all correct answers; a Kullback-Leibler divergence loss function is then used. Let N denote the length of the answer vocabulary, Predict the predicted distribution, and GT the ground-truth distribution. The loss is defined as:
L_KL = Σ_{i=1}^{N} GT_i · log(GT_i / Predict_i)
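The soft-label construction and KL loss described above can be sketched as follows; the three-word answer vocabulary and the vote counts are made-up illustrative data, not from the patent:

```python
import numpy as np

def soft_answer_targets(votes, vocab):
    """Turn annotator vote counts into a weight distribution over the answer vocabulary."""
    gt = np.zeros(len(vocab))
    for answer, count in votes.items():
        gt[vocab[answer]] = count
    return gt / gt.sum()

def kl_loss(gt, pred, eps=1e-9):
    """L_KL = sum_i GT_i * log(GT_i / Predict_i) over the vocabulary of length N."""
    return float(np.sum(gt * np.log((gt + eps) / (pred + eps))))

vocab = {"red": 0, "blue": 1, "green": 2}   # hypothetical answer vocabulary
votes = {"red": 7, "blue": 3}               # 10 annotators, two correct answers
gt = soft_answer_targets(votes, vocab)      # [0.7, 0.3, 0.0] instead of one-hot
pred = np.array([0.7, 0.3, 0.0])
print(round(kl_loss(gt, pred), 6))          # prints 0.0 (perfect match)
```

Unlike one-hot cross-entropy, this target rewards the model for spreading probability mass across every answer the annotators considered correct, in proportion to the votes.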
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910398140.1A CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263912A true CN110263912A (en) | 2019-09-20 |
CN110263912B CN110263912B (en) | 2021-02-26 |
Family
ID=67914695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910398140.1A Active CN110263912B (en) | 2019-05-14 | 2019-05-14 | Image question-answering method based on multi-target association depth reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263912B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879844A (en) * | 2019-10-25 | 2020-03-13 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110889505A (en) * | 2019-11-18 | 2020-03-17 | 北京大学 | Cross-media comprehensive reasoning method and system for matching image-text sequences |
CN111553372A (en) * | 2020-04-24 | 2020-08-18 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111598118A (en) * | 2019-12-10 | 2020-08-28 | 中山大学 | Visual question-answering task implementation method and system |
CN111611367A (en) * | 2020-05-21 | 2020-09-01 | 拾音智能科技有限公司 | Visual question answering method introducing external knowledge |
CN111737458A (en) * | 2020-05-21 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Intention identification method, device and equipment based on attention mechanism and storage medium |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, and training method, device and equipment for a visual dialogue model |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
CN112309528A (en) * | 2020-10-27 | 2021-02-02 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113094484A (en) * | 2021-04-07 | 2021-07-09 | 西北工业大学 | Text visual question-answering implementation method based on heterogeneous graph neural network |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113326933A (en) * | 2021-05-08 | 2021-08-31 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113392253A (en) * | 2021-06-28 | 2021-09-14 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113515615A (en) * | 2021-07-09 | 2021-10-19 | 天津大学 | Visual question-answering method based on capsule self-guide cooperative attention mechanism |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114564958A (en) * | 2022-01-11 | 2022-05-31 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN117274616A (en) * | 2023-09-26 | 2023-12-22 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160342895A1 (en) * | 2015-05-21 | 2016-11-24 | Baidu Usa Llc | Multilingual image question answering |
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Image caption generation method and system fusing visual attention and semantic attention |
CN107766447A (en) * | 2017-09-25 | 2018-03-06 | 浙江大学 | Method for video question answering using a multi-layer attention network mechanism |
CN108108771A (en) * | 2018-01-03 | 2018-06-01 | 华南理工大学 | Image question-answering method based on multi-scale deep learning |
CN108228703A (en) * | 2017-10-31 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image question-answering method, apparatus, system and storage medium |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | Automatic image caption generation method based on multi-modal attention |
US20190073353A1 (en) * | 2017-09-07 | 2019-03-07 | Baidu Usa Llc | Deep compositional frameworks for human-like language acquisition in virtual environments |
CN109472024A (en) * | 2018-10-25 | 2019-03-15 | 安徽工业大学 | Text classification method based on a bidirectional recurrent attention neural network |
CN109712108A (en) * | 2018-11-05 | 2019-05-03 | 杭州电子科技大学 | Visual grounding method based on a diverse and discriminative candidate-box generation network |
CN109829049A (en) * | 2019-01-28 | 2019-05-31 | 杭州一知智能科技有限公司 | Method for solving video question-answering tasks using a knowledge-base progressive spatio-temporal attention network |
Non-Patent Citations (4)
Title |
---|
ASHISH VASWANI,ET AL: "《Attention Is All You Need》", 《ARXIV:1706.03762V5》 * |
KAN CHEN,ET AL: "《ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering》", 《ARXIV:1511.05960V2》 * |
俞俊, ET AL: "Research on Visual Question Answering Techniques", Journal of Computer Research and Development (《计算机研究与发展》) * |
李庆: "Research on Image Question Answering Based on Deep Neural Networks and Attention Mechanisms", China Masters' Theses Full-text Database, Information Science and Technology * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879844B (en) * | 2019-10-25 | 2022-10-14 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110879844A (en) * | 2019-10-25 | 2020-03-13 | 北京大学 | Cross-media reasoning method and system based on heterogeneous interactive learning |
CN110889505A (en) * | 2019-11-18 | 2020-03-17 | 北京大学 | Cross-media comprehensive reasoning method and system for matching image-text sequences |
CN110889505B (en) * | 2019-11-18 | 2023-05-02 | 北京大学 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
CN111598118A (en) * | 2019-12-10 | 2020-08-28 | 中山大学 | Visual question-answering task implementation method and system |
CN111598118B (en) * | 2019-12-10 | 2023-07-07 | 中山大学 | Visual question-answering task implementation method and system |
CN111553372A (en) * | 2020-04-24 | 2020-08-18 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111553372B (en) * | 2020-04-24 | 2023-08-08 | 北京搜狗科技发展有限公司 | Training image recognition network, image recognition searching method and related device |
CN111611367A (en) * | 2020-05-21 | 2020-09-01 | 拾音智能科技有限公司 | Visual question answering method introducing external knowledge |
CN111737458A (en) * | 2020-05-21 | 2020-10-02 | 平安国际智慧城市科技股份有限公司 | Intention identification method, device and equipment based on attention mechanism and storage medium |
CN111737458B (en) * | 2020-05-21 | 2024-05-21 | 深圳赛安特技术服务有限公司 | Attention mechanism-based intention recognition method, device, equipment and storage medium |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN111897939B (en) * | 2020-08-12 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training method, device and equipment for visual dialogue model |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training device and training equipment of visual dialogue model |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
CN112309528A (en) * | 2020-10-27 | 2021-02-02 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN112309528B (en) * | 2020-10-27 | 2023-04-07 | 上海交通大学 | Medical image report generation method based on visual question-answering method |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113010712B (en) * | 2021-03-04 | 2022-12-02 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113094484A (en) * | 2021-04-07 | 2021-07-09 | 西北工业大学 | Text visual question-answering implementation method based on heterogeneous graph neural network |
CN113326933B (en) * | 2021-05-08 | 2022-08-09 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113326933A (en) * | 2021-05-08 | 2021-08-31 | 清华大学 | Attention mechanism-based object operation instruction following learning method and device |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113220859B (en) * | 2021-06-01 | 2024-05-10 | 平安科技(深圳)有限公司 | Question answering method and device based on image, computer equipment and storage medium |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113392253A (en) * | 2021-06-28 | 2021-09-14 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113392253B (en) * | 2021-06-28 | 2023-09-29 | 北京百度网讯科技有限公司 | Visual question-answering model training and visual question-answering method, device, equipment and medium |
CN113515615A (en) * | 2021-07-09 | 2021-10-19 | 天津大学 | Visual question-answering method based on capsule self-guide cooperative attention mechanism |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113792703B (en) * | 2021-09-29 | 2024-02-02 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention depth modular network |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114564958B (en) * | 2022-01-11 | 2023-08-04 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN114564958A (en) * | 2022-01-11 | 2022-05-31 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and medium |
CN117274616A (en) * | 2023-09-26 | 2023-12-22 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
CN117274616B (en) * | 2023-09-26 | 2024-03-29 | 南京信息工程大学 | Multi-feature fusion deep learning service QoS prediction system and prediction method |
Also Published As
Publication number | Publication date |
---|---|
CN110263912B (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263912A (en) | Image question-answering method based on multi-target association depth reasoning | |
CN111047548B (en) | Attitude transformation data processing method and device, computer equipment and storage medium | |
CN112766158B (en) | Multi-task cascading type face shielding expression recognition method | |
CN109299701A (en) | Face age estimation method based on GAN-expanded multi-ethnicity feature collaborative selection | |
CN106778506A (en) | Expression recognition method fusing depth images and multi-channel features | |
CN111681178B (en) | Knowledge distillation-based image defogging method | |
CN112949647B (en) | Three-dimensional scene description method and device, electronic equipment and storage medium | |
CN111161200A (en) | Human body posture migration method based on attention mechanism | |
CN110046550A (en) | Pedestrian's Attribute Recognition system and method based on multilayer feature study | |
CN109712108A (en) | Visual grounding method based on a diverse and discriminative candidate-box generation network | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
Gupta et al. | Rv-gan: Recurrent gan for unconditional video generation | |
CN114581502A (en) | Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium | |
CN117011883A (en) | Pedestrian re-recognition method based on pyramid convolution and transducer double branches | |
CN112906520A (en) | Gesture coding-based action recognition method and device | |
CN114638408A (en) | Pedestrian trajectory prediction method based on spatiotemporal information | |
CN116386102A (en) | Face emotion recognition method based on improved residual convolution network acceptance block structure | |
CN114723784A (en) | Pedestrian motion trajectory prediction method based on domain adaptation technology | |
CN110222568A (en) | Cross-view gait recognition method based on spatiotemporal graphs | |
CN116185182B (en) | Controllable image description generation system and method for fusing eye movement attention | |
CN116311472A (en) | Micro-expression recognition method and device based on multi-level graph convolution network | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
CN117351382A (en) | Video object positioning method and device, storage medium and program product thereof | |
Pu et al. | Differential residual learning for facial expression recognition | |
CN118093840B (en) | Visual question-answering method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||