CN110134774A - It is a kind of based on the image vision Question-Answering Model of attention decision, method and system - Google Patents
- Publication number
- CN110134774A (application CN201910355026.0A)
- Authority
- CN
- China
- Prior art keywords
- decision
- attention
- module
- image
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention discloses an image visual question-answering model, method and system based on attention decision, belonging to the field of open-ended image visual question answering. The model comprises: an information fusion module which, when k = 1, fuses the global image feature g and the question feature vector q to obtain the fused feature vector u_k, and, when k = 2, …, K, fuses u_{k-1} and the image feature vector ṽ_{k-1} to obtain u_k; an attention decision module which receives the fused feature vector u_k, decides an attention box L_k, and sends it to the feature-extraction pooling module; a feature-extraction pooling module which receives the spatial image feature v and the attention box L_{k-1} and obtains the image feature vector ṽ_{k-1}; and an answer reasoning module which receives the fused feature vector u_K and infers the answer to the question. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question can be selected adaptively. The model can be trained end to end, making the learned features more question-specific.
Description
Technical field
The invention belongs to the field of open-ended image visual question answering, and more particularly relates to an image visual question-answering model, method and system based on attention decision.
Background art
The development of deep learning has driven research on many high-level artificial-intelligence tasks, for example visual question answering (VQA). In image visual question answering, the input is a visual image together with an open-ended natural-language question about the image content, and an intelligent system outputs a natural-language answer by understanding the image and the question. Visual question answering permits automatic quantitative evaluation, which effectively tracks progress on the task. Because questions about a picture usually seek specific visual information, the answers to many questions contain only one to three words, so VQA algorithms can be assessed by the number of correctly answered questions. Fig. 1 shows the structure shared by most deep-learning-based visual question-answering models, which mainly comprise the following four modules: (1) a visual-information extraction module, generally a deep convolutional neural network (CNN); representative models include AlexNet, VGGNet, GoogLeNet and ResNet; (2) a question analysis module, generally a deep recurrent neural network (RNN), a long short-term memory network, a gated recurrent unit or a convolutional neural network; (3) a multimodal-information fusion module, common methods being element-wise addition, element-wise multiplication, concatenation and bilinear pooling; (4) an answer reasoning module, generally a multi-layer perceptron.
In the visual question-answering task, the answer is found in the image region relevant to the question. Question-guided image attention mechanisms are therefore an important approach to the task. The main goal of an attention mechanism is to solve the task by using local image features and allowing the model to assign different importance to the features of different regions.
The "soft attention" methods used throughout the prior art assign a weight to every region of the image. However, some image regions are unrelated to the question, and the weights of those regions should be set to zero, whereas the weights assigned by soft-attention methods are difficult to drive to zero. Noise information unrelated to the question is therefore introduced, which affects the final answer decision. On the other hand, some methods reason about the answer from object semantics: a pre-trained object-detection method is used to detect the objects in the image and obtain their feature vectors, and a soft-attention mechanism then assigns an attention weight to each object. But such multi-stage processing cannot be trained end to end, so the object features are not question-specific.
Summary of the invention
In view of the drawbacks of the prior art, the object of the invention is to solve the technical problems that, in the prior art, question-irrelevant visual features affect answer reasoning and the learned object features are not question-specific.
To achieve the above object, in a first aspect, an embodiment of the invention provides an image visual question-answering model based on attention decision, the model comprising:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I; the global image feature g is sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box L_k from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
Specifically, the fused feature vector u_k is obtained as follows:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k = 2, …, K)

where FC_1, FC_2 and FC_3 are fully connected neural networks and the operator [·, ·] denotes concatenation of two vectors.
Specifically, the attention box L_k is decided as follows:

h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})
x = x′ + φ_1,  y = y′ + φ_2,  a = a′ + φ_3,  b = b′ + φ_4

where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is a zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, and φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1. (x′, y′) is the decided attention-box position before adding noise, (a′, b′) the decided attention-box length and width before adding noise, (x, y) the attention-box position after adding noise, and (a, b) the attention-box length and width after adding noise.
Specifically, in the spatial image feature v, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then pooled to obtain the one-dimensional image feature vector ṽ_k.
Specifically, a reinforcement-learning method is used to learn an adaptive attention decision process.
In a second aspect, an embodiment of the invention provides an image visual question-answering method based on attention decision, the method comprising the following steps:
S1. Train the image visual question-answering model based on attention decision according to the first aspect with p training samples; compare whether the inferred answer of a training sample is identical to its label; if so, the score obtained by each attention step of that sample is 1, and otherwise 0.
S2. Construct the loss function L from r_ij; with L as the objective function, optimize the network parameters to obtain the trained image visual question-answering model based on attention decision, where r_ij is the score of the i-th attention step of the j-th sample, j = 1, …, p, i = 1, …, K.
S3. Input the test image I and question Q into the trained image visual question-answering model based on attention decision to obtain the final answer.

Specifically, optimization uses mini-batch stochastic gradient descent.

Specifically, the loss function L is calculated as follows:

L = −Σ_{j=1}^{p} Σ_{i=1}^{K} r_ij · log(π_θ(x, y, a, b))

where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
In a third aspect, an embodiment of the invention provides an image visual question-answering system based on attention decision, which uses the image visual question-answering method based on attention decision described in the second aspect above.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the image visual question-answering method based on attention decision described in the second aspect above.
In general, compared with the prior art, the above technical solutions contemplated by the invention have the following beneficial effects:
1. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question can be selected adaptively. The adjustment of attention over the image is treated as a sequential decision process: image region features are selected one after another, the selected features are fused, and the answer is inferred. The model can be trained end to end, making the learned features more question-specific.
2. Each time, the invention selects one image region relevant to the question and records the information of that region while ignoring information unrelated to the question, which effectively excludes interference.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of a prior-art visual question-answering model based on deep learning;
Fig. 2 is a structural schematic diagram of an image visual question-answering model based on reinforcement learning provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the feature-extraction pooling process provided by an embodiment of the invention;
Fig. 4 is a flow chart of an image visual question-answering method based on reinforcement learning provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the attention decision process provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
As shown in Fig. 2, an image visual question-answering model based on attention decision comprises:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I; the global image feature g is sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
1. Visual-information extraction module
In the invention, the visual-information extraction module uses a convolutional neural network to extract the global image feature g and the spatial image feature v of the image I.
The input image is preprocessed: the input image I is unified into a 3-channel image I′ of size 244 × 244.
The deeper the layers of a convolutional neural network, the stronger their class-discriminative ability, while the feature vectors output by the convolutional layers retain spatial distribution information. Therefore, the feature map v_{d×m×n} of the last convolutional layer and the feature g of the last fully connected layer are extracted as the image features:

g = CNN_fc(I′)
v_{d×m×n} = CNN_conv(I′)

where d denotes the number of convolution kernels and m × n the spatial size of the image feature map. The embodiment of the invention preferably uses VGG16 with m = n = 14 and d = 2048.
2. Question analysis module
In the invention, the question analysis module extracts the question feature vector q of the question Q using GloVe word embeddings and a gated-recurrent-unit network.
First, each word of the question is one-hot encoded as c_i. If the length exceeds N, the excess is deleted; if the length falls short, null tokens pad the question to the specified length. The embodiment of the invention preferably uses N = 14.

Q′ = {c_1, c_2, c_3, …, c_N}

Using pre-trained 300-dimensional GloVe word vectors, the one-hot encoding of each word is converted into a 300-dimensional word vector, giving Q_we:

Q_we = WE(Q) = {w_1, w_2, …, w_N} ∈ R^{N×300}

where WE denotes the function that converts words into word vectors and w_i is the word vector corresponding to each word.
The w_i are input into a gated-recurrent-unit network (GRU) in order, yielding the representation vector q:

q = GRU(Q_we)
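As an illustration of the encoding step above, the following sketch runs a minimal GRU cell over a padded sequence of N = 14 word vectors. The hidden size, the randomly initialised weight matrices, and the random vectors standing in for pre-trained GloVe embeddings are all assumptions made for this example, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 300, 16  # word-vector size from the patent; hidden size H is illustrative

# Randomly initialised GRU parameters stand in for learned weights.
Wz, Uz = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wr, Ur = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wh, Uh = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, w):
    z = sigmoid(Wz @ w + Uz @ h)              # update gate
    r = sigmoid(Wr @ w + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ w + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

N = 14                                  # padded question length, as in the patent
Q_we = rng.normal(size=(N, D))          # stand-in for the GloVe vectors w_1..w_N
h = np.zeros(H)
for w in Q_we:                          # feed the words in order
    h = gru_step(h, w)
q = h                                   # question feature vector q = GRU(Q_we)
```

The final hidden state plays the role of the representation vector q sent to the information fusion module.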
3. Information fusion module
In the invention, the information fusion module consists of the fully connected neural networks FC_1, FC_2 and FC_3; their activation function is ReLU, but their weight parameters differ. The role of a fully connected layer is to map the data nonlinearly and extract abstract features. Here, when k = 1, FC_1 processes the global image feature g and FC_2 the question feature vector q; when k > 1, FC_1 processes the image feature vector and FC_2 the fused feature vector; FC_3 fuses the two input features to obtain the fused feature:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k > 1)

where the operator [·, ·] denotes concatenation of two vectors.
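The fusion rule above can be sketched as follows in NumPy. The layer sizes and the randomly initialised weights are illustrative stand-ins for the trained FC_1, FC_2 and FC_3, and the question feature is assumed (for simplicity) to have the same size as the fused feature:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

D_img, D_u = 2048, 32                      # illustrative sizes
W1 = rng.normal(0, 0.05, (D_u, D_img))     # FC1: image branch (g or v~_{k-1})
W2 = rng.normal(0, 0.05, (D_u, D_u))       # FC2: question / fused-feature branch
W3 = rng.normal(0, 0.05, (D_u, 2 * D_u))   # FC3: fuses the concatenation

def fuse(img_feat, other_feat):
    # u_k = FC3([FC1(image feature), FC2(question or previous fused feature)])
    return relu(W3 @ np.concatenate([relu(W1 @ img_feat),
                                     relu(W2 @ other_feat)]))

g = rng.normal(size=D_img)        # global image feature (k = 1)
q = rng.normal(size=D_u)          # question feature vector
u1 = fuse(g, q)                   # first fusion
v_tilde = rng.normal(size=D_img)  # region feature from the pooling module
u2 = fuse(v_tilde, u1)            # k = 2: fuse with the previous fused feature
```

The same three layers are reused at every step k; only their inputs change, matching the description above.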
4. Attention decision module
In the invention, the attention decision module uses a recurrent neural network RNN and the layers FC_4, FC_5, FC_6 and FC_7 to decide the attention-box position (x, y) and length and width (a, b). h_{agent,0} is a zero vector.

h_{agent,k+1} = RNN(h_{agent,k}, u_k)

where h_{agent,k} is the current internal history state.

x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})

where the activation function of FC_4, FC_5, FC_6 and FC_7 is the hyperbolic tangent (tanh), but their weight parameters differ.

x = x′ + φ_1
y = y′ + φ_2
a = a′ + φ_3
b = b′ + φ_4

φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1. Adding random noise to the decision result increases the search capability of the model and helps find the optimal solution. −1 ≤ x ≤ 1, −1 ≤ y ≤ 1, 0 ≤ a ≤ 1, 0 ≤ b ≤ 1.
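The decision step can be sketched as below. The RNN update, the four output heads and all sizes are illustrative assumptions (the patent does not fix the RNN variant or layer widths); clipping the noisy outputs into the stated ranges is also an assumption about how the bounds are enforced:

```python
import numpy as np

rng = np.random.default_rng(2)
U_dim, H = 32, 24  # fused-feature and RNN-state sizes (illustrative)

W_in = rng.normal(0, 0.1, (H, U_dim))
W_h  = rng.normal(0, 0.1, (H, H))
heads = rng.normal(0, 0.1, (4, H))  # rows play the roles of FC4..FC7

def decide(h_agent, u):
    h_next = np.tanh(W_in @ u + W_h @ h_agent)  # simple RNN state update
    x_, y_, a_, b_ = np.tanh(heads @ h_next)    # pre-noise box parameters
    noise = rng.normal(0.0, 1.0, size=4)        # phi_1..phi_4 ~ N(0, 1)
    x, y, a, b = np.array([x_, y_, a_, b_]) + noise
    # keep the noisy box inside the stated ranges (an assumption)
    x, y = np.clip([x, y], -1.0, 1.0)
    a, b = np.clip([a, b], 0.0, 1.0)
    return h_next, (x, y, a, b)

h0 = np.zeros(H)                 # h_agent,0 is a zero vector
u1 = rng.normal(size=U_dim)      # fused feature from the fusion module
h1, box = decide(h0, u1)         # decided attention box (x, y, a, b)
```

The Gaussian noise added to (x′, y′, a′, b′) is what makes the policy stochastic, which the reinforcement-learning training described later relies on.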
5. Feature-extraction pooling module
In the invention, the feature-extraction pooling module selects the image feature vector ṽ_k from the spatial image feature v according to the fused feature vector u_k. As shown in Fig. 3, in the spatial image feature v_{2048×14×14}, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then mean-pooled to obtain a one-dimensional feature vector. The feature is obtained as follows:

v_{2048×a×b} = selector(v_{2048×m×n}, x, y, a, b)
ṽ_k = AP(v_{2048×a×b})

where AP (average pooling) denotes the mean-pooling operation.
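A minimal sketch of the selector-plus-pooling step is shown below. How the normalised coordinates (x, y) ∈ [−1, 1] and sizes (a, b) ∈ [0, 1] map onto the 14 × 14 grid is not specified in the patent, so the mapping here is an assumption made for the example:

```python
import numpy as np

def extract_pool(v, x, y, a, b):
    """Select an a-by-b box centred at (x, y) from v (d, m, n), then mean-pool."""
    d, m, n = v.shape
    # map normalised coordinates/sizes onto the feature-map grid (an assumption)
    cx = int((x + 1) / 2 * (n - 1))
    cy = int((y + 1) / 2 * (m - 1))
    half_w = max(1, int(a * n / 2))
    half_h = max(1, int(b * m / 2))
    r0, r1 = max(0, cy - half_h), min(m, cy + half_h)
    c0, c1 = max(0, cx - half_w), min(n, cx + half_w)
    region = v[:, r0:r1, c0:c1]       # selector(v, x, y, a, b)
    return region.mean(axis=(1, 2))   # average pooling -> 1-D feature vector

rng = np.random.default_rng(3)
v = rng.normal(size=(2048, 14, 14))   # stand-in for the spatial image feature v
v_tilde = extract_pool(v, 0.0, 0.0, 0.5, 0.5)
```

The pooled vector ṽ_k has one entry per channel and is what the fusion module consumes at step k + 1.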
6. Answer reasoning module
In the invention, the answer reasoning module uses a multi-layer perceptron (MLP) to infer the answer to the question Q. The reasoning process is as follows:

h = FC_9(FC_8(u_3))
î = argmax_i h_i

where h is the candidate-answer score vector, i is the index of a candidate answer, h_i is the i-th score in the vector h, and î is the index of the highest-scoring answer among all answers; the model looks up the answer corresponding to î in the candidate-answer set. Answer reasoning uses a multi-layer perceptron comprising the two fully connected layers FC_8 and FC_9. The activation function of FC_8 is ReLU; FC_9 has no activation function, only the linear mapping.
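The two-layer scoring head can be sketched as follows. The layer sizes, the random weights and the toy five-answer candidate set are assumptions for illustration; a real VQA answer set would contain thousands of entries:

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda x: np.maximum(x, 0.0)

U_dim, H, A = 32, 64, 5               # fused size, hidden size, answer-set size
W8 = rng.normal(0, 0.1, (H, U_dim))   # FC8 (ReLU activation)
W9 = rng.normal(0, 0.1, (A, H))       # FC9 (linear, no activation)
answers = ["yes", "no", "red", "two", "baseball"]  # toy candidate set

u_K = rng.normal(size=U_dim)          # final fused feature vector
h = W9 @ relu(W8 @ u_K)               # candidate-answer score vector
i_star = int(np.argmax(h))            # index of the highest-scoring answer
answer = answers[i_star]              # look the answer up in the candidate set
```

Treating answer selection as classification over a fixed candidate set is what makes the one-to-three-word answers mentioned in the background tractable.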
An adaptive attention decision process is learned with reinforcement learning. The environment first gives the fused feature currently used to infer the answer. From the current fused feature, the agent judges which visual information is still needed, that is, it gives the position from which visual information should be acquired next. After obtaining the position, the environment extracts the visual information at that position from the spatial feature, fuses it with the current fused information into a new fused feature, and hands it to the agent again. After several iterations, the last fused feature is used to infer the answer. If the answer is correct, a reward of 1 is given to the current decision; otherwise the reward is 0. According to the reward, the agent changes its attention-decision strategy.
Environment: the spatial image feature, the feature-extraction pooling module and the information fusion module.
Agent: the attention decision module.
State: the fused feature.
Action: the position (x, y) and the length a and width b of the attention box.
Policy: the mapping from the current fused feature to the position (x, y) and the length a and width b of the attention box. The decision function π_θ is the process that computes the attention, and θ is the parameter of the function.
Reward: the reward function r_i ∈ {0, 1}; when the fused feature allows the answer to be inferred, r_i is 1, otherwise r_i is 0.
For a question about an image, the attended position usually has to be adjusted several times before the series of information needed to answer correctly is found. This process is regarded as a sequential decision process: image region features are selected one after another and then fused to infer the answer. The invention uses a reinforcement-learning method to learn the attention positioning process on the image. Each time, one image region relevant to the question is selected and the information of that region is recorded.
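The reward-weighted objective described above is a REINFORCE-style policy gradient. The sketch below computes such a loss for one episode, assuming (as the noise model in the decision module suggests) that each decision (x, y, a, b) is drawn from a unit-variance Gaussian centred on the network outputs; the concrete numbers are illustrative:

```python
import numpy as np

def reinforce_loss(log_probs, rewards):
    """REINFORCE objective: L = -sum_i r_i * log pi_theta(decision_i).

    log_probs[i] is log pi_theta(x, y, a, b) for the i-th attention decision;
    rewards[i] is 1 if the final answer matched the label, else 0.
    """
    return -np.sum(np.asarray(rewards) * np.asarray(log_probs))

def gaussian_log_prob(action, mean, std=1.0):
    # decisions are sampled from N(mean, std^2), so the log-density factorises
    a, mu = np.asarray(action), np.asarray(mean)
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2)
                        - (a - mu) ** 2 / (2 * std**2)))

# one episode with K = 3 attention decisions, all rewarded (correct answer)
lp = [gaussian_log_prob([0.1, 0.2, 0.5, 0.5], [0.0, 0.0, 0.4, 0.6])] * 3
loss_correct = reinforce_loss(lp, [1, 1, 1])
loss_wrong = reinforce_loss(lp, [0, 0, 0])  # zero reward -> zero contribution
```

With reward 0 the episode contributes nothing to the gradient, which is exactly the mechanism by which question-irrelevant position boxes become less likely to be chosen.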
As shown in Fig. 4, an image visual question-answering method based on attention decision comprises the following steps:
S1. Train the image visual question-answering model based on attention decision with p training samples; each training sample goes through K attention decisions to obtain an inferred answer; compare whether the inferred answer is identical to the label of the training sample; if so, the score obtained by each of the K attention steps is 1, and otherwise 0.
S2. With the loss function L as the objective function, optimize the network parameters by mini-batch stochastic gradient descent to obtain the trained image visual question-answering model based on reinforcement learning, where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b), and r_ij is the score of the i-th attention step of the j-th sample.
S3. Input the test image I and question Q into the trained image visual question-answering model based on reinforcement learning to obtain the final answer.
The label of a training sample represents the true answer. If the information selected by the position box is unrelated to the question, the obtained answer is necessarily wrong; little reward is therefore obtained, and the probability of choosing that position next time decreases. Under this mechanism, information unrelated to the question participates less in the computation.
As shown in Fig. 5, the input image I is a scene of an athlete playing baseball, and the input question Q is "What is the player's number?". The figure shows how the attention changes while the question is being answered: for "What is the player's number?", the model first locates the athlete, then locates the number on the athlete's uniform, and finally gives the answer "22".
The above are only preferred specific embodiments of the application, but the scope of protection of the application is not limited thereto. Any change or substitution that can easily be conceived by a person familiar with the art within the technical scope disclosed by the application shall be covered by the scope of protection of the application. Therefore, the scope of protection of the application shall be subject to the scope of protection of the claims.
Claims (10)
1. An image visual question-answering model based on attention decision, characterized in that the model comprises:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I, the global image feature g being sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box L_k from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
2. The model of claim 1, characterized in that the fused feature vector u_k is obtained as follows:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k = 2, …, K)

where FC_1, FC_2 and FC_3 are fully connected neural networks and the operator [·, ·] denotes concatenation of two vectors.
3. The model of claim 1 or 2, characterized in that the attention box L_k is decided as follows:

h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})
x = x′ + φ_1,  y = y′ + φ_2,  a = a′ + φ_3,  b = b′ + φ_4

where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is a zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1, (x′, y′) is the decided attention-box position before adding noise, (a′, b′) the decided attention-box length and width before adding noise, (x, y) the attention-box position after adding noise, and (a, b) the attention-box length and width after adding noise.
4. The model of claim 3, characterized in that, in the spatial image feature v, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then pooled to obtain the one-dimensional image feature vector ṽ_k.
5. The model of any one of claims 1 to 4, characterized in that a reinforcement-learning method is used to learn an adaptive attention decision process.
6. An image visual question-answering method based on attention decision, characterized in that the method comprises the following steps:
S1. train the image visual question-answering model based on attention decision of any one of claims 1 to 5 with p training samples; compare whether the inferred answer of a training sample is identical to its label; if so, the score obtained by each attention step of that sample is 1, and otherwise 0;
S2. construct the loss function L from r_ij; with L as the objective function, optimize the network parameters to obtain the trained image visual question-answering model based on attention decision, where r_ij is the score of the i-th attention step of the j-th sample, j = 1, …, p, i = 1, …, K;
S3. input the test image I and question Q into the trained image visual question-answering model based on attention decision to obtain the final answer.
7. The method of claim 6, characterized in that optimization uses mini-batch stochastic gradient descent.
8. The method of claim 6, characterized in that the loss function L is calculated as follows:

L = −Σ_{j=1}^{p} Σ_{i=1}^{K} r_ij · log(π_θ(x, y, a, b))

where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
9. An image visual question-answering system based on attention decision, characterized in that the system uses the image visual question-answering method based on attention decision of any one of claims 6 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium; when executed by a processor, the computer program implements the image visual question-answering method based on attention decision of any one of claims 6 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355026.0A CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355026.0A CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134774A true CN110134774A (en) | 2019-08-16 |
CN110134774B CN110134774B (en) | 2021-02-09 |
Family
ID=67575681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355026.0A Expired - Fee Related CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134774B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107679582A (en) * | 2017-10-20 | 2018-02-09 | 深圳市唯特视科技有限公司 | A kind of method that visual question and answer are carried out based on multi-modal decomposition model |
CN108228703A (en) * | 2017-10-31 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image answering method, device, system and storage medium |
CN108170816A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of intelligent vision Question-Answering Model based on deep neural network |
CN108959396A (en) * | 2018-06-04 | 2018-12-07 | 众安信息技术服务有限公司 | Machine reading model training method and device, answering method and device |
CN108920587A (en) * | 2018-06-26 | 2018-11-30 | 清华大学 | Merge the open field vision answering method and device of external knowledge |
CN109255359A (en) * | 2018-09-27 | 2019-01-22 | 南京邮电大学 | A kind of vision question and answer problem-solving approach based on Complex Networks Analysis method |
Non-Patent Citations (6)
Title |
---|
QI WU ET AL: "Image Captioning and Visual Question Answering Based on Attributes and External Knowledge", IEEE Transactions on Pattern Analysis and Machine Intelligence * |
SHENG ZHANG ET AL: "Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection", IEEE Access * |
XIAO LIN ET AL: "Leveraging Visual Question Answering for Image-Caption Ranking", European Conference on Computer Vision * |
LIU HAIBIN: "Research on Visual Question Answering Methods Based on Visual Attention", China Master's Theses Full-text Database, Information Science and Technology * |
LI YAN: "Research on Image Retrieval Methods Based on Visual Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology * |
GAO JINGJING ET AL: "Research on Visual Attention Models Applied to Image Retrieval", Measurement & Control Technology * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598573A (en) * | 2019-08-21 | 2019-12-20 | 中山大学 | Visual problem common sense inference model and method based on multi-domain heterogeneous graph guidance |
CN110598573B (en) * | 2019-08-21 | 2022-11-25 | 中山大学 | Visual problem common sense reasoning model and method based on multi-domain heterogeneous graph guidance |
CN110704668B (en) * | 2019-09-23 | 2022-11-04 | 北京影谱科技股份有限公司 | Grid-based collaborative attention VQA method and device |
CN110704668A (en) * | 2019-09-23 | 2020-01-17 | 北京影谱科技股份有限公司 | Grid-based collaborative attention VQA method and apparatus |
CN110990630A (en) * | 2019-11-29 | 2020-04-10 | 清华大学 | Video question-answering method based on graph modeling visual information and guided by using questions |
CN110990630B (en) * | 2019-11-29 | 2022-06-24 | 清华大学 | Video question-answering method based on graph modeling visual information and guided by using questions |
CN111260228A (en) * | 2020-01-18 | 2020-06-09 | 西安科技大学 | Multi-stage task system performance evaluation method and device |
CN111260228B (en) * | 2020-01-18 | 2023-06-23 | 西安科技大学 | Multi-stage task system performance evaluation method and device |
CN111325243A (en) * | 2020-02-03 | 2020-06-23 | 天津大学 | Visual relation detection method based on regional attention learning mechanism |
CN111814843A (en) * | 2020-03-23 | 2020-10-23 | 同济大学 | End-to-end training method and application of image feature module in visual question-answering system |
CN111814843B (en) * | 2020-03-23 | 2024-02-27 | 同济大学 | End-to-end training method and application of image feature module in visual question-answering system |
CN111539292A (en) * | 2020-04-17 | 2020-08-14 | 中山大学 | Action decision model and method for presenting scene question-answering task |
CN111539292B (en) * | 2020-04-17 | 2023-07-07 | 中山大学 | Action decision model and method for question-answering task with actualized scene |
CN111754784A (en) * | 2020-06-23 | 2020-10-09 | 高新兴科技集团股份有限公司 | Attention mechanism-based vehicle main and sub brand identification method of multilayer network |
CN111754784B (en) * | 2020-06-23 | 2022-05-24 | 高新兴科技集团股份有限公司 | Method for identifying main and sub brands of vehicle based on multi-layer network of attention mechanism |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN112100346B (en) * | 2020-08-28 | 2021-07-20 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112100346A (en) * | 2020-08-28 | 2020-12-18 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN111831813A (en) * | 2020-09-21 | 2020-10-27 | 北京百度网讯科技有限公司 | Dialog generation method, dialog generation device, electronic equipment and medium |
CN113010656A (en) * | 2021-03-18 | 2021-06-22 | 广东工业大学 | Visual question-answering method based on multi-mode fusion and structural control |
CN113222026B (en) * | 2021-05-18 | 2022-11-11 | 合肥工业大学 | Method, system and server for visual question answering of locomotive depot scene |
CN113205507B (en) * | 2021-05-18 | 2023-03-10 | 合肥工业大学 | Visual question answering method, system and server |
CN113222026A (en) * | 2021-05-18 | 2021-08-06 | 合肥工业大学 | Method, system and server for visual question answering of locomotive depot scene |
CN113205507A (en) * | 2021-05-18 | 2021-08-03 | 合肥工业大学 | Visual question answering method, system and server |
CN113420833A (en) * | 2021-07-21 | 2021-09-21 | 南京大学 | Visual question-answering method and device based on question semantic mapping |
CN113420833B (en) * | 2021-07-21 | 2023-12-26 | 南京大学 | Visual question answering method and device based on semantic mapping of questions |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114417044A (en) * | 2022-01-19 | 2022-04-29 | 中国科学院空天信息创新研究院 | Image question and answer method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110134774B (en) | 2021-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134774A (en) | Image visual question-answering model, method and system based on attention decision | |
CN110414432B (en) | Training method of object recognition model, object recognition method and corresponding device | |
Luus et al. | Multiview deep learning for land-use classification | |
CN110598029A (en) | Fine-grained image classification method based on attention transfer mechanism | |
CN109902798A (en) | Training method and device for deep neural networks | |
CN108846314A (en) | Food material identification system and identification method based on deep learning | |
CN112052886A (en) | Intelligent human action pose estimation method and device based on a convolutional neural network | |
CN108510194A (en) | Risk control model training method, risk identification method, apparatus, device and medium | |
CN106909924A (en) | Fast remote sensing image retrieval method based on deep saliency | |
CN108304826A (en) | Facial expression recognizing method based on convolutional neural networks | |
CN111666919B (en) | Object identification method and device, computer equipment and storage medium | |
CN109948526A (en) | Image processing method and device, detection device and storage medium | |
CN110070107A (en) | Object identification method and device | |
CN110222770A (en) | Visual question-answering method based on a combinational relation attention network | |
CN113190688B (en) | Complex network link prediction method and system based on logical reasoning and graph convolution | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN106991666A (en) | Disease image recognition method suitable for multi-size image information | |
CN111681178A (en) | Knowledge distillation-based image defogging method | |
CN114463675B (en) | Underwater fish group activity intensity identification method and device | |
CN113360621A (en) | Scene text visual question-answering method based on modal inference graph neural network | |
CN112527993A (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN106022293A (en) | Pedestrian re-identification method using an evolutionary algorithm based on adaptive niche sharing | |
Hooshyar et al. | ImageLM: Interpretable image-based learner modelling for classifying learners’ computational thinking | |
Radwan | Neutrosophic applications in e-learning: Outcomes, challenges and trends | |
CN115909027A (en) | Situation estimation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210209 ||