CN110134774A - Image visual question-answering model, method and system based on attention decision - Google Patents

Image visual question-answering model, method and system based on attention decision

Info

Publication number
CN110134774A
Authority
CN
China
Prior art keywords
decision
attention
module
image
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910355026.0A
Other languages
Chinese (zh)
Other versions
CN110134774B (en)
Inventor
陈进才
张胜
卢萍
赵伟
马亚雄
王少兵
黄佳宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910355026.0A
Publication of CN110134774A
Application granted
Publication of CN110134774B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image visual question-answering model, method and system based on attention decision, belonging to the field of open-ended image visual question answering. The model includes: an information fusion module which, when k=1, fuses the global image feature g and the question feature vector q to obtain the fusion feature vector u_k, and which, when k=2,...,K, fuses the fusion feature vector u_{k-1} and the image feature vector to obtain the fusion feature vector u_k; an attention decision module, which receives the fusion feature vector u_k, decides an attention box L_k, and sends it to the feature extraction pooling module; a feature extraction pooling module, which receives the spatial image feature v and the attention box L_{k-1} and obtains the image feature vector; and an answer reasoning module, which receives the fusion feature vector u_K and infers the answer to the question. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question are selected adaptively. The model can be trained end to end, so that the learned features are more specific to the question.

Description

Image visual question-answering model, method and system based on attention decision
Technical field
The invention belongs to the field of open-ended image visual question answering, and more particularly relates to an image visual question-answering model, method and system based on attention decision.
Background
The development of deep learning has driven research on many high-level artificial intelligence tasks, such as visual question answering (VQA). In image VQA, the input is an image together with an open-ended natural-language question about the image content, and an intelligent system outputs a natural-language answer by understanding both the image and the question. VQA can be evaluated automatically and quantitatively, which makes progress on the task easy to track. Because questions about a picture usually seek specific visual information, the answer to many questions contains only one to three words, so a VQA algorithm can be evaluated by the number of questions it answers correctly. Fig. 1 shows the structure of most deep-learning-based VQA models, which mainly comprise the following four modules: (1) a visual information extraction module, generally a deep convolutional neural network (CNN), with representative models including AlexNet, VGGNet, GoogLeNet and ResNet; (2) a question analysis module, generally a deep recurrent neural network (RNN), a long short-term memory network, a gated recurrent unit or a convolutional neural network; (3) a multimodal information fusion module, for which common methods include element-wise addition, element-wise multiplication, concatenation and bilinear pooling; (4) an answer reasoning module, generally a multi-layer perceptron.
In a VQA task, the answer is found in the image regions relevant to the question, so question-guided image attention mechanisms are an important approach to the task. The main goal of an attention mechanism is to solve the task by using local image features and letting the model assign different importance to the features of different regions.
The "soft attention" method used throughout the prior art assigns a weight to every region of the image. However, some regions of the image are unrelated to the question. Their weights should be set to zero, yet the weights assigned by soft attention rarely converge to zero, so noise unrelated to the question is introduced and affects the final answer decision. In addition, in order to reason about the answer from object semantics, some methods use a pre-trained object detector to detect the objects in the image and obtain their feature vectors, and then use soft attention to assign an attention weight to each object. Such multi-stage processing cannot be trained end to end, so the object features are not specific to the question.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to solve the technical problems that, in the prior art, visual features unrelated to the question affect answer reasoning and the learned object features are not specific to the question.
To achieve the above object, in a first aspect, an embodiment of the invention provides an image visual question-answering model based on attention decision, the model including:
A visual information extraction module, for extracting the global image feature g and the spatial image feature v of image I; the global image feature g is sent to the information fusion module, and the spatial image feature v is sent to the feature extraction pooling module;
A question analysis module, for extracting the question feature vector q of question Q and sending it to the information fusion module;
An information fusion module which, when k=1, receives and fuses the global image feature g from the visual information extraction module and the question feature vector q from the question analysis module to obtain the fusion feature vector u_k; or, when k=2,...,K, receives and fuses the fusion feature vector u_{k-1} and the image feature vector from the feature extraction pooling module to obtain the fusion feature vector u_k; when k=1,...,K-1, the fusion feature vector u_k is sent to the attention decision module, and when k=K, the fusion feature vector u_k is sent to the answer reasoning module, where k denotes the fusion index and K denotes the total number of fusions;
An attention decision module, for receiving the fusion feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature extraction pooling module;
A feature extraction pooling module, for receiving the spatial image feature v from the visual information extraction module and the attention box L_k from the attention decision module, and obtaining the image feature vector;
An answer reasoning module, for receiving the fusion feature vector u_K from the information fusion module and inferring the answer to question Q.
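Specifically, the fusion feature vector u_k is obtained in the following manner:
u_1 = FC_3([FC_1(g), FC_2(q)]), for k=1
u_k = FC_3([FC_1(\tilde{v}_{k-1}), FC_2(u_{k-1})]), for k=2,...,K
where FC_1, FC_2 and FC_3 are fully connected neural networks, the operator [,] denotes the concatenation of two vectors, and \tilde{v}_{k-1} denotes the image feature vector returned by the feature extraction pooling module.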
Specifically, the attention box L_k is decided as follows:
h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x' = FC_4(h_{agent,k+1})
y' = FC_5(h_{agent,k+1})
a' = FC_6(h_{agent,k+1})
b' = FC_7(h_{agent,k+1})
x = x' + φ_1
y = y' + φ_2
a = a' + φ_3
b = b' + φ_4
where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is the zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, φ_1, φ_2, φ_3 and φ_4 are random numbers each drawn from a normal distribution with mean 0 and variance 1, (x', y') is the decided attention box position before noise is added, (a', b') is the decided attention box length and width before noise is added, (x, y) is the decided attention box position after noise is added, and (a, b) is the decided attention box length and width after noise is added.
Specifically, in the spatial image feature v, the features of the rectangular region of length a and width b centered on (x, y) are selected and then pooled to obtain a one-dimensional image feature vector.
Specifically, the adaptive attention decision process is learned by reinforcement learning.
In a second aspect, an embodiment of the invention provides an image visual question-answering method based on attention decision, the method including the following steps:
S1. Train the attention-decision-based image visual question-answering model of the first aspect with p training samples, and compare the inferred answer for each training sample with the label of that sample; if they are the same, the score of every attention step of that training sample is 1, otherwise it is 0;
S2. Construct a loss function L based on r_{ij}, take the loss function L as the objective function, and optimize the network parameters to obtain the trained attention-decision-based image visual question-answering model, where r_{ij} is the score of the i-th attention step of the j-th sample, j=1,...,p, i=1,...,K;
S3. Input the image I to be tested and the question Q into the trained attention-decision-based image visual question-answering model to obtain the final result.
Specifically, mini-batch stochastic gradient descent is used for optimization.
Specifically, the loss function L is calculated as follows:
where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
In a third aspect, an embodiment of the invention provides an image visual question-answering system based on attention decision, which uses the attention-decision-based image visual question-answering method of the second aspect.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the attention-decision-based image visual question-answering method of the second aspect.
In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:
1. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question are selected adaptively. The adjustment of attention over the image is treated as a sequential decision process: image region features are selected one after another, these features are then fused, and the answer is inferred. The model can be trained end to end, so that the learned features are more specific to the question.
2. Each time, the invention selects one region of the image relevant to the question and records the information in that region while ignoring information unrelated to the question, which effectively excludes interfering information.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a prior-art visual question-answering model based on deep learning;
Fig. 2 is a schematic structural diagram of an image visual question-answering model based on reinforcement learning provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the feature extraction pooling process provided by an embodiment of the invention;
Fig. 4 is a flow chart of an image visual question-answering method based on reinforcement learning provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the attention decision process provided by an embodiment of the invention.
Detailed description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Fig. 2, an image visual question-answering model based on attention decision includes:
A visual information extraction module, for extracting the global image feature g and the spatial image feature v of image I; the global image feature g is sent to the information fusion module, and the spatial image feature v is sent to the feature extraction pooling module;
A question analysis module, for extracting the question feature vector q of question Q and sending it to the information fusion module;
An information fusion module which, when k=1, receives and fuses the global image feature g from the visual information extraction module and the question feature vector q from the question analysis module to obtain the fusion feature vector u_k; or, when k=2,...,K, receives and fuses the fusion feature vector u_{k-1} and the image feature vector from the feature extraction pooling module to obtain the fusion feature vector u_k; when k=1,...,K-1, the fusion feature vector u_k is sent to the attention decision module, and when k=K, the fusion feature vector u_k is sent to the answer reasoning module, where k denotes the fusion index and K denotes the total number of fusions;
An attention decision module, for receiving the fusion feature vector u_k from the information fusion module, deciding an attention box, and sending it to the feature extraction pooling module;
A feature extraction pooling module, for receiving the spatial image feature v from the visual information extraction module and the attention box from the attention decision module, and obtaining the image feature vector;
An answer reasoning module, for receiving the fusion feature vector u_K from the information fusion module and inferring the answer to question Q.
1. Visual information extraction module
In the invention, the visual information extraction module uses a convolutional neural network to extract the global image feature g and the spatial image feature v of image I.
The input image is preprocessed: the input image I is converted into a 3-channel image I' of size 244*244.
The deeper a feature layer of a convolutional neural network, the stronger its class discrimination ability, and the feature maps output by convolutional layers retain spatial distribution information. Therefore, the feature map v_{d×m×n} of the last convolutional layer and the feature g of the last fully connected layer are extracted as image features.
g = CNN_fc(I')
v_{d×m×n} = CNN_conv(I')
where d is the number of convolution kernels and m×n is the spatial size of the image feature map. The embodiment of the invention preferably uses VGG16, with m=n=14 and d=2048.
2. Question analysis module
In the invention, the question analysis module uses GloVe word embeddings and a gated recurrent unit network to extract the question feature vector q of question Q.
First, each word of the question is one-hot encoded as c_i. If the question is longer than N, the excess part is deleted; if it is shorter, it is padded with null to the specified length. The embodiment of the invention preferably uses N=14.
Q' = {c_1, c_2, c_3, ..., c_N}
Using pre-trained 300-dimensional GloVe word vectors, the one-hot encoding of each word is converted into a 300-dimensional word vector, giving Q_we.
Q_we = WE(Q) = {w_1, w_2, ..., w_N} ∈ R^{N×300}
where WE denotes the function that converts words into word vectors, and w_i is the word vector of the i-th word.
The w_i are fed in order into a gated recurrent unit (GRU) network to obtain the representation vector q.
q = GRU(Q_we)
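The truncation and padding step described above can be sketched as follows. This is only an illustration: the padding token name `NULL_TOKEN` and the helper `fix_length` are hypothetical, and the whitespace tokenisation is an assumption, since the text only says that questions are cut or filled with null to length N.

```python
from typing import List

NULL_TOKEN = "<null>"  # hypothetical padding symbol; the text only says "null"


def fix_length(tokens: List[str], n: int = 14) -> List[str]:
    """Cut the question to n tokens, or right-pad it with NULL_TOKEN up to n."""
    clipped = tokens[:n]
    return clipped + [NULL_TOKEN] * (n - len(clipped))


# Whitespace tokenisation is assumed here purely for illustration.
question = "what is the player 's number ?".split()
padded = fix_length(question)
```

Every question then has the same length N before the embedding lookup, which is what lets a batch of questions be processed by the GRU as one fixed-size array.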
3. Information fusion module
In the invention, the information fusion module consists of the fully connected neural networks FC_1, FC_2 and FC_3. Their activation functions are all ReLU, but their weight parameters differ. A fully connected layer applies a nonlinear mapping to the data and extracts abstract features. Here, when k=1, FC_1 processes the global image feature g and FC_2 processes the question feature vector q; when k>1, FC_1 processes the image feature vector and FC_2 processes the fusion feature vector; FC_3 fuses the two input features to obtain the fusion feature.
u_1 = FC_3([FC_1(g), FC_2(q)]), for k=1
u_k = FC_3([FC_1(\tilde{v}_{k-1}), FC_2(u_{k-1})]), for k>1
where the operator [,] denotes the concatenation of two vectors and \tilde{v}_{k-1} denotes the image feature vector returned by the feature extraction pooling module.
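A minimal sketch of this fusion step, written in plain Python. The layer sizes, the random weights and the helper names `make_fc` and `fuse` are assumptions made for illustration; the text only specifies three fully connected networks with ReLU activations and concatenation of the two branches.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_fc(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Build a toy fully connected layer with ReLU and fixed random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def layer(x: Vector) -> Vector:
        pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
        return [max(0.0, p) for p in pre]  # ReLU activation

    return layer


DIM = 4  # toy size shared by g, q, the pooled image vector and u_k (assumption)
fc1, fc2, fc3 = make_fc(DIM, DIM, 1), make_fc(DIM, DIM, 2), make_fc(2 * DIM, DIM, 3)


def fuse(visual: Vector, textual: Vector) -> Vector:
    """Return FC3([FC1(visual), FC2(textual)]); list '+' plays the role of [,]."""
    return fc3(fc1(visual) + fc2(textual))


g = [0.2, -0.1, 0.4, 0.3]        # stand-in global image feature
q = [0.5, 0.1, -0.3, 0.2]        # stand-in question feature vector
u1 = fuse(g, q)                  # k = 1
pooled = [0.1, 0.0, 0.2, -0.4]   # stand-in pooled image feature vector
u2 = fuse(pooled, u1)            # k = 2
```

Reusing FC_1 and FC_2 for both the first and later steps requires the two visual inputs to share one size and the question vector to share a size with the fusion vector. The sketch makes that explicit by using a single `DIM`; how the original model reconciles these sizes is not stated in the text.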
4. Attention decision module
In the invention, the attention decision module uses a recurrent neural network RNN and the fully connected networks FC_4, FC_5, FC_6 and FC_7 to decide the attention box position (x, y) and its length and width (a, b). h_{agent,0} is the zero vector.
h_{agent,k+1} = RNN(h_{agent,k}, u_k)
where h_{agent,k} is the current internal history state.
x' = FC_4(h_{agent,k+1})
y' = FC_5(h_{agent,k+1})
a' = FC_6(h_{agent,k+1})
b' = FC_7(h_{agent,k+1})
where the activation function of FC_4, FC_5, FC_6 and FC_7 is the hyperbolic tangent (tanh), but their weight parameters differ.
x = x' + φ_1
y = y' + φ_2
a = a' + φ_3
b = b' + φ_4
φ_1, φ_2, φ_3 and φ_4 are random numbers each drawn from a normal distribution with mean 0 and variance 1. Adding random noise to the decision result increases the search ability of the model and helps it find the optimal solution. The results satisfy -1≤x≤1, -1≤y≤1, 0≤a≤1 and 0≤b≤1.
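The noise step can be sketched as below. The text states the ranges of x, y, a and b but not how they are enforced after noise is added, so the clamping used here is an assumption, and the names `clamp` and `sample_box` are hypothetical.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, a, b)


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_box(mean_box: Box, rng: random.Random) -> Box:
    """Add unit-variance Gaussian noise to (x', y', a', b'), then clamp.

    Clamping is an assumed way of meeting -1<=x,y<=1 and 0<=a,b<=1.
    """
    x_m, y_m, a_m, b_m = mean_box
    x = clamp(x_m + rng.gauss(0.0, 1.0), -1.0, 1.0)
    y = clamp(y_m + rng.gauss(0.0, 1.0), -1.0, 1.0)
    a = clamp(a_m + rng.gauss(0.0, 1.0), 0.0, 1.0)
    b = clamp(b_m + rng.gauss(0.0, 1.0), 0.0, 1.0)
    return (x, y, a, b)


rng = random.Random(0)
box = sample_box((0.1, -0.2, 0.5, 0.5), rng)
```

With unit variance, the noise is large relative to the allowed ranges, so many samples land on the boundaries after clamping. That favours exploration, which is the purpose the text gives for the noise.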
5. Feature extraction pooling module
In the invention, the feature extraction pooling module selects the image feature vector from the spatial image feature v according to the fusion feature vector u_k. As shown in Fig. 3, in the spatial image feature v_{2048×14×14}, the features of the rectangular region of length a and width b centered on (x, y) are selected, and a mean pooling operation is then applied to them to obtain a one-dimensional feature vector. The feature acquisition operation is as follows:
v_{2048×a×b} = selector(v_{2048×m×n}, x, y, a, b)
\tilde{v}_k = AP(v_{2048×a×b})
where AP (average pooling) denotes the mean pooling operation and \tilde{v}_k is the resulting image feature vector.
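A sketch of the select-then-pool operation on a feature map stored as nested lists `v[channel][row][column]`. The mapping from normalised coordinates to grid cells is not given in the text, so the choices below are assumptions: x and y in [-1, 1] are mapped linearly onto the column and row indices, a is taken as the horizontal extent and b as the vertical extent, and every box covers at least one cell. The function name `crop_and_pool` is hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as v[channel][row][column]


def crop_and_pool(v: FeatureMap, x: float, y: float, a: float, b: float) -> List[float]:
    """Crop the box centred on (x, y) with extent (a, b) and mean-pool each channel."""
    m, n = len(v[0]), len(v[0][0])
    width = max(1, round(a * n))     # box extent in columns (assumed mapping)
    height = max(1, round(b * m))    # box extent in rows (assumed mapping)
    cx = (x + 1.0) / 2.0 * (n - 1)   # centre column for x in [-1, 1]
    cy = (y + 1.0) / 2.0 * (m - 1)   # centre row for y in [-1, 1]
    # Shift the box so that it stays entirely inside the feature map.
    left = min(max(0, round(cx - (width - 1) / 2.0)), n - width)
    top = min(max(0, round(cy - (height - 1) / 2.0)), m - height)
    pooled = []
    for channel in v:
        cells = [channel[r][c]
                 for r in range(top, top + height)
                 for c in range(left, left + width)]
        pooled.append(sum(cells) / len(cells))
    return pooled
```

The output always has one value per channel regardless of the box size, which is what allows the pooled vector to be fed to the same fully connected layer at every step.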
6. Answer reasoning module
In the invention, the answer reasoning module uses a multi-layer perceptron (MLP) to infer the answer to question Q.
The reasoning process is as follows:
h = FC_9(FC_8(u_3))
\hat{i} = argmax_i h_i
where h is the vector of candidate answer scores, i is the index of a candidate answer, h_i is the i-th score in the vector h, and \hat{i} is the index of the answer with the highest score among all answers; the model uses the index \hat{i} to find the corresponding answer in the candidate answer set. Answer reasoning is carried out by a multi-layer perceptron consisting of the two fully connected layers FC_8 and FC_9. The activation function of FC_8 is ReLU; FC_9 has no activation function and uses only the linear mapping.
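The final lookup, taking the index of the highest score and returning the matching candidate, can be sketched in a few lines. The function name `pick_answer` and the example values are illustrative only.

```python
from typing import List, Sequence


def pick_answer(scores: Sequence[float], candidates: List[str]) -> str:
    """Return the candidate whose score is highest (first one on ties)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best_index]


answer = pick_answer([0.1, 2.5, -1.0], ["yes", "22", "no"])
```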
The adaptive attention decision process is learned by reinforcement learning. The environment first provides the fusion feature currently used for inferring the answer. Based on the current fusion feature, the agent judges which visual information still needs to be supplemented, that is, it gives the position from which visual information should be acquired next. After receiving the position, the environment extracts the visual information at that position from the spatial features, fuses it with the current fused information to obtain a new fusion feature, and passes this back to the agent. After several iterations, the last fusion feature is used to infer the answer. If the answer is correct, the current decisions receive a reward of 1; otherwise the reward is 0. The agent changes its attention decision policy according to the reward.
Environment: the spatial image feature, the feature extraction pooling module and the information fusion module.
Agent: the attention decision module.
State: the fusion feature.
Action: the position (x, y) of the attention box and its length a and width b.
Policy: the mapping from the current fusion feature to the position (x, y), length a and width b of the attention box. The decision function π_θ is the process of computing attention, and θ denotes the parameters of the function.
Reward: the reward function r_i, where r_i is 1 when the correct answer can be inferred from the fusion feature, and r_i is 0 otherwise.
A question about an image generally requires the attended position to be adjusted several times before the series of information needed for the answer is found accurately. This process is treated as a sequential decision process: image region features are selected one after another, these features are then fused, and the answer is inferred. The invention uses reinforcement learning to learn the process of positioning attention on the image. Each time, one region of the image relevant to the question is selected and the information in that region is recorded.
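The interaction between agent and environment described above can be sketched as one episode. All five callables (`fuse`, `decide`, `pool`, `infer`) are hypothetical placeholders for the modules, and passing `None` as the initial agent state stands in for the zero vector. The sketch follows the module description, in which K fusions involve K-1 attention decisions.

```python
from typing import Any, Callable, List, Tuple


def run_episode(g: Any, q: Any, v: Any, label: Any, total_fusions: int,
                fuse: Callable[[Any, Any], Any],
                decide: Callable[[Any, Any], Tuple[Any, Any]],
                pool: Callable[[Any, Any], Any],
                infer: Callable[[Any], Any]) -> Tuple[Any, int, List[Any]]:
    """Run K fusions with K-1 attention decisions and return (answer, reward, boxes)."""
    state = None                  # placeholder for the zero initial agent state
    u = fuse(g, q)                # first fusion: global image feature with question
    boxes = []
    for _ in range(total_fusions - 1):
        state, box = decide(state, u)   # agent: choose where to look next
        boxes.append(box)
        u = fuse(pool(v, box), u)       # environment: pool the region and re-fuse
    answer = infer(u)
    reward = 1 if answer == label else 0  # binary reward shared by all steps
    return answer, reward, boxes
```

Because the reward is only known once the final answer is produced, every decision in the episode receives the same value, which matches the scoring rule in step S1 of the method.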
As shown in Fig. 4, an image visual question-answering method based on attention decision includes the following steps:
S1. Train the attention-decision-based image visual question-answering model with p training samples; in training, each sample goes through K attention decisions to obtain an inferred answer, and the inferred answer is compared with the label of that training sample; if they are the same, the score of each of the K attention steps is 1, otherwise it is 0;
S2. Taking the loss function L as the objective function, optimize the network parameters by mini-batch stochastic gradient descent to obtain the trained reinforcement-learning-based image visual question-answering model;
where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b), and r_{ij} is the score of the i-th attention step of the j-th sample.
S3. Input the image I to be tested and the question Q into the trained reinforcement-learning-based image visual question-answering model to obtain the final result.
The label of a training sample represents the true answer. If the information selected by the position box is unrelated to the question, the resulting answer will be wrong, so the reward obtained is small and the probability of choosing that position next time decreases. Under this mechanism, information unrelated to the question therefore participates less in the computation.
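The loss formula itself does not appear in this text. As a hedged illustration only, the sketch below assumes a standard REINFORCE-style objective, the negative batch average of r_{ij}·log π_θ, and assumes that π_θ is a Gaussian with unit variance centred on (x', y', a', b'), which is what adding unit-variance noise to the network outputs implies. The function names and the batch layout are hypothetical, and the original formula may differ in sign convention, normalisation or additional terms.

```python
import math
from typing import List, Sequence, Tuple

Step = Tuple[Sequence[float], Sequence[float], int]  # (sampled box, mean box, reward)


def log_prob_box(sampled: Sequence[float], mean: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian with unit variance centred on `mean`."""
    return sum(-0.5 * (s - m) ** 2 - 0.5 * math.log(2.0 * math.pi)
               for s, m in zip(sampled, mean))


def reinforce_loss(batch: List[List[Step]]) -> float:
    """Assumed loss: minus the batch mean of sum_i r_ij * log pi(box_ij)."""
    total = 0.0
    for steps in batch:                    # one entry per training sample j
        for sampled, mean, reward in steps:  # one entry per attention step i
            total += reward * log_prob_box(sampled, mean)
    return -total / len(batch)
```

Under this assumed form, a sample whose answer is wrong has reward 0 at every step and contributes nothing to the gradient, while a correct sample raises the probability of the boxes that led to it.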
As shown in Fig. 5, the input image I is a scene of an athlete playing baseball, and the input question Q is "what is the player's number?". The figure shows how attention changes while the question is answered. For "what is the player's number?", the model first locates the athlete, then locates the number on the athlete's uniform, and finally gives the answer "22".
The above are only preferred specific embodiments of the application, but the scope of protection of the application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the application shall fall within the scope of protection of the application. The scope of protection of the application shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. An image visual question-answering model based on attention decision, characterized in that the model includes:
A visual information extraction module, for extracting the global image feature g and the spatial image feature v of image I; the global image feature g is sent to the information fusion module, and the spatial image feature v is sent to the feature extraction pooling module;
A question analysis module, for extracting the question feature vector q of question Q and sending it to the information fusion module;
An information fusion module which, when k=1, receives and fuses the global image feature g from the visual information extraction module and the question feature vector q from the question analysis module to obtain the fusion feature vector u_k; or, when k=2,...,K, receives and fuses the fusion feature vector u_{k-1} and the image feature vector from the feature extraction pooling module to obtain the fusion feature vector u_k; when k=1,...,K-1, the fusion feature vector u_k is sent to the attention decision module, and when k=K, the fusion feature vector u_k is sent to the answer reasoning module, where k denotes the fusion index and K denotes the total number of fusions;
An attention decision module, for receiving the fusion feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature extraction pooling module;
A feature extraction pooling module, for receiving the spatial image feature v from the visual information extraction module and the attention box L_k from the attention decision module, and obtaining the image feature vector;
An answer reasoning module, for receiving the fusion feature vector u_K from the information fusion module and inferring the answer to question Q.
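2. The model of claim 1, characterized in that the fusion feature vector u_k is obtained in the following manner:
u_1 = FC_3([FC_1(g), FC_2(q)]), for k=1
u_k = FC_3([FC_1(\tilde{v}_{k-1}), FC_2(u_{k-1})]), for k=2,...,K
where FC_1, FC_2 and FC_3 are fully connected neural networks, the operator [,] denotes the concatenation of two vectors, and \tilde{v}_{k-1} denotes the image feature vector returned by the feature extraction pooling module.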
3. The model of claim 1 or 2, characterized in that the attention box L_k is decided as follows:
h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x' = FC_4(h_{agent,k+1})
y' = FC_5(h_{agent,k+1})
a' = FC_6(h_{agent,k+1})
b' = FC_7(h_{agent,k+1})
x = x' + φ_1
y = y' + φ_2
a = a' + φ_3
b = b' + φ_4
where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is the zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, φ_1, φ_2, φ_3 and φ_4 are random numbers each drawn from a normal distribution with mean 0 and variance 1, (x', y') is the decided attention box position before noise is added, (a', b') is the decided attention box length and width before noise is added, (x, y) is the decided attention box position after noise is added, and (a, b) is the decided attention box length and width after noise is added.
4. The model of claim 3, characterized in that, in the spatial image feature v, the features of the rectangular region of length a and width b centered on (x, y) are selected and then pooled to obtain a one-dimensional image feature vector.
5. The model of any one of claims 1 to 4, characterized in that the adaptive attention decision process is learned by reinforcement learning.
6. An image visual question-answering method based on attention decision, characterized in that the method includes the following steps:
S1. Train the attention-decision-based image visual question-answering model of any one of claims 1 to 5 with p training samples, and compare the inferred answer for each training sample with the label of that sample; if they are the same, the score of every attention step of that training sample is 1, otherwise it is 0;
S2. Construct a loss function L based on r_{ij}, take L as the objective function, and optimize the network parameters to obtain the trained attention-decision-based image visual question-answering model, where r_{ij} is the score of the i-th attention step of the j-th sample, j=1,...,p, i=1,...,K;
S3. Input the image I to be tested and the question Q into the trained attention-decision-based image visual question-answering model to obtain the final answer.
7. The method of claim 6, characterized in that mini-batch stochastic gradient descent is used for optimization.
8. The method of claim 6, characterized in that the loss function L is calculated as follows:
where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
9. An image visual question-answering system based on attention decision, characterized in that the system uses the attention-decision-based image visual question-answering method of any one of claims 6 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium and, when executed by a processor, implements the attention-decision-based image visual question-answering method of any one of claims 6 to 8.
CN201910355026.0A 2019-04-29 2019-04-29 Image visual question-answering model, method and system based on attention decision Expired - Fee Related CN110134774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910355026.0A CN110134774B (en) 2019-04-29 2019-04-29 Image visual question-answering model, method and system based on attention decision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910355026.0A CN110134774B (en) 2019-04-29 2019-04-29 Image visual question-answering model, method and system based on attention decision

Publications (2)

Publication Number Publication Date
CN110134774A CN110134774A (en) 2019-08-16
CN110134774B CN110134774B (en) 2021-02-09

Family

ID=67575681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910355026.0A Expired - Fee Related CN110134774B (en) 2019-04-29 2019-04-29 Image visual question-answering model, method and system based on attention decision

Country Status (1)

Country Link
CN (1) CN110134774B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598573A (en) * 2019-08-21 2019-12-20 中山大学 Visual problem common sense inference model and method based on multi-domain heterogeneous graph guidance
CN110704668A (en) * 2019-09-23 2020-01-17 北京影谱科技股份有限公司 Grid-based collaborative attention VQA method and apparatus
CN110990630A (en) * 2019-11-29 2020-04-10 清华大学 Video question-answering method based on graph modeling visual information and guided by using questions
CN111260228A (en) * 2020-01-18 2020-06-09 西安科技大学 Multi-stage task system performance evaluation method and device
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111539292A (en) * 2020-04-17 2020-08-14 中山大学 Action decision model and method for presenting scene question-answering task
CN111754784A (en) * 2020-06-23 2020-10-09 高新兴科技集团股份有限公司 Attention mechanism-based vehicle main and sub brand identification method of multilayer network
CN111814843A (en) * 2020-03-23 2020-10-23 同济大学 End-to-end training method and application of image feature module in visual question-answering system
CN111831813A (en) * 2020-09-21 2020-10-27 北京百度网讯科技有限公司 Dialog generation method, dialog generation device, electronic equipment and medium
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113205507A (en) * 2021-05-18 2021-08-03 合肥工业大学 Visual question answering method, system and server
CN113222026A (en) * 2021-05-18 2021-08-06 合肥工业大学 Method, system and server for visual question answering of locomotive depot scene
CN113420833A (en) * 2021-07-21 2021-09-21 南京大学 Visual question-answering method and device based on question semantic mapping
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107679582A (en) * 2017-10-20 2018-02-09 深圳市唯特视科技有限公司 Method for visual question answering based on a multi-modal decomposition model
CN108170816A (en) * 2017-12-31 2018-06-15 厦门大学 Intelligent visual question-answering model based on a deep neural network
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image question answering method, device, system and storage medium
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Open-domain visual question answering method and device incorporating external knowledge
CN108959396A (en) * 2018-06-04 2018-12-07 众安信息技术服务有限公司 Machine reading model training method and device, question answering method and device
CN109255359A (en) * 2018-09-27 2019-01-22 南京邮电大学 Visual question answering problem-solving method based on complex network analysis

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
QI WU ET AL: "Image Captioning and Visual Question Answering Based on Attributes and External Knowledge", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
SHENG ZHANG ET AL: "Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection", 《IEEE ACCESS》 *
XIAO LIN ET AL: "Leveraging Visual Question Answering for Image-Caption Ranking", 《EUROPEAN CONFERENCE ON COMPUTER VISION 》 *
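刘海宾: "Research on Visual Question Answering Methods Based on Visual Attention", 《China Master's Theses Full-text Database, Information Science and Technology》 *
李艳: "Research on Image Retrieval Methods Based on the Visual Attention Mechanism", 《China Master's Theses Full-text Database, Information Science and Technology》 *
高静静 et al.: "Research on Visual Attention Models Applied to Image Retrieval", 《Measurement & Control Technology》 *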

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598573A (en) * 2019-08-21 2019-12-20 中山大学 Visual problem common sense inference model and method based on multi-domain heterogeneous graph guidance
CN110598573B (en) * 2019-08-21 2022-11-25 中山大学 Visual problem common sense reasoning model and method based on multi-domain heterogeneous graph guidance
CN110704668B (en) * 2019-09-23 2022-11-04 北京影谱科技股份有限公司 Grid-based collaborative attention VQA method and device
CN110704668A (en) * 2019-09-23 2020-01-17 北京影谱科技股份有限公司 Grid-based collaborative attention VQA method and apparatus
CN110990630A (en) * 2019-11-29 2020-04-10 清华大学 Video question-answering method based on graph modeling visual information and guided by using questions
CN110990630B (en) * 2019-11-29 2022-06-24 清华大学 Video question-answering method based on graph modeling visual information and guided by using questions
CN111260228A (en) * 2020-01-18 2020-06-09 西安科技大学 Multi-stage task system performance evaluation method and device
CN111260228B (en) * 2020-01-18 2023-06-23 西安科技大学 Multi-stage task system performance evaluation method and device
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111814843A (en) * 2020-03-23 2020-10-23 同济大学 End-to-end training method and application of image feature module in visual question-answering system
CN111814843B (en) * 2020-03-23 2024-02-27 同济大学 End-to-end training method and application of image feature module in visual question-answering system
CN111539292A (en) * 2020-04-17 2020-08-14 中山大学 Action decision model and method for presenting scene question-answering task
CN111539292B (en) * 2020-04-17 2023-07-07 中山大学 Action decision model and method for question-answering task with actualized scene
CN111754784A (en) * 2020-06-23 2020-10-09 高新兴科技集团股份有限公司 Attention mechanism-based vehicle main and sub brand identification method of multilayer network
CN111754784B (en) * 2020-06-23 2022-05-24 高新兴科技集团股份有限公司 Method for identifying main and sub brands of vehicle based on multi-layer network of attention mechanism
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111831813A (en) * 2020-09-21 2020-10-27 北京百度网讯科技有限公司 Dialog generation method, dialog generation device, electronic equipment and medium
CN113010656A (en) * 2021-03-18 2021-06-22 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113222026B (en) * 2021-05-18 2022-11-11 合肥工业大学 Method, system and server for visual question answering of locomotive depot scene
CN113205507B (en) * 2021-05-18 2023-03-10 合肥工业大学 Visual question answering method, system and server
CN113222026A (en) * 2021-05-18 2021-08-06 合肥工业大学 Method, system and server for visual question answering of locomotive depot scene
CN113205507A (en) * 2021-05-18 2021-08-03 合肥工业大学 Visual question answering method, system and server
CN113420833A (en) * 2021-07-21 2021-09-21 南京大学 Visual question-answering method and device based on question semantic mapping
CN113420833B (en) * 2021-07-21 2023-12-26 南京大学 Visual question answering method and device based on semantic mapping of questions
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device

Also Published As

Publication number Publication date
CN110134774B (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN110134774A (en) Image visual question-answering model, method and system based on attention decision
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
Luus et al. Multiview deep learning for land-use classification
CN110598029A (en) Fine-grained image classification method based on attention transfer mechanism
CN109902798A (en) Training method and device for a deep neural network
CN108846314A (en) Food material identification system and food material identification method based on deep learning
CN112052886A (en) Human body action attitude intelligent estimation method and device based on convolutional neural network
CN108510194A (en) Risk control model training method, risk identification method, device, equipment and medium
CN106909924A (en) Fast remote sensing image retrieval method based on deep saliency
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN111666919B (en) Object identification method and device, computer equipment and storage medium
CN109948526A (en) Image processing method and device, detection device and storage medium
CN110070107A (en) Object identification method and device
CN110222770A (en) Visual question answering method based on a combinational relation attention network
CN113190688B (en) Complex network link prediction method and system based on logical reasoning and graph convolution
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN106991666A (en) Disease geo-radar image recognition method suitable for multi-size image information
CN111681178A (en) Knowledge distillation-based image defogging method
CN114463675B (en) Underwater fish group activity intensity identification method and device
CN113360621A (en) Scene text visual question-answering method based on modal inference graph neural network
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
CN106022293A (en) Pedestrian re-identification method of evolutionary algorithm based on self-adaption shared microhabitat
Hooshyar et al. ImageLM: Interpretable image-based learner modelling for classifying learners’ computational thinking
Radwan Neutrosophic applications in e-learning: Outcomes, challenges and trends
CN115909027A (en) Situation estimation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210209