CN110134774A - It is a kind of based on the image vision Question-Answering Model of attention decision, method and system - Google Patents
- Publication number
- CN110134774A (application CN201910355026.0A)
- Authority
- CN
- China
- Prior art keywords
- decision
- attention
- module
- image
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention discloses an image visual question-answering model, method and system based on attention decision, belonging to the field of open-ended image visual question answering. The model comprises: an information fusion module which, when k = 1, fuses the global image feature g and the question feature vector q to obtain the fused feature vector u_k, and, when k = 2, …, K, fuses u_{k-1} and the image feature vector ṽ_{k-1} to obtain u_k; an attention decision module which receives the fused feature vector u_k, decides an attention box L_k, and sends it to the feature-extraction pooling module; a feature-extraction pooling module which receives the spatial image feature v and the attention box L_{k-1} and obtains the image feature vector ṽ_{k-1}; and an answer reasoning module which receives the fused feature vector u_K and infers the answer to the question. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question can be selected adaptively. The model can be trained end to end, making the learned features more question-specific.
Description
Technical field
The invention belongs to the field of open-ended image visual question answering, and more particularly relates to an image visual question-answering model, method and system based on attention decision.
Background art
The development of deep learning has driven research on many high-level artificial-intelligence tasks, for example visual question answering (VQA). In image visual question answering, the input is a visual image together with an open-ended natural-language question about the image content, and an intelligent system outputs a natural-language answer by understanding the image and the question. Visual question answering permits automatic quantitative evaluation, which effectively tracks progress on the task. Because questions about a picture usually seek specific visual information, the answers to many questions contain only one to three words, so VQA algorithms can be assessed by the number of correctly answered questions. Fig. 1 shows the structure shared by most deep-learning-based visual question-answering models, which mainly comprise the following four modules: (1) a visual-information extraction module, generally a deep convolutional neural network (CNN); representative models include AlexNet, VGGNet, GoogLeNet and ResNet; (2) a question analysis module, generally a deep recurrent neural network (RNN), a long short-term memory network, a gated recurrent unit or a convolutional neural network; (3) a multimodal-information fusion module, common methods being element-wise addition, element-wise multiplication, concatenation and bilinear pooling; (4) an answer reasoning module, generally a multi-layer perceptron.
In the visual question-answering task, the answer is found in the image region relevant to the question. Question-guided image attention mechanisms are therefore an important approach to the task. The main goal of an attention mechanism is to solve the task by using local image features and allowing the model to assign different importance to the features of different regions.
The "soft attention" methods used throughout the prior art assign a weight to every region of the image. However, some image regions are unrelated to the question, and the weights of those regions should be set to zero, whereas the weights assigned by soft-attention methods are difficult to drive to zero. Noise information unrelated to the question is therefore introduced, which affects the final answer decision. On the other hand, some methods reason about the answer from object semantics: a pre-trained object-detection method is used to detect the objects in the image and obtain their feature vectors, and a soft-attention mechanism then assigns an attention weight to each object. But such multi-stage processing cannot be trained end to end, so the object features are not question-specific.
Summary of the invention
In view of the drawbacks of the prior art, the object of the invention is to solve the technical problems that, in the prior art, question-irrelevant visual features affect answer reasoning and the learned object features are not question-specific.
To achieve the above object, in a first aspect, an embodiment of the invention provides an image visual question-answering model based on attention decision, the model comprising:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I; the global image feature g is sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box L_k from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
Specifically, the fused feature vector u_k is obtained as follows:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k = 2, …, K)

where FC_1, FC_2 and FC_3 are fully connected neural networks and the operator [·, ·] denotes concatenation of two vectors.
Specifically, the attention box L_k is decided as follows:

h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})
x = x′ + φ_1,  y = y′ + φ_2,  a = a′ + φ_3,  b = b′ + φ_4

where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is a zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, and φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1. (x′, y′) is the decided attention-box position before adding noise, (a′, b′) the decided attention-box length and width before adding noise, (x, y) the attention-box position after adding noise, and (a, b) the attention-box length and width after adding noise.
Specifically, in the spatial image feature v, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then pooled to obtain the one-dimensional image feature vector ṽ_k.
Specifically, a reinforcement-learning method is used to learn an adaptive attention decision process.
In a second aspect, an embodiment of the invention provides an image visual question-answering method based on attention decision, the method comprising the following steps:
S1. Train the image visual question-answering model based on attention decision according to the first aspect with p training samples; compare whether the inferred answer of a training sample is identical to its label; if so, the score obtained by each attention step of that sample is 1, and otherwise 0.
S2. Construct the loss function L from r_ij; with L as the objective function, optimize the network parameters to obtain the trained image visual question-answering model based on attention decision, where r_ij is the score of the i-th attention step of the j-th sample, j = 1, …, p, i = 1, …, K.
S3. Input the test image I and question Q into the trained image visual question-answering model based on attention decision to obtain the final answer.

Specifically, optimization uses mini-batch stochastic gradient descent.

Specifically, the loss function L is calculated as follows:

L = −Σ_{j=1}^{p} Σ_{i=1}^{K} r_ij · log(π_θ(x, y, a, b))

where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
In a third aspect, an embodiment of the invention provides an image visual question-answering system based on attention decision, which uses the image visual question-answering method based on attention decision described in the second aspect above.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the image visual question-answering method based on attention decision described in the second aspect above.
In general, compared with the prior art, the above technical solutions contemplated by the invention have the following beneficial effects:
1. The invention uses reinforcement learning to learn the decision process of feature selection, so that visual features relevant to the question can be selected adaptively. The adjustment of attention over the image is treated as a sequential decision process: image region features are selected one after another, the selected features are fused, and the answer is inferred. The model can be trained end to end, making the learned features more question-specific.
2. Each time, the invention selects one image region relevant to the question and records the information of that region while ignoring information unrelated to the question, which effectively excludes interference.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of a prior-art visual question-answering model based on deep learning;
Fig. 2 is a structural schematic diagram of an image visual question-answering model based on reinforcement learning provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the feature-extraction pooling process provided by an embodiment of the invention;
Fig. 4 is a flow chart of an image visual question-answering method based on reinforcement learning provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of the attention decision process provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
As shown in Fig. 2, an image visual question-answering model based on attention decision comprises:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I; the global image feature g is sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
1. Visual-information extraction module
In the invention, the visual-information extraction module uses a convolutional neural network to extract the global image feature g and the spatial image feature v of the image I.
The input image is preprocessed: the input image I is unified into a 3-channel image I′ of size 244 × 244.
The deeper the layers of a convolutional neural network, the stronger their class-discriminative ability, while the feature vectors output by the convolutional layers retain spatial distribution information. Therefore, the feature map v_{d×m×n} of the last convolutional layer and the feature g of the last fully connected layer are extracted as the image features:

g = CNN_fc(I′)
v_{d×m×n} = CNN_conv(I′)

where d denotes the number of convolution kernels and m × n the spatial size of the image feature map. The embodiment of the invention preferably uses VGG16 with m = n = 14 and d = 2048.
2. Question analysis module
In the invention, the question analysis module extracts the question feature vector q of the question Q using GloVe word embeddings and a gated-recurrent-unit network.
First, each word of the question is one-hot encoded as c_i. If the length exceeds N, the excess is deleted; if the length falls short, null tokens pad the question to the specified length. The embodiment of the invention preferably uses N = 14.

Q′ = {c_1, c_2, c_3, …, c_N}

Using pre-trained 300-dimensional GloVe word vectors, the one-hot encoding of each word is converted into a 300-dimensional word vector, giving Q_we:

Q_we = WE(Q) = {w_1, w_2, …, w_N} ∈ R^{N×300}

where WE denotes the function that converts words into word vectors and w_i is the word vector corresponding to each word.
The w_i are input into a gated-recurrent-unit network (GRU) in order, yielding the representation vector q:

q = GRU(Q_we)
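As an illustration of the encoding step above, the following sketch runs a minimal GRU cell over a padded sequence of N = 14 word vectors. The hidden size, the randomly initialised weight matrices, and the random vectors standing in for pre-trained GloVe embeddings are all assumptions made for this example, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 300, 16  # word-vector size from the patent; hidden size H is illustrative

# Randomly initialised GRU parameters stand in for learned weights.
Wz, Uz = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wr, Ur = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
Wh, Uh = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, w):
    z = sigmoid(Wz @ w + Uz @ h)              # update gate
    r = sigmoid(Wr @ w + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ w + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

N = 14                                  # padded question length, as in the patent
Q_we = rng.normal(size=(N, D))          # stand-in for the GloVe vectors w_1..w_N
h = np.zeros(H)
for w in Q_we:                          # feed the words in order
    h = gru_step(h, w)
q = h                                   # question feature vector q = GRU(Q_we)
```

The final hidden state plays the role of the representation vector q sent to the information fusion module.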
3. Information fusion module
In the invention, the information fusion module consists of the fully connected neural networks FC_1, FC_2 and FC_3; their activation function is ReLU, but their weight parameters differ. The role of a fully connected layer is to map the data nonlinearly and extract abstract features. Here, when k = 1, FC_1 processes the global image feature g and FC_2 the question feature vector q; when k > 1, FC_1 processes the image feature vector and FC_2 the fused feature vector; FC_3 fuses the two input features to obtain the fused feature:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k > 1)

where the operator [·, ·] denotes concatenation of two vectors.
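The fusion rule above can be sketched as follows in NumPy. The layer sizes and the randomly initialised weights are illustrative stand-ins for the trained FC_1, FC_2 and FC_3, and the question feature is assumed (for simplicity) to have the same size as the fused feature:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

D_img, D_u = 2048, 32                      # illustrative sizes
W1 = rng.normal(0, 0.05, (D_u, D_img))     # FC1: image branch (g or v~_{k-1})
W2 = rng.normal(0, 0.05, (D_u, D_u))       # FC2: question / fused-feature branch
W3 = rng.normal(0, 0.05, (D_u, 2 * D_u))   # FC3: fuses the concatenation

def fuse(img_feat, other_feat):
    # u_k = FC3([FC1(image feature), FC2(question or previous fused feature)])
    return relu(W3 @ np.concatenate([relu(W1 @ img_feat),
                                     relu(W2 @ other_feat)]))

g = rng.normal(size=D_img)        # global image feature (k = 1)
q = rng.normal(size=D_u)          # question feature vector
u1 = fuse(g, q)                   # first fusion
v_tilde = rng.normal(size=D_img)  # region feature from the pooling module
u2 = fuse(v_tilde, u1)            # k = 2: fuse with the previous fused feature
```

The same three layers are reused at every step k; only their inputs change, matching the description above.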
4. Attention decision module
In the invention, the attention decision module uses a recurrent neural network RNN and the layers FC_4, FC_5, FC_6 and FC_7 to decide the attention-box position (x, y) and length and width (a, b). h_{agent,0} is a zero vector.

h_{agent,k+1} = RNN(h_{agent,k}, u_k)

where h_{agent,k} is the current internal history state.

x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})

where the activation function of FC_4, FC_5, FC_6 and FC_7 is the hyperbolic tangent (tanh), but their weight parameters differ.

x = x′ + φ_1
y = y′ + φ_2
a = a′ + φ_3
b = b′ + φ_4

φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1. Adding random noise to the decision result increases the search capability of the model and helps find the optimal solution. −1 ≤ x ≤ 1, −1 ≤ y ≤ 1, 0 ≤ a ≤ 1, 0 ≤ b ≤ 1.
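The decision step can be sketched as below. The RNN update, the four output heads and all sizes are illustrative assumptions (the patent does not fix the RNN variant or layer widths); clipping the noisy outputs into the stated ranges is also an assumption about how the bounds are enforced:

```python
import numpy as np

rng = np.random.default_rng(2)
U_dim, H = 32, 24  # fused-feature and RNN-state sizes (illustrative)

W_in = rng.normal(0, 0.1, (H, U_dim))
W_h  = rng.normal(0, 0.1, (H, H))
heads = rng.normal(0, 0.1, (4, H))  # rows play the roles of FC4..FC7

def decide(h_agent, u):
    h_next = np.tanh(W_in @ u + W_h @ h_agent)  # simple RNN state update
    x_, y_, a_, b_ = np.tanh(heads @ h_next)    # pre-noise box parameters
    noise = rng.normal(0.0, 1.0, size=4)        # phi_1..phi_4 ~ N(0, 1)
    x, y, a, b = np.array([x_, y_, a_, b_]) + noise
    # keep the noisy box inside the stated ranges (an assumption)
    x, y = np.clip([x, y], -1.0, 1.0)
    a, b = np.clip([a, b], 0.0, 1.0)
    return h_next, (x, y, a, b)

h0 = np.zeros(H)                 # h_agent,0 is a zero vector
u1 = rng.normal(size=U_dim)      # fused feature from the fusion module
h1, box = decide(h0, u1)         # decided attention box (x, y, a, b)
```

The Gaussian noise added to (x′, y′, a′, b′) is what makes the policy stochastic, which the reinforcement-learning training described later relies on.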
5. Feature-extraction pooling module
In the invention, the feature-extraction pooling module selects the image feature vector ṽ_k from the spatial image feature v according to the fused feature vector u_k. As shown in Fig. 3, in the spatial image feature v_{2048×14×14}, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then mean-pooled to obtain a one-dimensional feature vector. The feature is obtained as follows:

v_{2048×a×b} = selector(v_{2048×m×n}, x, y, a, b)
ṽ_k = AP(v_{2048×a×b})

where AP (average pooling) denotes the mean-pooling operation.
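A minimal sketch of the selector-plus-pooling step is shown below. How the normalised coordinates (x, y) ∈ [−1, 1] and sizes (a, b) ∈ [0, 1] map onto the 14 × 14 grid is not specified in the patent, so the mapping here is an assumption made for the example:

```python
import numpy as np

def extract_pool(v, x, y, a, b):
    """Select an a-by-b box centred at (x, y) from v (d, m, n), then mean-pool."""
    d, m, n = v.shape
    # map normalised coordinates/sizes onto the feature-map grid (an assumption)
    cx = int((x + 1) / 2 * (n - 1))
    cy = int((y + 1) / 2 * (m - 1))
    half_w = max(1, int(a * n / 2))
    half_h = max(1, int(b * m / 2))
    r0, r1 = max(0, cy - half_h), min(m, cy + half_h)
    c0, c1 = max(0, cx - half_w), min(n, cx + half_w)
    region = v[:, r0:r1, c0:c1]       # selector(v, x, y, a, b)
    return region.mean(axis=(1, 2))   # average pooling -> 1-D feature vector

rng = np.random.default_rng(3)
v = rng.normal(size=(2048, 14, 14))   # stand-in for the spatial image feature v
v_tilde = extract_pool(v, 0.0, 0.0, 0.5, 0.5)
```

The pooled vector ṽ_k has one entry per channel and is what the fusion module consumes at step k + 1.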
6. Answer reasoning module
In the invention, the answer reasoning module uses a multi-layer perceptron (MLP) to infer the answer to the question Q. The reasoning process is as follows:

h = FC_9(FC_8(u_3))
î = argmax_i h_i

where h is the candidate-answer score vector, i is the index of a candidate answer, h_i is the i-th score in the vector h, and î is the index of the highest-scoring answer among all answers; the model looks up the answer corresponding to î in the candidate-answer set. Answer reasoning uses a multi-layer perceptron comprising the two fully connected layers FC_8 and FC_9. The activation function of FC_8 is ReLU; FC_9 has no activation function, only the linear mapping.
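The two-layer scoring head can be sketched as follows. The layer sizes, the random weights and the toy five-answer candidate set are assumptions for illustration; a real VQA answer set would contain thousands of entries:

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda x: np.maximum(x, 0.0)

U_dim, H, A = 32, 64, 5               # fused size, hidden size, answer-set size
W8 = rng.normal(0, 0.1, (H, U_dim))   # FC8 (ReLU activation)
W9 = rng.normal(0, 0.1, (A, H))       # FC9 (linear, no activation)
answers = ["yes", "no", "red", "two", "baseball"]  # toy candidate set

u_K = rng.normal(size=U_dim)          # final fused feature vector
h = W9 @ relu(W8 @ u_K)               # candidate-answer score vector
i_star = int(np.argmax(h))            # index of the highest-scoring answer
answer = answers[i_star]              # look the answer up in the candidate set
```

Treating answer selection as classification over a fixed candidate set is what makes the one-to-three-word answers mentioned in the background tractable.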
An adaptive attention decision process is learned with reinforcement learning. The environment first gives the fused feature currently used to infer the answer. From the current fused feature, the agent judges which visual information is still needed, that is, it gives the position from which visual information should be acquired next. After obtaining the position, the environment extracts the visual information at that position from the spatial feature, fuses it with the current fused information into a new fused feature, and hands it to the agent again. After several iterations, the last fused feature is used to infer the answer. If the answer is correct, a reward of 1 is given to the current decision; otherwise the reward is 0. According to the reward, the agent changes its attention-decision strategy.
Environment: the spatial image feature, the feature-extraction pooling module and the information fusion module.
Agent: the attention decision module.
State: the fused feature.
Action: the position (x, y) and the length a and width b of the attention box.
Policy: the mapping from the current fused feature to the position (x, y) and the length a and width b of the attention box. The decision function π_θ is the process that computes the attention, and θ is the parameter of the function.
Reward: the reward function r_i ∈ {0, 1}; when the fused feature allows the answer to be inferred, r_i is 1, otherwise r_i is 0.
For a question about an image, the attended position usually has to be adjusted several times before the series of information needed to answer correctly is found. This process is regarded as a sequential decision process: image region features are selected one after another and then fused to infer the answer. The invention uses a reinforcement-learning method to learn the attention positioning process on the image. Each time, one image region relevant to the question is selected and the information of that region is recorded.
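The reward-weighted objective described above is a REINFORCE-style policy gradient. The sketch below computes such a loss for one episode, assuming (as the noise model in the decision module suggests) that each decision (x, y, a, b) is drawn from a unit-variance Gaussian centred on the network outputs; the concrete numbers are illustrative:

```python
import numpy as np

def reinforce_loss(log_probs, rewards):
    """REINFORCE objective: L = -sum_i r_i * log pi_theta(decision_i).

    log_probs[i] is log pi_theta(x, y, a, b) for the i-th attention decision;
    rewards[i] is 1 if the final answer matched the label, else 0.
    """
    return -np.sum(np.asarray(rewards) * np.asarray(log_probs))

def gaussian_log_prob(action, mean, std=1.0):
    # decisions are sampled from N(mean, std^2), so the log-density factorises
    a, mu = np.asarray(action), np.asarray(mean)
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2)
                        - (a - mu) ** 2 / (2 * std**2)))

# one episode with K = 3 attention decisions, all rewarded (correct answer)
lp = [gaussian_log_prob([0.1, 0.2, 0.5, 0.5], [0.0, 0.0, 0.4, 0.6])] * 3
loss_correct = reinforce_loss(lp, [1, 1, 1])
loss_wrong = reinforce_loss(lp, [0, 0, 0])  # zero reward -> zero contribution
```

With reward 0 the episode contributes nothing to the gradient, which is exactly the mechanism by which question-irrelevant position boxes become less likely to be chosen.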
As shown in Fig. 4, an image visual question-answering method based on attention decision comprises the following steps:
S1. Train the image visual question-answering model based on attention decision with p training samples; each training sample goes through K attention decisions to obtain an inferred answer; compare whether the inferred answer is identical to the label of the training sample; if so, the score obtained by each of the K attention steps is 1, and otherwise 0.
S2. With the loss function L as the objective function, optimize the network parameters by mini-batch stochastic gradient descent to obtain the trained image visual question-answering model based on reinforcement learning, where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b), and r_ij is the score of the i-th attention step of the j-th sample.
S3. Input the test image I and question Q into the trained image visual question-answering model based on reinforcement learning to obtain the final answer.
The label of a training sample represents the true answer. If the information selected by the position box is unrelated to the question, the obtained answer is necessarily wrong; little reward is therefore obtained, and the probability of choosing that position next time decreases. Under this mechanism, information unrelated to the question participates less in the computation.
As shown in Fig. 5, the input image I is a scene of an athlete playing baseball, and the input question Q is "What is the player's number?". The figure shows how the attention changes while the question is being answered: for "What is the player's number?", the model first locates the athlete, then locates the number on the athlete's uniform, and finally gives the answer "22".
The above are only preferred specific embodiments of the application, but the scope of protection of the application is not limited thereto. Any change or substitution that can easily be conceived by a person familiar with the art within the technical scope disclosed by the application shall be covered by the scope of protection of the application. Therefore, the scope of protection of the application shall be subject to the scope of protection of the claims.
Claims (10)
1. An image visual question-answering model based on attention decision, characterized in that the model comprises:
a visual-information extraction module for extracting the global image feature g and the spatial image feature v of an image I, the global image feature g being sent to the information fusion module and the spatial image feature v to the feature-extraction pooling module;
a question analysis module for extracting the question feature vector q of a question Q and sending it to the information fusion module;
an information fusion module which, when k = 1, receives and fuses the global image feature g from the visual-information extraction module and the question feature vector q from the question analysis module to obtain the fused feature vector u_k; or, when k = 2, …, K, receives and fuses u_{k-1} and the image feature vector ṽ_{k-1} from the feature-extraction pooling module to obtain u_k; when k = 1, …, K-1, u_k is sent to the attention decision module, and when k = K, u_k is sent to the answer reasoning module, where k denotes the fusion index and K the total number of fusions;
an attention decision module for receiving the fused feature vector u_k from the information fusion module, deciding an attention box L_k, and sending it to the feature-extraction pooling module;
a feature-extraction pooling module for receiving the spatial image feature v from the visual-information extraction module and the attention box L_k from the attention decision module and obtaining the image feature vector ṽ_k;
an answer reasoning module for receiving the fused feature vector u_K from the information fusion module and inferring the answer to question Q.
2. The model of claim 1, characterized in that the fused feature vector u_k is obtained as follows:

u_1 = FC_3([FC_1(g), FC_2(q)]),  u_k = FC_3([FC_1(ṽ_{k-1}), FC_2(u_{k-1})]) (k = 2, …, K)

where FC_1, FC_2 and FC_3 are fully connected neural networks and the operator [·, ·] denotes concatenation of two vectors.
3. The model of claim 1 or 2, characterized in that the attention box L_k is decided as follows:

h_{agent,k+1} = RNN(h_{agent,k}, u_k)
x′ = FC_4(h_{agent,k+1})
y′ = FC_5(h_{agent,k+1})
a′ = FC_6(h_{agent,k+1})
b′ = FC_7(h_{agent,k+1})
x = x′ + φ_1,  y = y′ + φ_2,  a = a′ + φ_3,  b = b′ + φ_4

where h_{agent,k} is the internal history state at the k-th decision, h_{agent,0} is a zero vector, RNN is a recurrent neural network, FC_4, FC_5, FC_6 and FC_7 are fully connected neural networks, φ_1, φ_2, φ_3 and φ_4 are random numbers drawn from a normal distribution with mean 0 and variance 1, (x′, y′) is the decided attention-box position before adding noise, (a′, b′) the decided attention-box length and width before adding noise, (x, y) the attention-box position after adding noise, and (a, b) the attention-box length and width after adding noise.
4. The model of claim 3, characterized in that, in the spatial image feature v, the feature of the rectangular region of length a and width b centred at (x, y) is selected and then pooled to obtain the one-dimensional image feature vector ṽ_k.
5. The model of any one of claims 1 to 4, characterized in that a reinforcement-learning method is used to learn an adaptive attention decision process.
6. An image visual question-answering method based on attention decision, characterized in that the method comprises the following steps:
S1. train the image visual question-answering model based on attention decision of any one of claims 1 to 5 with p training samples; compare whether the inferred answer of a training sample is identical to its label; if so, the score obtained by each attention step of that sample is 1, and otherwise 0;
S2. construct the loss function L from r_ij; with L as the objective function, optimize the network parameters to obtain the trained image visual question-answering model based on attention decision, where r_ij is the score of the i-th attention step of the j-th sample, j = 1, …, p, i = 1, …, K;
S3. input the test image I and question Q into the trained image visual question-answering model based on attention decision to obtain the final answer.
7. The method of claim 6, characterized in that optimization uses mini-batch stochastic gradient descent.
8. The method of claim 6, characterized in that the loss function L is calculated as follows:

L = −Σ_{j=1}^{p} Σ_{i=1}^{K} r_ij · log(π_θ(x, y, a, b))

where log(π_θ(x, y, a, b)) denotes the loss computed when the state is h_{agent,i+1} and the decision is (x, y, a, b).
9. An image visual question-answering system based on attention decision, characterized in that the system uses the image visual question-answering method based on attention decision of any one of claims 6 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium; when executed by a processor, the computer program implements the image visual question-answering method based on attention decision of any one of claims 6 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355026.0A CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355026.0A CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134774A true CN110134774A (en) | 2019-08-16 |
CN110134774B CN110134774B (en) | 2021-02-09 |
Family
ID=67575681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355026.0A Expired - Fee Related CN110134774B (en) | 2019-04-29 | 2019-04-29 | Image visual question-answering model, method and system based on attention decision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134774B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107679582A (en) * | 2017-10-20 | 2018-02-09 | 深圳市唯特视科技有限公司 | A kind of method that visual question and answer are carried out based on multi-modal decomposition model |
CN108228703A (en) * | 2017-10-31 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image answering method, device, system and storage medium |
CN108170816A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of intelligent vision Question-Answering Model based on deep neural network |
CN108959396A (en) * | 2018-06-04 | 2018-12-07 | 众安信息技术服务有限公司 | Machine reading model training method and device, answering method and device |
CN108920587A (en) * | 2018-06-26 | 2018-11-30 | 清华大学 | Merge the open field vision answering method and device of external knowledge |
CN109255359A (en) * | 2018-09-27 | 2019-01-22 | 南京邮电大学 | A kind of vision question and answer problem-solving approach based on Complex Networks Analysis method |
Non-Patent Citations (6)
Title |
---|
QI WU ET AL: "Image Captioning and Visual Question Answering Based on Attributes and External Knowledge", IEEE Transactions on Pattern Analysis and Machine Intelligence * |
SHENG ZHANG ET AL: "Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection", IEEE Access * |
XIAO LIN ET AL: "Leveraging Visual Question Answering for Image-Caption Ranking", European Conference on Computer Vision * |
LIU HAIBIN: "Research on Visual Question Answering Methods Based on Visual Attention", China Master's Theses Full-text Database, Information Science and Technology * |
LI YAN: "Research on Image Retrieval Methods Based on Visual Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology * |
GAO JINGJING ET AL: "Research on Visual Attention Models Applied to Image Retrieval", Measurement & Control Technology * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598573A (en) * | 2019-08-21 | 2019-12-20 | 中山大学 | Visual problem common sense inference model and method based on multi-domain heterogeneous graph guidance |
CN110598573B (en) * | 2019-08-21 | 2022-11-25 | 中山大学 | Visual problem common sense reasoning model and method based on multi-domain heterogeneous graph guidance |
CN110704668B (en) * | 2019-09-23 | 2022-11-04 | 北京影谱科技股份有限公司 | Grid-based collaborative attention VQA method and device |
CN110704668A (en) * | 2019-09-23 | 2020-01-17 | 北京影谱科技股份有限公司 | Grid-based collaborative attention VQA method and apparatus |
CN110990630A (en) * | 2019-11-29 | 2020-04-10 | 清华大学 | Video question-answering method based on graph modeling visual information and guided by using questions |
CN110990630B (en) * | 2019-11-29 | 2022-06-24 | 清华大学 | Video question-answering method based on graph modeling visual information and guided by using questions |
CN111260228A (en) * | 2020-01-18 | 2020-06-09 | 西安科技大学 | Multi-stage task system performance evaluation method and device |
CN111260228B (en) * | 2020-01-18 | 2023-06-23 | 西安科技大学 | Multi-stage task system performance evaluation method and device |
CN111325243A (en) * | 2020-02-03 | 2020-06-23 | 天津大学 | Visual relation detection method based on regional attention learning mechanism |
CN111814843A (en) * | 2020-03-23 | 2020-10-23 | 同济大学 | End-to-end training method and application of image feature module in visual question-answering system |
CN111814843B (en) * | 2020-03-23 | 2024-02-27 | 同济大学 | End-to-end training method and application of image feature module in visual question-answering system |
CN111539292A (en) * | 2020-04-17 | 2020-08-14 | 中山大学 | Action decision model and method for presenting scene question-answering task |
CN111539292B (en) * | 2020-04-17 | 2023-07-07 | 中山大学 | Action decision model and method for question-answering task with actualized scene |
CN111754784A (en) * | 2020-06-23 | 2020-10-09 | 高新兴科技集团股份有限公司 | Attention mechanism-based vehicle main and sub brand identification method of multilayer network |
CN111754784B (en) * | 2020-06-23 | 2022-05-24 | 高新兴科技集团股份有限公司 | Method for identifying main and sub brands of vehicle based on multi-layer network of attention mechanism |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN112100346B (en) * | 2020-08-28 | 2021-07-20 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112100346A (en) * | 2020-08-28 | 2020-12-18 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN111831813A (en) * | 2020-09-21 | 2020-10-27 | 北京百度网讯科技有限公司 | Dialog generation method, dialog generation device, electronic equipment and medium |
CN113010656A (en) * | 2021-03-18 | 2021-06-22 | 广东工业大学 | Visual question-answering method based on multi-mode fusion and structural control |
CN113222026B (en) * | 2021-05-18 | 2022-11-11 | 合肥工业大学 | Method, system and server for visual question answering of locomotive depot scene |
CN113205507B (en) * | 2021-05-18 | 2023-03-10 | 合肥工业大学 | Visual question answering method, system and server |
CN113222026A (en) * | 2021-05-18 | 2021-08-06 | 合肥工业大学 | Method, system and server for visual question answering of locomotive depot scene |
CN113205507A (en) * | 2021-05-18 | 2021-08-03 | 合肥工业大学 | Visual question answering method, system and server |
CN113420833A (en) * | 2021-07-21 | 2021-09-21 | 南京大学 | Visual question-answering method and device based on question semantic mapping |
CN113420833B (en) * | 2021-07-21 | 2023-12-26 | 南京大学 | Visual question answering method and device based on semantic mapping of questions |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114417044A (en) * | 2022-01-19 | 2022-04-29 | 中国科学院空天信息创新研究院 | Image question and answer method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110134774B (en) | 2021-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134774A (en) | Image visual question-answering model, method and system based on attention decision | |
CN110414432B (en) | Training method of object recognition model, object recognition method and corresponding device | |
Luus et al. | Multiview deep learning for land-use classification | |
CN110598029A (en) | Fine-grained image classification method based on attention transfer mechanism | |
CN109902798A (en) | Training method and device for deep neural networks | |
CN108846314A (en) | Food material identification system and identification method based on deep learning | |
CN112052886A (en) | Intelligent human action pose estimation method and device based on a convolutional neural network | |
CN108510194A (en) | Risk control model training method, risk identification method, apparatus, device and medium | |
CN106909924A (en) | Fast remote sensing image retrieval method based on deep saliency | |
CN108304826A (en) | Facial expression recognizing method based on convolutional neural networks | |
CN111666919B (en) | Object identification method and device, computer equipment and storage medium | |
CN109948526A (en) | Image processing method and device, detection device and storage medium | |
CN110070107A (en) | Object identification method and device | |
CN110222770A (en) | Visual question-answering method based on a combinational relation attention network | |
CN113190688B (en) | Complex network link prediction method and system based on logical reasoning and graph convolution | |
CN113128424B (en) | Method for identifying action of graph convolution neural network based on attention mechanism | |
CN106991666A (en) | Disease image recognition method suitable for multi-size image information | |
CN111681178A (en) | Knowledge distillation-based image defogging method | |
CN114463675B (en) | Underwater fish group activity intensity identification method and device | |
CN113360621A (en) | Scene text visual question-answering method based on modal inference graph neural network | |
CN112527993A (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN106022293A (en) | Pedestrian re-identification method using an evolutionary algorithm based on adaptive niche sharing | |
Hooshyar et al. | ImageLM: Interpretable image-based learner modelling for classifying learners’ computational thinking | |
Radwan | Neutrosophic applications in e-learning: Outcomes, challenges and trends | |
CN115909027A (en) | Situation estimation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210209 ||