CN110516791A - Visual question answering method and system based on multiple attention - Google Patents

Visual question answering method and system based on multiple attention

Info

Publication number
CN110516791A
CN110516791A (application CN201910770172.XA)
Authority
CN
China
Prior art keywords
term
information
short
answer
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910770172.XA
Other languages
Chinese (zh)
Other versions
CN110516791B (en)
Inventor
刘伟 (Liu Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingpu Technology Co Ltd
Original Assignee
Beijing Yingpu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingpu Technology Co Ltd
Priority claimed from application CN201910770172.XA
Publication of CN110516791A
Application granted
Publication of CN110516791B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a visual question answering method and system based on multiple attention. In the method provided by this application, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task. Based on the visual question answering method and system based on multiple attention provided by this application, memory modules are added to the computation of the R-CNN network to enrich the knowledge sources of the model during training and to generate more diverse and reasonable answers; question-related prior knowledge is fused while the memory networks extract the question information end to end, improving the overall accuracy of question answering.

Description

Visual question answering method and system based on multiple attention
Technical field
This application relates to the field of visual question answering, and in particular to a visual question answering method and system based on multiple attention.
Background technique
Visual question answering is a learning task involving both computer vision and natural language processing: given an input picture and a question, the computer must learn to output an answer that follows natural language rules and is logically consistent with the picture content. Depending on the question, the model may need to focus on only certain parts of the picture, and some questions require commonsense reasoning before an answer can be obtained. Visual question answering therefore demands a deeper semantic understanding of the image than general image captioning, and faces correspondingly greater challenges.
Existing models in the visual question answering field include the Deeper LSTM Q+norm I model and the VIS+LSTM model. Apart from achieving relatively high accuracy on simple questions with a single answer, these models are generally inaccurate in other respects: their structures are relatively simple, the content and form of their answers are limited, and they cannot correctly answer slightly more complex questions that require more prior knowledge and simple reasoning.
Summary of the invention
This application aims to overcome, or at least partially solve or mitigate, the above problems.
According to one aspect of this application, a visual question answering method based on multiple attention is provided, comprising:
obtaining the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
extracting the picture feature data from the picture information;
inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights;
connecting a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed.
Optionally, extracting the picture feature data from the picture information comprises:
extracting at least one feature region from the picture information using a region-based convolutional neural network (R-CNN);
extracting at least one piece of region feature information from each feature region through ResNet;
for each feature region, filtering the region feature information of that region according to the intersection over union (IoU), applying average pooling to the filtered region feature information, and thereby obtaining the region feature data of each feature region.
Optionally, before extracting the picture feature data from the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet on a preset data set;
fusing each region feature of a picture contained in the preset data set with the vector representing its true class;
passing the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
Optionally, inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights, comprises:
inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism; wherein, at each timestamp of the first long short-term memory network, the input includes the output of the second long short-term memory network at the previous timestamp, each piece of region feature data, and the output of the first long short-term memory network at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data;
obtaining the content information corresponding to each piece of region feature data based on its attention weight.
Optionally, connecting the second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed, comprises:
inputting the question information and the content information corresponding to each piece of region feature data simultaneously into the second long short-term memory network that performs semantic analysis; wherein the input at each timestamp of the second long short-term memory network includes the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
computing the answer to the question based on the output of the second long short-term memory network, outputting the result to a softmax layer, and selecting the word vector with the highest probability as the answer to the visual question answering task to be processed.
According to a further aspect of this application, a visual question answering system based on multiple attention is provided, comprising:
an information obtaining module, configured to obtain the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
a picture feature extraction module, configured to extract the picture feature data from the picture information;
a picture content obtaining module, configured to input the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and to obtain the picture content information corresponding to the picture feature data by assigning attention weights;
an answer output module, configured to connect a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the visual question answering task to be processed.
Optionally, the picture feature extraction module is configured to:
extract at least one feature region from the picture information using a region-based convolutional neural network (R-CNN);
extract at least one piece of region feature information from each feature region through ResNet;
for each feature region, filter the region feature information of that region according to the intersection over union (IoU), apply average pooling to the filtered region feature information, and thereby obtain the region feature data of each feature region.
Optionally, the system further comprises:
a pre-training module, configured to pre-train the R-CNN and/or ResNet on a preset data set; to fuse each region feature of a picture contained in the preset data set with the vector representing its true class; and to pass the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
Optionally, the picture content obtaining module is configured to:
input the picture feature data into the first long short-term memory network based on the attention mechanism; wherein, at each timestamp of the first long short-term memory network, the input includes the output of the second long short-term memory network at the previous timestamp, each piece of region feature data, and the output of the first long short-term memory network at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data;
obtain the content information corresponding to each piece of region feature data based on its attention weight.
Optionally, the answer output module is configured to:
input the question information and the content information corresponding to each piece of region feature data simultaneously into the second long short-term memory network that performs semantic analysis; wherein the input at each timestamp of the second long short-term memory network includes the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
compute the answer to the question based on the output of the second long short-term memory network, output the result to a softmax layer, and select the word vector with the highest probability as the answer to the visual question answering task to be processed.
This application provides a visual question answering method and system based on multiple attention. In the method provided by this application, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task.
Based on the visual question answering method and system based on multiple attention provided by this application, memory modules are added to the computation of the R-CNN network to enrich the knowledge sources of the model during training and to generate more diverse and reasonable answers; question-related prior knowledge is fused while the memory networks extract the question information end to end, improving the overall accuracy of question answering.
From the following detailed description of specific embodiments of this application, taken together with the accompanying drawings, the above and other objects, advantages, and features of this application will become clearer to those skilled in the art.
Detailed description of the invention
Some specific embodiments of this application are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. In the drawings, identical reference numerals denote identical or similar parts. Those skilled in the art should appreciate that these drawings are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flow diagram of the visual question answering method based on multiple attention according to an embodiment of this application;
Fig. 2 is a schematic diagram of the workflow of the two bidirectional LSTMs according to an embodiment of this application;
Fig. 3 is a schematic diagram of the picture information in a visual question answering example according to an embodiment of this application;
Fig. 4 is a schematic structural diagram of the visual question answering system based on multiple attention according to an embodiment of this application;
Fig. 5 is a schematic structural diagram of the visual question answering system based on multiple attention according to a preferred embodiment of this application;
Fig. 6 is a schematic diagram of a computing device according to an embodiment of this application;
Fig. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of this application.
Specific embodiment
Fig. 1 is a flow diagram of the visual question answering method based on multiple attention according to an embodiment of this application. As can be seen from Fig. 1, the visual question answering method based on multiple attention provided by this embodiment may include:
Step S101: obtaining the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
Step S102: extracting the picture feature data from the picture information;
Step S103: inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights;
Step S104: connecting a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed.
This embodiment of the application provides a visual question answering method based on multiple attention. In this method, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task.
A long short-term memory network (Long Short-Term Memory, abbreviated LSTM) is a type of recurrent neural network suited to processing and predicting events in a time series that are separated by relatively long intervals and delays. Systems based on LSTMs can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, disease prediction, click-through-rate and stock prediction, and music composition.
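As a minimal, self-contained sketch of the LSTM cell this method builds on (these are the standard gate equations, not anything specific to this patent; all dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # One step of a standard LSTM cell: forget (f), input (i), and output (o)
    # gates control what the cell state c forgets, adds, and exposes as h.
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(4)
dim_x, dim_h = 4, 3
W = rng.normal(size=(4 * dim_h, dim_x + dim_h)) * 0.1
b = np.zeros(4 * dim_h)
h = c = np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):   # process a 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (3,)
```

The gating is what lets the cell retain information across the long intervals mentioned above: the forget gate can keep the cell state nearly unchanged across many steps.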
Take the traditional visual question answering model Deeper LSTM Q+norm I as an example, where I denotes the extracted picture feature and norm I indicates that the 1024-dimensional pixel-semantic vector extracted by a CNN is L2-normalized. The image semantic information is extracted by a CNN, the textual semantic information contained in the question is obtained by an LSTM, and the data from the two networks are fused so that the model learns the meaning of the question; the fused data are finally fed into a multilayer perceptron (MLP) with a softmax output layer that generates the answer. For instance, the input may be an outdoor-scene image containing several objects and two people: the image is processed by a CNN without a classification layer, while the words of the question are fed in order into an RNN that extracts the question information; the two compressed representations are then fused and sent into the MLP to produce the result (for example, when the current question is a counting question). Specifically, this model encodes the question with a two-layer LSTM, divides the image into regions with a VGGNet model, and L2-normalizes the image features. The image and question features are then transformed into the same feature space and fused by dot product, and the fused information is fed into a three-layer MLP with softmax as the classifier to generate the answer. During training, the pre-trained CNN is frozen; only the LSTM layers and the final classification network participate in training.
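The fusion scheme just described — L2-normalizing the CNN image feature, projecting both modalities into a common feature space, and merging them by elementwise product — can be sketched as follows (a minimal NumPy illustration; the projection matrices and the 256-dimensional common space are assumptions for the example, not values from the model):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale the vector to unit L2 norm ("norm I"), guarding against division by zero.
    return v / (np.linalg.norm(v) + eps)

def fuse_features(img_feat, q_feat, W_img, W_q):
    # Project both modalities into the same feature space, then merge
    # them by elementwise (Hadamard) product, as the model does.
    img_common = W_img @ l2_normalize(img_feat)
    q_common = W_q @ q_feat
    return img_common * q_common

rng = np.random.default_rng(0)
img_feat = rng.normal(size=1024)   # 1024-d CNN pixel-semantic vector
q_feat = rng.normal(size=512)      # question encoding from the LSTM (size assumed)
W_img = rng.normal(size=(256, 1024))
W_q = rng.normal(size=(256, 512))

fused = fuse_features(img_feat, q_feat, W_img, W_q)
print(fused.shape)  # (256,)
```

The fused vector would then go into the three-layer MLP classifier described above.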
The basic structure of the VIS+LSTM model is to first extract the picture information using a CNN, followed by an LSTM that generates the prediction. However, considering that there is no complete standard for evaluating the accuracy of answer sentences, its authors restricted their attention to limited-domain questions whose answers are a single word, so that visual question answering can be treated as a multi-class classification problem and answers can be measured with existing accuracy evaluation criteria.
The overall accuracy of the models mentioned above is not high, and the content and form of their answers are relatively simple.
In this embodiment, the first long short-term memory network based on the attention mechanism (hereinafter simply the attention LSTM) mainly trains a model to learn selectively from the input sequence and, when producing output, to selectively focus on the relevant parts of the input. The second long short-term memory network that performs semantic analysis (hereinafter simply the semantic LSTM) mainly mines and learns the deeper concepts in the text and the picture, and can avoid the vanishing gradient problem. Both the attention LSTM and the semantic LSTM are preferably bidirectional LSTMs; they communicate and cooperate with each other to combine the picture and the question in the visual question answering task, so as to output the answer accurately and quickly.
Normally, in visual question answering, a picture and a natural language question are input first, the picture information is then attended to according to the question, and a natural language answer is generated as output. Therefore, when solving a visual question answering task, the above step S101 is performed first to obtain the picture information and the question information.
Next, step S102 can be performed to analyze the picture information and extract its feature data. Optionally, this may include: extracting at least one feature region from the picture information using R-CNN; extracting at least one piece of region feature information from each feature region through ResNet; and then, for each feature region, filtering the region feature information of that region according to the IoU, applying average pooling to the filtered region feature information, and thereby obtaining the feature data of each feature region.
The role of R-CNN is to find the feature regions of interest in the picture information; ResNet is then used to extract the region feature information from the extracted feature regions.
The full name of R-CNN is Region-CNN; it was the first algorithm to successfully apply deep learning to object detection. R-CNN implements object detection based on convolutional neural networks (CNNs), linear regression, support vector machines (SVMs), and related algorithms.
ResNet, also called a deep residual network, differs from ordinary networks in that it introduces skip connections, which allow the information of one residual block to flow unimpeded into the next. This improves the flow of information and also avoids the vanishing gradient and degradation problems caused by overly deep networks.
After the region feature information of each feature region is extracted, the region feature information within each feature region can be filtered according to the IoU. IoU, short for Intersection over Union, is a standard for measuring the accuracy of detecting a target object in a particular data set. By comparing the IoU value against a preset threshold, the feature data in a region are filtered, further condensing the picture information. Finally, the filtered features are average-pooled into a convolutional feature representation, yielding the region feature data of each feature region; in addition, these can be concatenated to obtain a feature concatenation map of the picture information in the question answering system, which serves as the output of the R-CNN.
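The IoU-based filtering and average pooling described above can be sketched as follows (a minimal NumPy illustration; the (x1, y1, x2, y2) box format, the 0.5 threshold, and the feature sizes are assumptions for the example, not values fixed by the patent):

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). IoU = intersection area / union area.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def region_feature(region_box, boxes, feats, iou_thresh=0.5):
    # Keep only the features whose boxes overlap the region strongly enough,
    # then average-pool the kept features into one region feature vector.
    kept = [f for b, f in zip(boxes, feats) if iou(region_box, b) >= iou_thresh]
    return np.mean(kept, axis=0) if kept else np.zeros_like(feats[0])

region = (0, 0, 10, 10)
boxes = [(0, 0, 10, 10), (1, 1, 9, 9), (20, 20, 30, 30)]   # last box does not overlap
feats = [np.ones(4), 3 * np.ones(4), 100 * np.ones(4)]
print(region_feature(region, boxes, feats))  # [2. 2. 2. 2.]
```

The per-region vectors produced this way could then be concatenated into the feature map that serves as the R-CNN output.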
In an alternative embodiment of the present invention, before feature extraction with R-CNN and ResNet, the R-CNN and ResNet can first be pre-trained, and the regions of interest fused with their possible classes. Specifically, this may include: pre-training the R-CNN and/or ResNet on a preset data set; fusing each region feature of a picture contained in the preset data set with the vector representing its true class; and passing the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
In this embodiment, pre-training the R-CNN and ResNet optimizes the model parameters so that the model can find regions of interest. The preset data set can preferably be COCO, short for Common Objects in COntext, a data set usable for image recognition. The images in the MS COCO data set are divided into training, validation, and test sets; COCO collects images by searching a search engine for 80 object categories and various scene types. The COCO data set currently has three annotation types: object instances, object keypoints, and image captions, stored as JSON files. Compared with existing models such as the Deeper LSTM Q+norm I model and the VIS+LSTM model, this yields higher accuracy in question answering.
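The pre-training step — fusing each region feature with the vector representing its true class, passing the fused vector through a fully connected layer, and classifying with separate attribute and non-attribute softmax heads — can be sketched like this (using concatenation as the fusion operator; the layer sizes and the 40 attribute classes are illustrative assumptions, since the patent does not fix them, while the 80 classes follow the COCO category count):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pretrain_forward(region_feat, true_class_vec, W_fc, W_attr, W_nonattr):
    # Fuse the region feature with the true-class vector (here by
    # concatenation), pass it through a fully connected layer, and
    # classify with two softmax heads: attribute and non-attribute classes.
    fused = np.concatenate([region_feat, true_class_vec])
    hidden = np.tanh(W_fc @ fused)
    return softmax(W_attr @ hidden), softmax(W_nonattr @ hidden)

rng = np.random.default_rng(1)
region_feat = rng.normal(size=2048)       # ResNet region feature (size assumed)
true_class_vec = np.eye(80)[3]            # one-hot over the 80 COCO categories
W_fc = rng.normal(size=(512, 2048 + 80)) * 0.01
W_attr = rng.normal(size=(40, 512)) * 0.01      # 40 attribute classes (assumed)
W_nonattr = rng.normal(size=(80, 512)) * 0.01
p_attr, p_nonattr = pretrain_forward(region_feat, true_class_vec, W_fc, W_attr, W_nonattr)
print(p_attr.shape, round(float(p_attr.sum()), 6))  # (40,) 1.0
```

During pre-training, the two softmax outputs would be compared against the annotated labels to optimize the model parameters.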
After the feature data of the image information are obtained, the combination of the question and the picture can be completed by the two bidirectional LSTMs, and the answer output. That is, in the method provided by this embodiment, the picture data can first be preprocessed so that the corresponding regions in the picture are mapped to their corresponding classes; an attention-mechanism LSTM fuses the words with the specific regions in the picture and analyzes the content of those regions; finally, the output of the attention-mechanism LSTM is fed into the semantic LSTM, which reasons over and combines the words to generate the answer to the question.
Referring to the above step S103, the picture feature data and the question information are input simultaneously into the attention-based LSTM, which obtains the picture content information corresponding to the picture feature data by assigning attention weights. This may include: inputting the picture feature data and the question information simultaneously into the LSTM based on the attention mechanism; wherein, at each timestamp of the attention-based LSTM, the input includes the output of the semantic LSTM at the previous timestamp, each piece of region feature data, and the output of the attention-based LSTM at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data; and then obtaining the content information corresponding to each piece of region feature data based on its attention weight.
At each timestamp, combining the outputs of the two LSTMs at the previous timestamp with the extracted picture feature data, each cell of the attention LSTM produces an output; as the timestamps advance, different attention weights are assigned to all the region features. These weights are parameters to be learned. The output at each timestamp is combined with the attention weights to produce one piece of data for the semantic LSTM to process.
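One way the per-region attention weighting at a single timestamp could look is an additive-attention scoring sketch (the patent does not specify the scoring function, so `W_v`, `W_h`, and `w` are illustrative learned parameters, and all sizes are assumed):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(region_feats, h_attn, h_sem, W_v, W_h, w):
    # At each timestamp the attention LSTM sees all region features plus the
    # previous outputs of both LSTMs, and emits one attention weight per region.
    scores = np.array([
        w @ np.tanh(W_v @ v + W_h @ np.concatenate([h_attn, h_sem]))
        for v in region_feats
    ])
    alphas = softmax(scores)                        # one weight per region
    context = sum(a * v for a, v in zip(alphas, region_feats))
    return alphas, context                          # weighted picture content

rng = np.random.default_rng(2)
region_feats = [rng.normal(size=16) for _ in range(3)]  # 3 regions
h_attn, h_sem = rng.normal(size=8), rng.normal(size=8)  # previous outputs
W_v = rng.normal(size=(8, 16)); W_h = rng.normal(size=(8, 16)); w = rng.normal(size=8)
alphas, context = attend(region_feats, h_attn, h_sem, W_v, W_h, w)
print(round(float(alphas.sum()), 6), context.shape)  # 1.0 (16,)
```

The weighted `context` plays the role of the "one piece of data" passed on to the semantic LSTM each timestamp.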
Step S104 is then executed: the semantic LSTM communicates with the attention LSTM to reason over and combine the picture content information and the question information, obtaining and outputting the answer to the visual question answering task. This may include: inputting the question information and the content information corresponding to each piece of region feature data simultaneously into the semantic LSTM; wherein the input at each timestamp of the semantic LSTM includes the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp, and one word vector from the question information; computing the answer to the question based on the semantic LSTM output, feeding the result to a softmax layer, and selecting the word vector with the highest probability as the answer to the visual question answering task.
Here, the hidden-layer output of the attention LSTM contains the attention weights; it is the result of combining the feature data output by the attention LSTM at the current timestamp with the attention weights.
As for the word vectors of the question mentioned above, a detected "." indicates the end of the sentence. In essence, the question is a word vector matrix in which each vector corresponds to the one-hot encoding of one word; the word vectors are randomly generated, without pre-training.
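A minimal sketch of such randomly generated, non-pre-trained word vectors, with "." marking the end of the question (the vocabulary and the 8-dimensional embedding are illustrative):

```python
import numpy as np

def build_embeddings(vocab, dim=8, seed=0):
    # The word vectors are randomly generated, not pre-trained: each word's
    # one-hot index simply selects one row of a random embedding matrix.
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(vocab), dim))
    index = {w: i for i, w in enumerate(vocab)}
    return emb, index

vocab = ["what", "is", "on", "the", "bed", "."]   # "." marks the end of the question
emb, index = build_embeddings(vocab)

question = ["what", "is", "on", "the", "bed", "."]
vectors = [emb[index[w]] for w in question]       # one vector per semantic-LSTM timestamp
print(len(vectors), vectors[0].shape)  # 6 (8,)
```

These vectors are what the semantic LSTM consumes, one per timestamp, until the "." vector ends the sequence.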
The attention LSTM and the semantic LSTM involved in the above steps S103 and S104 are both bidirectional. As shown in Fig. 2, assuming a time series t = {1, 2, ..., n}, the workflow of the two bidirectional LSTMs may include:
S1-1: at time t1, the region feature data are input into the attention LSTM of time t1; the output of the attention LSTM at time t1 and the first word vector in the question information are then input into the semantic LSTM of time t1;
S1-2: at time t2, the outputs of the two LSTMs at time t1 and the region feature data are input into the attention LSTM of time t2; the output of the attention LSTM at time t2, the output of the semantic LSTM at time t1, and the second word vector in the question information are then input into the semantic LSTM of time t2;
...
S1-n: at time tn, the outputs of the two LSTMs at time tn-1 and the region feature data are input into the attention LSTM of time tn; the output of the attention LSTM at time tn, the output of the semantic LSTM at time tn-1, and the n-th word vector in the question information (the word vector ending with "." may arrive before time n) are then input into the semantic LSTM of time tn.
Finally, the semantic LSTM at time tn outputs the answer to the question.
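The alternation of the two networks over timestamps t = 1..n can be sketched as follows (each LSTM cell is replaced by a single tanh layer purely to keep the sketch short; a real implementation would use full LSTM gates, and all sizes are illustrative):

```python
import numpy as np

def step(cell_W, inputs):
    # Stand-in for one LSTM cell step: a single tanh layer over the
    # concatenated inputs (a real cell would apply LSTM gating here).
    return np.tanh(cell_W @ np.concatenate(inputs))

def answer_question(region_feat, word_vecs, W_attn, W_sem, dim=8):
    # Alternate the two networks over timestamps t = 1..n as in steps
    # S1-1 .. S1-n: the attention LSTM sees both previous outputs plus the
    # region features; the semantic LSTM sees the attention output, its own
    # previous output, and the t-th word vector.
    h_attn = np.zeros(dim)
    h_sem = np.zeros(dim)
    for w_t in word_vecs:                      # the "." vector ends the question
        h_attn = step(W_attn, [h_attn, h_sem, region_feat])
        h_sem = step(W_sem, [h_attn, h_sem, w_t])
    return h_sem                               # decoded into the answer downstream

rng = np.random.default_rng(3)
region_feat = rng.normal(size=16)
word_vecs = [rng.normal(size=8) for _ in range(5)]
W_attn = rng.normal(size=(8, 8 + 8 + 16)) * 0.1
W_sem = rng.normal(size=(8, 8 + 8 + 8)) * 0.1
out = answer_question(region_feat, word_vecs, W_attn, W_sem)
print(out.shape)  # (8,)
```

The final `h_sem` corresponds to the semantic-LSTM output at time tn, which is subsequently decoded into the answer.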
For example, suppose the picture information in a visual question-answering task is as shown in Fig. 3 and the question information is "What is on the bed?". The visual question-answering method provided by this embodiment may then include:
S2-1: first obtain the picture information and the question information of this visual question-answering task;
S2-2: extract the feature data from the picture using R-CNN and ResNet. R-CNN extracts several feature regions of interest from the picture, e.g. picture region 1 (the desk) and picture region 2; ResNet then extracts several pieces of region feature information within each feature region. The features in each feature region are next screened according to their overlap degree (IoU), the screened features in each region are average-pooled, and the picture feature data are thereby obtained. Both R-CNN and ResNet are pre-trained;
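The IoU screening and average pooling of step S2-2 can be sketched as below. This is a minimal stand-in, not the patent's exact procedure: the boxes, confidence scores, and the 0.7 threshold are illustrative assumptions.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def screen_and_pool(features, thresh=0.7):
    # features: list of (box, score, vector). Drop a feature when it
    # overlaps a higher-scoring kept feature above the IoU threshold,
    # then average-pool the surviving vectors into one region vector.
    kept = []
    for box, score, vec in sorted(features, key=lambda f: -f[1]):
        if all(iou(box, k[0]) < thresh for k in kept):
            kept.append((box, score, vec))
    dim = len(kept[0][2])
    return [sum(k[2][j] for k in kept) / len(kept) for j in range(dim)]

feats = [((0, 0, 2, 2), 0.9, [1.0, 3.0]),
         ((0, 0, 2, 2), 0.5, [9.0, 9.0]),   # duplicate box, lower score: screened out
         ((3, 3, 5, 5), 0.8, [3.0, 1.0])]
pooled = screen_and_pool(feats)              # average of the two kept vectors
```

Screening by IoU before pooling keeps one representative per image location, so the pooled region vector is not dominated by redundant detections of the same object.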
S2-3: input the picture feature data and the question information simultaneously into the attention LSTM to obtain the picture content information, e.g. region 1 contains a computer, a desk lamp, a bookshelf, books, and so on, while region 2 contains books, medicine, and so on. Further, weights may be assigned to the feature data within each region, e.g. the books in region 2 receive a larger weight and the medicine a smaller one. The output of the attention LSTM is combined with the attention weights to produce the final picture content information; after the different weights are assigned, the picture content information obtained from region 2 is "book";
S2-4: through the two-way communication between the semantic LSTM and the attention LSTM, the picture content information and the question information are combined by reasoning; the result of the final step of the semantic LSTM is passed to a softmax layer, and the word vector with the highest probability is output as the answer. Here the question points to region 2, so the answer "book" is output.
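The final softmax selection in S2-4 can be sketched as follows; the toy answer vocabulary and the score values are illustrative assumptions, not taken from the patent:

```python
import math

def softmax(scores):
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-step semantic-LSTM scores over a toy answer vocabulary.
vocab = ["bed", "book", "lamp", "desk"]
scores = [0.1, 2.3, 0.4, -0.5]

probs = softmax(scores)
answer = vocab[probs.index(max(probs))]    # word with the highest probability
```

With these made-up scores the softmax layer concentrates probability on "book", matching the worked example in which the question points to region 2.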
Based on the same inventive concept, as shown in Fig. 4, an embodiment of the present application further provides a visual question-answering system 400 based on multiple attention, comprising:
an information acquisition module 410, configured to obtain the picture information in a pending visual question-answering task and the question information corresponding to the picture information;
a picture feature extraction module 420, configured to extract the picture feature data from the picture information;
a picture content acquisition module 430, configured to input the picture feature data and the question information simultaneously into the attention-based LSTM and to obtain, via the assigned attention weights, the picture content information corresponding to the picture feature data;
an answer output module 440, configured to make the semantic LSTM communicate with the attention LSTM so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the pending visual question-answering task.
In an optional embodiment of the present application, the picture feature extraction module 420 is configured to:
extract at least one feature region from the picture information using a region convolutional neural network (R-CNN); extract at least one piece of region feature information within each feature region using ResNet; and, for each feature region, screen the region feature information of the feature region according to the overlap degree (IoU), average-pool the screened region feature information, and thereby obtain the region feature data of each feature region.
In an optional embodiment of the present application, as shown in Fig. 5, the above system may further include:
a pre-training module 450, configured to pre-train the R-CNN and/or ResNet on a preset dataset: each region feature of a picture contained in the preset dataset is fused with a vector representing its true class, and the fused vector is passed to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
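A minimal sketch of the pre-training module's fuse-then-classify step; the concatenation used as the fusion operator, the layer sizes, and the weight values are all illustrative assumptions (the patent fixes neither the fusion operator nor the layer shapes):

```python
import math

def fuse(region_feature, class_vector):
    # Fusion by concatenation: one common choice, assumed here.
    return region_feature + class_vector

def fully_connected(vec, weights, bias):
    # One dense layer: out[i] = sum_j vec[j] * weights[i][j] + bias[i]
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative numbers: a 2-dim region feature fused with a 2-dim
# true-class vector, classified into {attribute, non-attribute}.
fused = fuse([0.5, 1.0], [0.0, 1.0])          # 4-dim fused vector
weights = [[0.2, 0.1, 0.0, 0.3],              # hypothetical FC weights
           [0.1, 0.0, 0.2, 0.1]]
bias = [0.0, 0.1]
probs = softmax(fully_connected(fused, weights, bias))
```

The design point is that the true-class vector travels with the region feature through the output layer, so the classifier is trained to separate attribute from non-attribute regions with the label information fused in.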
In an optional embodiment of the present application, the picture content acquisition module 430 is configured to:
input the picture feature data into the attention-based LSTM, where at each timestamp of the attention LSTM the input comprises the output of the semantic LSTM at the previous timestamp, each region feature datum, and the output of the attention LSTM at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
and obtain, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
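The per-region attention weighting this module describes can be sketched as a softmax over region scores followed by a weighted combination; the dot-product scoring below is an illustrative stand-in for the attention LSTM's internal computation:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_feats, hidden):
    # Score each region against the current hidden state (dot product as
    # an illustrative stand-in), normalize the scores into attention
    # weights, then form the weighted content vector over all regions.
    scores = [sum(r * h for r, h in zip(region, hidden))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    content = [sum(w * region[j] for w, region in zip(weights, region_feats))
               for j in range(dim)]
    return weights, content

regions = [[1.0, 0.0], [0.0, 1.0]]   # two toy region-feature vectors
weights, content = attend(regions, [2.0, 0.0])
```

The hidden state here aligns with the first region, so that region receives the larger weight and dominates the content vector, mirroring how the books in region 2 come to dominate the picture content information in the worked example.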
In an optional embodiment of the present application, the answer output module 440 is configured to:
input the question information and the content information corresponding to each region feature datum simultaneously into the semantic LSTM, where the input at each timestamp of the semantic LSTM comprises the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp, and one word vector from the question information;
and output, based on the semantic LSTM, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is output as the answer to the pending visual question-answering task.
The embodiments of the present application combine the two major fields of computer vision (CV) and natural language processing (NLP) and provide a visual question-answering method and system based on multiple attention. In the method provided by the embodiments, the picture information and the corresponding question information of a pending visual question-answering task are first obtained and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, the picture content information is obtained via the assigned attention weights, the question and the picture are combined through the two-way interaction of the two long short-term memory networks, and the answer to the pending visual question-answering task is output.
In the visual question-answering method and system based on multiple attention provided by the embodiments of the present application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers. The end-to-end memory network fuses prior knowledge relevant to the question while extracting the question information, improving the overall accuracy of question answering.
From the following detailed description of specific embodiments of the application with reference to the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present application.
An embodiment of the present application further provides a computing device. Referring to Fig. 6, the computing device includes a memory 620, a processor 610, and a computer program stored in the memory 620 and executable by the processor 610; the computer program is stored in a space 630 for program code in the memory 620, and the computer program 631, when executed by the processor 610, implements any one of the steps of the method according to the invention.
An embodiment of the present application further provides a computer-readable storage medium. Referring to Fig. 7, the computer-readable storage medium includes a storage unit for program code, the storage unit being provided with a program 631' for executing the steps of the method according to the invention, the program being executed by a processor.
An embodiment of the present application further provides a computer program product comprising instructions which, when the computer program product is run on a computer, cause the computer to execute the steps of the method according to the invention.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer loads and executes the computer instructions, the processes or functions described in the embodiments of the present application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another by wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. a solid-state disk (SSD)).
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations shall not be considered as going beyond the scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the above method embodiments can be completed by a program instructing a processor; the program can be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random-access memory, read-only memory, flash memory, hard disk, solid-state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The above is only a preferred specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can easily be conceived by any person skilled in the art within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A visual question-answering method based on multiple attention, comprising:
obtaining picture information in a pending visual question-answering task and question information corresponding to the picture information;
extracting picture feature data from the picture information;
inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining, via assigned attention weights, picture content information corresponding to the picture feature data;
making a second long short-term memory network, which performs semantic analysis, communicate with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the pending visual question-answering task.
2. The method according to claim 1, wherein extracting the picture feature data from the picture information comprises:
extracting at least one feature region from the picture information using a region convolutional neural network (R-CNN);
extracting at least one piece of region feature information within each feature region using ResNet;
for each feature region, screening the region feature information of the feature region according to an overlap degree (IoU), average-pooling the screened region feature information, and thereby obtaining the region feature data of each feature region.
3. The method according to claim 2, wherein, before extracting the picture feature data from the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet on a preset dataset;
fusing each region feature of a picture contained in the preset dataset with a vector representing its true class;
passing the fused vector to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
4. The method according to claim 2, wherein inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism and obtaining, via the assigned attention weights, the picture content information corresponding to the picture feature data comprises:
inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism, wherein, at each timestamp of the first long short-term memory network, the input comprises the output of the second long short-term memory network at the previous timestamp, each region feature datum, and the output of the first long short-term memory network at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
obtaining, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
5. The method according to claim 4, wherein making the second long short-term memory network performing semantic analysis communicate with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the pending visual question-answering task comprises:
inputting the question information and the content information corresponding to each region feature datum simultaneously into the second long short-term memory network performing semantic analysis, wherein the input at each timestamp of the second long short-term memory network comprises the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
outputting, based on the second long short-term memory network, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer to the pending visual question-answering task.
6. A visual question-answering system based on multiple attention, comprising:
an information acquisition module, configured to obtain picture information in a pending visual question-answering task and question information corresponding to the picture information;
a picture feature extraction module, configured to extract picture feature data from the picture information;
a picture content acquisition module, configured to input the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism and to obtain, via assigned attention weights, picture content information corresponding to the picture feature data;
an answer output module, configured to make a second long short-term memory network, which performs semantic analysis, communicate with the first long short-term memory network so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the pending visual question-answering task.
7. The system according to claim 6, wherein the picture feature extraction module is configured to:
extract at least one feature region from the picture information using a region convolutional neural network (R-CNN);
extract at least one piece of region feature information within each feature region using ResNet;
for each feature region, screen the region feature information of the feature region according to an overlap degree (IoU), average-pool the screened region feature information, and thereby obtain the region feature data of each feature region.
8. The system according to claim 6, further comprising:
a pre-training module, configured to pre-train the R-CNN and/or ResNet on a preset dataset, fuse each region feature of a picture contained in the preset dataset with a vector representing its true class, and pass the fused vector to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
9. The system according to claim 6, wherein the picture content acquisition module is configured to:
input the picture feature data into the first long short-term memory network based on the attention mechanism, wherein, at each timestamp of the first long short-term memory network, the input comprises the output of the second long short-term memory network at the previous timestamp, each region feature datum, and the output of the first long short-term memory network at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
obtain, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
10. The system according to claim 6, wherein the answer output module is configured to:
input the question information and the content information corresponding to each region feature datum simultaneously into the second long short-term memory network performing semantic analysis, wherein the input at each timestamp of the second long short-term memory network comprises the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
output, based on the second long short-term memory network, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer to the pending visual question-answering task.
CN201910770172.XA 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention Active CN110516791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Publications (2)

Publication Number Publication Date
CN110516791A true CN110516791A (en) 2019-11-29
CN110516791B CN110516791B (en) 2022-04-22

Family

ID=68627077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770172.XA Active CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Country Status (1)

Country Link
CN (1) CN110516791B (en)



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
CN107076567A (en) * 2015-05-21 2017-08-18 百度(美国)有限责任公司 Multilingual image question and answer
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
US20190130206A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. Interpretable counting in visual question answering
CN108170816A (en) * 2017-12-31 2018-06-15 厦门大学 A kind of intelligent vision Question-Answering Model based on deep neural network
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108829756A (en) * 2018-05-25 2018-11-16 杭州知智能科技有限公司 A method of more wheel video question and answer are solved using layering attention context network
CN109766427A (en) * 2019-01-15 2019-05-17 重庆邮电大学 A kind of collaborative virtual learning environment intelligent answer method based on stacking Bi-LSTM network and collaboration attention
CN109857909A (en) * 2019-01-22 2019-06-07 杭州一知智能科技有限公司 The method that more granularity convolution solve video conversation task from attention context network
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 The method for solving video question-answering task using the progressive space-time attention network of knowledge base
CN109902164A (en) * 2019-03-06 2019-06-18 杭州一知智能科技有限公司 It is two-way from the method for noticing that network solves open long format video question and answer using convolution
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110110043A (en) * 2019-04-11 2019-08-09 中山大学 A kind of multi-hop visual problem inference pattern and its inference method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Z. YANG ET AL.: "Stacked Attention Networks for Image Question Answering", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
WANG Yilei et al.: "Question Answering Algorithm for Fragmented Image Information Based on Deep Neural Networks", Journal of Computer Research and Development *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032535A (en) * 2019-12-24 2021-06-25 中国移动通信集团浙江有限公司 Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN113590770A (en) * 2020-04-30 2021-11-02 北京京东乾石科技有限公司 Point cloud data-based response method, device, equipment and storage medium
CN113590770B (en) * 2020-04-30 2024-03-08 北京京东乾石科技有限公司 Response method, device, equipment and storage medium based on point cloud data
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112559877A (en) * 2020-12-24 2021-03-26 齐鲁工业大学 CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN113010712A (en) * 2021-03-04 2021-06-22 天津大学 Visual question answering method based on multi-graph fusion
CN113283246A (en) * 2021-06-15 2021-08-20 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113283246B (en) * 2021-06-15 2024-01-30 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113590879A (en) * 2021-08-05 2021-11-02 哈尔滨理工大学 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881427A (en) * 2023-09-05 2023-10-13 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium
CN116881427B (en) * 2023-09-05 2023-12-01 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110516791B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN110516791A (en) A kind of vision answering method and system based on multiple attention
CN106982359B (en) Binocular video monitoring method and system and computer readable storage medium
Rohrbeck Trend scanning, scouting and foresight techniques
CN110032630A (en) Talk about art recommendation apparatus, method and model training equipment
CN110222171A (en) A kind of application of disaggregated model, disaggregated model training method and device
Wang et al. Learning performance prediction via convolutional GRU and explainable neural networks in e-learning environments
CN110139067A (en) A kind of wild animal monitoring data management information system
Shahaf et al. Information cartography
Gambo et al. An artificial neural network (ann)-based learning agent for classifying learning styles in self-regulated smart learning environment
Liu et al. Hybrid design for sports data visualization using AI and big data analytics
CN116741411A (en) Intelligent health science popularization recommendation method and system based on medical big data analysis
Crockett et al. A fuzzy model for predicting learning styles using behavioral cues in an conversational intelligent tutoring system
Wang Exploration of data mining algorithms of an online learning behaviour log based on cloud computing
CN108604313A (en) The predictive modeling of automation and frame
CN115840796A (en) Event integration method, device, equipment and computer readable storage medium
Thoring et al. Toward a unified model of design knowledge
Sullivan et al. The mathematical corporation: Where machine intelligence and human ingenuity achieve the impossible
CN110908919A (en) Response test system based on artificial intelligence and application thereof
Obaid et al. Data-mining based novel neural-networks-hierarchical attention structures for obtaining an optimal efficiency
Tanna Decision support system for admission in engineering colleges based on entrance exam marks
Chinnasami Sivaji et al. A Review on Weight Process Method and its Classification
CN113761337B (en) Event prediction method and device based on implicit event element and explicit connection
Cheregi et al. Branding Romania in the age of disruption: Technology as a soft power instrument
CN113821610A (en) Information matching method, device, equipment and storage medium
Kersten Leveling the field: Talking levels in cognitive science

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Visual Q&A Method and System Based on Multiple Attention

Effective date of registration: 20230713

Granted publication date: 20220422

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278