CN110516791A - Visual question answering method and system based on multiple attention - Google Patents

Visual question answering method and system based on multiple attention

Info

Publication number
CN110516791A
CN110516791A (application CN201910770172.XA)
Authority
CN
China
Prior art keywords
term
information
short
answer
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910770172.XA
Other languages
Chinese (zh)
Other versions
CN110516791B (en)
Inventor
刘伟 (Liu Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingpu Technology Co Ltd
Original Assignee
Beijing Yingpu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingpu Technology Co Ltd
Priority claimed from application CN201910770172.XA
Publication of CN110516791A
Application granted
Publication of CN110516791B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a visual question answering method and system based on multiple attention. In the method provided by this application, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task. Based on the visual question answering method and system based on multiple attention provided by this application, memory modules are added to the computation of the R-CNN network to enrich the knowledge sources of the model during training and to generate more diverse and reasonable answers; question-related prior knowledge is fused while the memory networks extract the question information end to end, improving the overall accuracy of question answering.

Description

Visual question answering method and system based on multiple attention
Technical field
This application relates to the field of visual question answering, and in particular to a visual question answering method and system based on multiple attention.
Background technique
Visual question answering is a learning task involving both computer vision and natural language processing: given an input picture and a question, the computer must learn to output an answer that follows natural language rules and is logically consistent with the picture content. Depending on the question, the model may need to focus on only certain parts of the picture, and some questions require commonsense reasoning before an answer can be obtained. Visual question answering therefore demands a deeper semantic understanding of the image than general image captioning, and faces correspondingly greater challenges.
Existing models in the visual question answering field include the Deeper LSTM Q+norm I model and the VIS+LSTM model. Apart from achieving relatively high accuracy on simple questions with a single answer, these models are generally inaccurate in other respects: their structures are relatively simple, the content and form of their answers are limited, and they cannot correctly answer slightly more complex questions that require more prior knowledge and simple reasoning.
Summary of the invention
This application aims to overcome, or at least partially solve or mitigate, the above problems.
According to one aspect of this application, a visual question answering method based on multiple attention is provided, comprising:
obtaining the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
extracting the picture feature data from the picture information;
inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights;
connecting a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed.
Optionally, extracting the picture feature data from the picture information comprises:
extracting at least one feature region from the picture information using a region-based convolutional neural network (R-CNN);
extracting at least one piece of region feature information from each feature region through ResNet;
for each feature region, filtering the region feature information of that region according to the intersection over union (IoU), applying average pooling to the filtered region feature information, and thereby obtaining the region feature data of each feature region.
Optionally, before extracting the picture feature data from the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet on a preset data set;
fusing each region feature of a picture contained in the preset data set with the vector representing its true class;
passing the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
Optionally, inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights, comprises:
inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism; wherein, at each timestamp of the first long short-term memory network, the input includes the output of the second long short-term memory network at the previous timestamp, each piece of region feature data, and the output of the first long short-term memory network at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data;
obtaining the content information corresponding to each piece of region feature data based on its attention weight.
Optionally, connecting the second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed, comprises:
inputting the question information and the content information corresponding to each piece of region feature data simultaneously into the second long short-term memory network that performs semantic analysis; wherein the input at each timestamp of the second long short-term memory network includes the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
computing the answer to the question based on the output of the second long short-term memory network, outputting the result to a softmax layer, and selecting the word vector with the highest probability as the answer to the visual question answering task to be processed.
According to a further aspect of this application, a visual question answering system based on multiple attention is provided, comprising:
an information obtaining module, configured to obtain the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
a picture feature extraction module, configured to extract the picture feature data from the picture information;
a picture content obtaining module, configured to input the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and to obtain the picture content information corresponding to the picture feature data by assigning attention weights;
an answer output module, configured to connect a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the visual question answering task to be processed.
Optionally, the picture feature extraction module is configured to:
extract at least one feature region from the picture information using a region-based convolutional neural network (R-CNN);
extract at least one piece of region feature information from each feature region through ResNet;
for each feature region, filter the region feature information of that region according to the intersection over union (IoU), apply average pooling to the filtered region feature information, and thereby obtain the region feature data of each feature region.
Optionally, the system further comprises:
a pre-training module, configured to pre-train the R-CNN and/or ResNet on a preset data set; to fuse each region feature of a picture contained in the preset data set with the vector representing its true class; and to pass the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
Optionally, the picture content obtaining module is configured to:
input the picture feature data into the first long short-term memory network based on the attention mechanism; wherein, at each timestamp of the first long short-term memory network, the input includes the output of the second long short-term memory network at the previous timestamp, each piece of region feature data, and the output of the first long short-term memory network at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data;
obtain the content information corresponding to each piece of region feature data based on its attention weight.
Optionally, the answer output module is configured to:
input the question information and the content information corresponding to each piece of region feature data simultaneously into the second long short-term memory network that performs semantic analysis; wherein the input at each timestamp of the second long short-term memory network includes the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
compute the answer to the question based on the output of the second long short-term memory network, output the result to a softmax layer, and select the word vector with the highest probability as the answer to the visual question answering task to be processed.
This application provides a visual question answering method and system based on multiple attention. In the method provided by this application, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task.
Based on the visual question answering method and system based on multiple attention provided by this application, memory modules are added to the computation of the R-CNN network to enrich the knowledge sources of the model during training and to generate more diverse and reasonable answers; question-related prior knowledge is fused while the memory networks extract the question information end to end, improving the overall accuracy of question answering.
From the following detailed description of specific embodiments of this application, taken together with the accompanying drawings, the above and other objects, advantages, and features of this application will become clearer to those skilled in the art.
Detailed description of the invention
Some specific embodiments of this application are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. In the drawings, identical reference numerals denote identical or similar parts. Those skilled in the art should appreciate that these drawings are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flow diagram of the visual question answering method based on multiple attention according to an embodiment of this application;
Fig. 2 is a schematic diagram of the workflow of the two bidirectional LSTMs according to an embodiment of this application;
Fig. 3 is a schematic diagram of the picture information in a visual question answering example according to an embodiment of this application;
Fig. 4 is a schematic structural diagram of the visual question answering system based on multiple attention according to an embodiment of this application;
Fig. 5 is a schematic structural diagram of the visual question answering system based on multiple attention according to a preferred embodiment of this application;
Fig. 6 is a schematic diagram of a computing device according to an embodiment of this application;
Fig. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of this application.
Specific embodiment
Fig. 1 is a flow diagram of the visual question answering method based on multiple attention according to an embodiment of this application. As can be seen from Fig. 1, the visual question answering method based on multiple attention provided by this embodiment may include:
Step S101: obtaining the picture information in a visual question answering task to be processed and the question information corresponding to the picture information;
Step S102: extracting the picture feature data from the picture information;
Step S103: inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining the picture content information corresponding to the picture feature data by assigning attention weights;
Step S104: connecting a second long short-term memory network that performs semantic analysis with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the visual question answering task to be processed.
This embodiment of the application provides a visual question answering method based on multiple attention. In this method, the picture information in a visual question answering task to be processed and the corresponding question information are first obtained, and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, which obtains the picture content information by assigning attention weights; finally, two bidirectional long short-term memory networks combine the question with the picture and output the answer to the visual question answering task.
A long short-term memory network (Long Short-Term Memory, abbreviated LSTM) is a type of recurrent neural network suited to processing and predicting events in a time series that are separated by relatively long intervals and delays. Systems based on LSTMs can learn tasks such as translating languages, controlling robots, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, disease prediction, click-through-rate and stock prediction, and music composition.
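As a minimal, self-contained sketch of the LSTM cell this method builds on (these are the standard gate equations, not anything specific to this patent; all dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # One step of a standard LSTM cell: forget (f), input (i), and output (o)
    # gates control what the cell state c forgets, adds, and exposes as h.
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(4)
dim_x, dim_h = 4, 3
W = rng.normal(size=(4 * dim_h, dim_x + dim_h)) * 0.1
b = np.zeros(4 * dim_h)
h = c = np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):   # process a 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (3,)
```

The gating is what lets the cell retain information across the long intervals mentioned above: the forget gate can keep the cell state nearly unchanged across many steps.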
Take the traditional visual question answering model Deeper LSTM Q+norm I as an example, where I denotes the extracted picture feature and norm I indicates that the 1024-dimensional pixel-semantic vector extracted by a CNN is L2-normalized. The image semantic information is extracted by a CNN, the textual semantic information contained in the question is obtained by an LSTM, and the data from the two networks are fused so that the model learns the meaning of the question; the fused data are finally fed into a multilayer perceptron (MLP) with a softmax output layer that generates the answer. For instance, the input may be an outdoor-scene image containing several objects and two people: the image is processed by a CNN without a classification layer, while the words of the question are fed in order into an RNN that extracts the question information; the two compressed representations are then fused and sent into the MLP to produce the result (for example, when the current question is a counting question). Specifically, this model encodes the question with a two-layer LSTM, divides the image into regions with a VGGNet model, and L2-normalizes the image features. The image and question features are then transformed into the same feature space and fused by dot product, and the fused information is fed into a three-layer MLP with softmax as the classifier to generate the answer. During training, the pre-trained CNN is frozen; only the LSTM layers and the final classification network participate in training.
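The fusion scheme just described — L2-normalizing the CNN image feature, projecting both modalities into a common feature space, and merging them by elementwise product — can be sketched as follows (a minimal NumPy illustration; the projection matrices and the 256-dimensional common space are assumptions for the example, not values from the model):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale the vector to unit L2 norm ("norm I"), guarding against division by zero.
    return v / (np.linalg.norm(v) + eps)

def fuse_features(img_feat, q_feat, W_img, W_q):
    # Project both modalities into the same feature space, then merge
    # them by elementwise (Hadamard) product, as the model does.
    img_common = W_img @ l2_normalize(img_feat)
    q_common = W_q @ q_feat
    return img_common * q_common

rng = np.random.default_rng(0)
img_feat = rng.normal(size=1024)   # 1024-d CNN pixel-semantic vector
q_feat = rng.normal(size=512)      # question encoding from the LSTM (size assumed)
W_img = rng.normal(size=(256, 1024))
W_q = rng.normal(size=(256, 512))

fused = fuse_features(img_feat, q_feat, W_img, W_q)
print(fused.shape)  # (256,)
```

The fused vector would then go into the three-layer MLP classifier described above.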
The basic structure of the VIS+LSTM model is to first extract the picture information using a CNN, followed by an LSTM that generates the prediction. However, considering that there is no complete standard for evaluating the accuracy of answer sentences, its authors restricted their attention to limited-domain questions whose answers are a single word, so that visual question answering can be treated as a multi-class classification problem and answers can be measured with existing accuracy evaluation criteria.
The overall accuracy of the models mentioned above is not high, and the content and form of their answers are relatively simple.
In this embodiment, the first long short-term memory network based on the attention mechanism (hereinafter simply the attention LSTM) mainly trains a model to learn selectively from the input sequence and, when producing output, to selectively focus on the relevant parts of the input. The second long short-term memory network that performs semantic analysis (hereinafter simply the semantic LSTM) mainly mines and learns the deeper concepts in the text and the picture, and can avoid the vanishing gradient problem. Both the attention LSTM and the semantic LSTM are preferably bidirectional LSTMs; they communicate and cooperate with each other to combine the picture and the question in the visual question answering task, so as to output the answer accurately and quickly.
Normally, in visual question answering, a picture and a natural language question are input first, the picture information is then attended to according to the question, and a natural language answer is generated as output. Therefore, when solving a visual question answering task, the above step S101 is performed first to obtain the picture information and the question information.
Next, step S102 can be performed to analyze the picture information and extract its feature data. Optionally, this may include: extracting at least one feature region from the picture information using R-CNN; extracting at least one piece of region feature information from each feature region through ResNet; and then, for each feature region, filtering the region feature information of that region according to the IoU, applying average pooling to the filtered region feature information, and thereby obtaining the feature data of each feature region.
The role of R-CNN is to find the feature regions of interest in the picture information; ResNet is then used to extract the region feature information from the extracted feature regions.
The full name of R-CNN is Region-CNN; it was the first algorithm to successfully apply deep learning to object detection. R-CNN implements object detection based on convolutional neural networks (CNNs), linear regression, support vector machines (SVMs), and related algorithms.
ResNet, also called a deep residual network, differs from ordinary networks in that it introduces skip connections, which allow the information of one residual block to flow unimpeded into the next. This improves the flow of information and also avoids the vanishing gradient and degradation problems caused by overly deep networks.
After the region feature information of each feature region is extracted, the region feature information within each feature region can be filtered according to the IoU. IoU, short for Intersection over Union, is a standard for measuring the accuracy of detecting a target object in a particular data set. By comparing the IoU value against a preset threshold, the feature data in a region are filtered, further condensing the picture information. Finally, the filtered features are average-pooled into a convolutional feature representation, yielding the region feature data of each feature region; in addition, these can be concatenated to obtain a feature concatenation map of the picture information in the question answering system, which serves as the output of the R-CNN.
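The IoU-based filtering and average pooling described above can be sketched as follows (a minimal NumPy illustration; the (x1, y1, x2, y2) box format, the 0.5 threshold, and the feature sizes are assumptions for the example, not values fixed by the patent):

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). IoU = intersection area / union area.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def region_feature(region_box, boxes, feats, iou_thresh=0.5):
    # Keep only the features whose boxes overlap the region strongly enough,
    # then average-pool the kept features into one region feature vector.
    kept = [f for b, f in zip(boxes, feats) if iou(region_box, b) >= iou_thresh]
    return np.mean(kept, axis=0) if kept else np.zeros_like(feats[0])

region = (0, 0, 10, 10)
boxes = [(0, 0, 10, 10), (1, 1, 9, 9), (20, 20, 30, 30)]   # last box does not overlap
feats = [np.ones(4), 3 * np.ones(4), 100 * np.ones(4)]
print(region_feature(region, boxes, feats))  # [2. 2. 2. 2.]
```

The per-region vectors produced this way could then be concatenated into the feature map that serves as the R-CNN output.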
In an alternative embodiment of the present invention, before feature extraction with R-CNN and ResNet, the R-CNN and ResNet can first be pre-trained, and the regions of interest fused with their possible classes. Specifically, this may include: pre-training the R-CNN and/or ResNet on a preset data set; fusing each region feature of a picture contained in the preset data set with the vector representing its true class; and passing the fused vector to the fully connected layer of the R-CNN and/or ResNet, whose output is classified by softmax into attribute classes and non-attribute classes.
In this embodiment, pre-training the R-CNN and ResNet optimizes the model parameters so that the model can find regions of interest. The preset data set can preferably be COCO, short for Common Objects in COntext, a data set usable for image recognition. The images in the MS COCO data set are divided into training, validation, and test sets; COCO collects images by searching a search engine for 80 object categories and various scene types. The COCO data set currently has three annotation types: object instances, object keypoints, and image captions, stored as JSON files. Compared with existing models such as the Deeper LSTM Q+norm I model and the VIS+LSTM model, this yields higher accuracy in question answering.
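The pre-training step — fusing each region feature with the vector representing its true class, passing the fused vector through a fully connected layer, and classifying with separate attribute and non-attribute softmax heads — can be sketched like this (using concatenation as the fusion operator; the layer sizes and the 40 attribute classes are illustrative assumptions, since the patent does not fix them, while the 80 classes follow the COCO category count):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pretrain_forward(region_feat, true_class_vec, W_fc, W_attr, W_nonattr):
    # Fuse the region feature with the true-class vector (here by
    # concatenation), pass it through a fully connected layer, and
    # classify with two softmax heads: attribute and non-attribute classes.
    fused = np.concatenate([region_feat, true_class_vec])
    hidden = np.tanh(W_fc @ fused)
    return softmax(W_attr @ hidden), softmax(W_nonattr @ hidden)

rng = np.random.default_rng(1)
region_feat = rng.normal(size=2048)       # ResNet region feature (size assumed)
true_class_vec = np.eye(80)[3]            # one-hot over the 80 COCO categories
W_fc = rng.normal(size=(512, 2048 + 80)) * 0.01
W_attr = rng.normal(size=(40, 512)) * 0.01      # 40 attribute classes (assumed)
W_nonattr = rng.normal(size=(80, 512)) * 0.01
p_attr, p_nonattr = pretrain_forward(region_feat, true_class_vec, W_fc, W_attr, W_nonattr)
print(p_attr.shape, round(float(p_attr.sum()), 6))  # (40,) 1.0
```

During pre-training, the two softmax outputs would be compared against the annotated labels to optimize the model parameters.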
After the feature data of the image information are obtained, the combination of the question and the picture can be completed by the two bidirectional LSTMs, and the answer output. That is, in the method provided by this embodiment, the picture data can first be preprocessed so that the corresponding regions in the picture are mapped to their corresponding classes; an attention-mechanism LSTM fuses the words with the specific regions in the picture and analyzes the content of those regions; finally, the output of the attention-mechanism LSTM is fed into the semantic LSTM, which reasons over and combines the words to generate the answer to the question.
Referring to the above step S103, the picture feature data and the question information are input simultaneously into the attention-based LSTM, which obtains the picture content information corresponding to the picture feature data by assigning attention weights. This may include: inputting the picture feature data and the question information simultaneously into the LSTM based on the attention mechanism; wherein, at each timestamp of the attention-based LSTM, the input includes the output of the semantic LSTM at the previous timestamp, each piece of region feature data, and the output of the attention-based LSTM at the previous timestamp, and the output includes an attention weight assigned to each piece of region feature data; and then obtaining the content information corresponding to each piece of region feature data based on its attention weight.
At each timestamp, combining the outputs of the two LSTMs at the previous timestamp with the extracted picture feature data, each cell of the attention LSTM produces an output; as the timestamps advance, different attention weights are assigned to all the region features. These weights are parameters to be learned. The output at each timestamp is combined with the attention weights to produce one piece of data for the semantic LSTM to process.
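One way the per-region attention weighting at a single timestamp could look is an additive-attention scoring sketch (the patent does not specify the scoring function, so `W_v`, `W_h`, and `w` are illustrative learned parameters, and all sizes are assumed):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(region_feats, h_attn, h_sem, W_v, W_h, w):
    # At each timestamp the attention LSTM sees all region features plus the
    # previous outputs of both LSTMs, and emits one attention weight per region.
    scores = np.array([
        w @ np.tanh(W_v @ v + W_h @ np.concatenate([h_attn, h_sem]))
        for v in region_feats
    ])
    alphas = softmax(scores)                        # one weight per region
    context = sum(a * v for a, v in zip(alphas, region_feats))
    return alphas, context                          # weighted picture content

rng = np.random.default_rng(2)
region_feats = [rng.normal(size=16) for _ in range(3)]  # 3 regions
h_attn, h_sem = rng.normal(size=8), rng.normal(size=8)  # previous outputs
W_v = rng.normal(size=(8, 16)); W_h = rng.normal(size=(8, 16)); w = rng.normal(size=8)
alphas, context = attend(region_feats, h_attn, h_sem, W_v, W_h, w)
print(round(float(alphas.sum()), 6), context.shape)  # 1.0 (16,)
```

The weighted `context` plays the role of the "one piece of data" passed on to the semantic LSTM each timestamp.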
Step S104 is then executed: the semantic LSTM communicates with the attention LSTM to reason over and combine the picture content information and the question information, obtaining and outputting the answer to the visual question answering task. This may include: inputting the question information and the content information corresponding to each piece of region feature data simultaneously into the semantic LSTM; wherein the input at each timestamp of the semantic LSTM includes the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp, and one word vector from the question information; computing the answer to the question based on the semantic LSTM output, feeding the result to a softmax layer, and selecting the word vector with the highest probability as the answer to the visual question answering task.
Here, the hidden-layer output of the attention LSTM contains the attention weights; it is the result of combining the feature data output by the attention LSTM at the current timestamp with the attention weights.
As for the word vectors of the question mentioned above, a detected "." indicates the end of the sentence. In essence, the question is a word vector matrix in which each vector corresponds to the one-hot encoding of one word; the word vectors are randomly generated, without pre-training.
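A minimal sketch of such randomly generated, non-pre-trained word vectors, with "." marking the end of the question (the vocabulary and the 8-dimensional embedding are illustrative):

```python
import numpy as np

def build_embeddings(vocab, dim=8, seed=0):
    # The word vectors are randomly generated, not pre-trained: each word's
    # one-hot index simply selects one row of a random embedding matrix.
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(vocab), dim))
    index = {w: i for i, w in enumerate(vocab)}
    return emb, index

vocab = ["what", "is", "on", "the", "bed", "."]   # "." marks the end of the question
emb, index = build_embeddings(vocab)

question = ["what", "is", "on", "the", "bed", "."]
vectors = [emb[index[w]] for w in question]       # one vector per semantic-LSTM timestamp
print(len(vectors), vectors[0].shape)  # 6 (8,)
```

These vectors are what the semantic LSTM consumes, one per timestamp, until the "." vector ends the sequence.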
The attention LSTM and the semantic LSTM involved in the above steps S103 and S104 are both bidirectional. As shown in Fig. 2, assuming a time series t = {1, 2, ..., n}, the workflow of the two bidirectional LSTMs may include:
S1-1: at time t1, the region feature data are input into the attention LSTM of time t1; the output of the attention LSTM at time t1 and the first word vector in the question information are then input into the semantic LSTM of time t1;
S1-2: at time t2, the outputs of the two LSTMs at time t1 and the region feature data are input into the attention LSTM of time t2; the output of the attention LSTM at time t2, the output of the semantic LSTM at time t1, and the second word vector in the question information are then input into the semantic LSTM of time t2;
...
S1-n: at time tn, the outputs of the two LSTMs at time tn-1 and the region feature data are input into the attention LSTM of time tn; the output of the attention LSTM at time tn, the output of the semantic LSTM at time tn-1, and the n-th word vector in the question information (the word vector ending with "." may arrive before time n) are then input into the semantic LSTM of time tn.
Finally, the semantic LSTM at time tn outputs the answer to the question.
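The alternation of the two networks over timestamps t = 1..n can be sketched as follows (each LSTM cell is replaced by a single tanh layer purely to keep the sketch short; a real implementation would use full LSTM gates, and all sizes are illustrative):

```python
import numpy as np

def step(cell_W, inputs):
    # Stand-in for one LSTM cell step: a single tanh layer over the
    # concatenated inputs (a real cell would apply LSTM gating here).
    return np.tanh(cell_W @ np.concatenate(inputs))

def answer_question(region_feat, word_vecs, W_attn, W_sem, dim=8):
    # Alternate the two networks over timestamps t = 1..n as in steps
    # S1-1 .. S1-n: the attention LSTM sees both previous outputs plus the
    # region features; the semantic LSTM sees the attention output, its own
    # previous output, and the t-th word vector.
    h_attn = np.zeros(dim)
    h_sem = np.zeros(dim)
    for w_t in word_vecs:                      # the "." vector ends the question
        h_attn = step(W_attn, [h_attn, h_sem, region_feat])
        h_sem = step(W_sem, [h_attn, h_sem, w_t])
    return h_sem                               # decoded into the answer downstream

rng = np.random.default_rng(3)
region_feat = rng.normal(size=16)
word_vecs = [rng.normal(size=8) for _ in range(5)]
W_attn = rng.normal(size=(8, 8 + 8 + 16)) * 0.1
W_sem = rng.normal(size=(8, 8 + 8 + 8)) * 0.1
out = answer_question(region_feat, word_vecs, W_attn, W_sem)
print(out.shape)  # (8,)
```

The final `h_sem` corresponds to the semantic-LSTM output at time tn, which is subsequently decoded into the answer.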
For example, suppose the picture information in a visual question-answering task is as shown in Fig. 3 and the question information is "What is on the bed?". The visual question-answering method provided by this embodiment may then include:
S2-1: first obtain the picture information and the question information of this visual question-answering task;
S2-2: extract the feature data from the picture using R-CNN and ResNet. R-CNN extracts several feature regions of interest from the picture, e.g. picture region 1 (the desk) and picture region 2; ResNet then extracts several pieces of region feature information within each feature region. The features in each feature region are next screened according to their overlap degree (IoU), the screened features in each region are average-pooled, and the picture feature data are thereby obtained. Both R-CNN and ResNet are pre-trained;
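The IoU screening and average pooling of step S2-2 can be sketched as below. This is a minimal stand-in, not the patent's exact procedure: the boxes, confidence scores, and the 0.7 threshold are illustrative assumptions.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def screen_and_pool(features, thresh=0.7):
    # features: list of (box, score, vector). Drop a feature when it
    # overlaps a higher-scoring kept feature above the IoU threshold,
    # then average-pool the surviving vectors into one region vector.
    kept = []
    for box, score, vec in sorted(features, key=lambda f: -f[1]):
        if all(iou(box, k[0]) < thresh for k in kept):
            kept.append((box, score, vec))
    dim = len(kept[0][2])
    return [sum(k[2][j] for k in kept) / len(kept) for j in range(dim)]

feats = [((0, 0, 2, 2), 0.9, [1.0, 3.0]),
         ((0, 0, 2, 2), 0.5, [9.0, 9.0]),   # duplicate box, lower score: screened out
         ((3, 3, 5, 5), 0.8, [3.0, 1.0])]
pooled = screen_and_pool(feats)              # average of the two kept vectors
```

Screening by IoU before pooling keeps one representative per image location, so the pooled region vector is not dominated by redundant detections of the same object.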
S2-3: input the picture feature data and the question information simultaneously into the attention LSTM to obtain the picture content information, e.g. region 1 contains a computer, a desk lamp, a bookshelf, books, and so on, while region 2 contains books, medicine, and so on. Further, weights may be assigned to the feature data within each region, e.g. the books in region 2 receive a larger weight and the medicine a smaller one. The output of the attention LSTM is combined with the attention weights to produce the final picture content information; after the different weights are assigned, the picture content information obtained from region 2 is "book";
S2-4: through the two-way communication between the semantic LSTM and the attention LSTM, the picture content information and the question information are combined by reasoning; the result of the final step of the semantic LSTM is passed to a softmax layer, and the word vector with the highest probability is output as the answer. Here the question points to region 2, so the answer "book" is output.
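The final softmax selection in S2-4 can be sketched as follows; the toy answer vocabulary and the score values are illustrative assumptions, not taken from the patent:

```python
import math

def softmax(scores):
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-step semantic-LSTM scores over a toy answer vocabulary.
vocab = ["bed", "book", "lamp", "desk"]
scores = [0.1, 2.3, 0.4, -0.5]

probs = softmax(scores)
answer = vocab[probs.index(max(probs))]    # word with the highest probability
```

With these made-up scores the softmax layer concentrates probability on "book", matching the worked example in which the question points to region 2.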
Based on the same inventive concept, as shown in Fig. 4, an embodiment of the present application further provides a visual question-answering system 400 based on multiple attention, comprising:
an information acquisition module 410, configured to obtain the picture information in a pending visual question-answering task and the question information corresponding to the picture information;
a picture feature extraction module 420, configured to extract the picture feature data from the picture information;
a picture content acquisition module 430, configured to input the picture feature data and the question information simultaneously into the attention-based LSTM and to obtain, via the assigned attention weights, the picture content information corresponding to the picture feature data;
an answer output module 440, configured to make the semantic LSTM communicate with the attention LSTM so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the pending visual question-answering task.
In an optional embodiment of the present application, the picture feature extraction module 420 is configured to:
extract at least one feature region from the picture information using a region convolutional neural network (R-CNN); extract at least one piece of region feature information within each feature region using ResNet; and, for each feature region, screen the region feature information of the feature region according to the overlap degree (IoU), average-pool the screened region feature information, and thereby obtain the region feature data of each feature region.
In an optional embodiment of the present application, as shown in Fig. 5, the above system may further include:
a pre-training module 450, configured to pre-train the R-CNN and/or ResNet on a preset dataset: each region feature of a picture contained in the preset dataset is fused with a vector representing its true class, and the fused vector is passed to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
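A minimal sketch of the pre-training module's fuse-then-classify step; the concatenation used as the fusion operator, the layer sizes, and the weight values are all illustrative assumptions (the patent fixes neither the fusion operator nor the layer shapes):

```python
import math

def fuse(region_feature, class_vector):
    # Fusion by concatenation: one common choice, assumed here.
    return region_feature + class_vector

def fully_connected(vec, weights, bias):
    # One dense layer: out[i] = sum_j vec[j] * weights[i][j] + bias[i]
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative numbers: a 2-dim region feature fused with a 2-dim
# true-class vector, classified into {attribute, non-attribute}.
fused = fuse([0.5, 1.0], [0.0, 1.0])          # 4-dim fused vector
weights = [[0.2, 0.1, 0.0, 0.3],              # hypothetical FC weights
           [0.1, 0.0, 0.2, 0.1]]
bias = [0.0, 0.1]
probs = softmax(fully_connected(fused, weights, bias))
```

The design point is that the true-class vector travels with the region feature through the output layer, so the classifier is trained to separate attribute from non-attribute regions with the label information fused in.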
In an optional embodiment of the present application, the picture content acquisition module 430 is configured to:
input the picture feature data into the attention-based LSTM, where at each timestamp of the attention LSTM the input comprises the output of the semantic LSTM at the previous timestamp, each region feature datum, and the output of the attention LSTM at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
and obtain, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
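The per-region attention weighting this module describes can be sketched as a softmax over region scores followed by a weighted combination; the dot-product scoring below is an illustrative stand-in for the attention LSTM's internal computation:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_feats, hidden):
    # Score each region against the current hidden state (dot product as
    # an illustrative stand-in), normalize the scores into attention
    # weights, then form the weighted content vector over all regions.
    scores = [sum(r * h for r, h in zip(region, hidden))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    content = [sum(w * region[j] for w, region in zip(weights, region_feats))
               for j in range(dim)]
    return weights, content

regions = [[1.0, 0.0], [0.0, 1.0]]   # two toy region-feature vectors
weights, content = attend(regions, [2.0, 0.0])
```

The hidden state here aligns with the first region, so that region receives the larger weight and dominates the content vector, mirroring how the books in region 2 come to dominate the picture content information in the worked example.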
In an optional embodiment of the present application, the answer output module 440 is configured to:
input the question information and the content information corresponding to each region feature datum simultaneously into the semantic LSTM, where the input at each timestamp of the semantic LSTM comprises the hidden-layer output of the attention LSTM, the output of the semantic LSTM at the previous timestamp, and one word vector from the question information;
and output, based on the semantic LSTM, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is output as the answer to the pending visual question-answering task.
The embodiments of the present application combine the two major fields of computer vision (CV) and natural language processing (NLP) and provide a visual question-answering method and system based on multiple attention. In the method provided by the embodiments, the picture information and the corresponding question information of a pending visual question-answering task are first obtained and the picture feature data are extracted from the picture information; the picture feature data and the question information are then input simultaneously into a first long short-term memory network based on an attention mechanism, the picture content information is obtained via the assigned attention weights, the question and the picture are combined through the two-way interaction of the two long short-term memory networks, and the answer to the pending visual question-answering task is output.
In the visual question-answering method and system based on multiple attention provided by the embodiments of the present application, memory modules are added to the computation of the R-CNN network to enrich the model's knowledge sources during training and to generate more diverse and reasonable answers. The end-to-end memory network fuses prior knowledge relevant to the question while extracting the question information, improving the overall accuracy of question answering.
From the following detailed description of specific embodiments of the application with reference to the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present application.
An embodiment of the present application further provides a computing device. Referring to Fig. 6, the computing device includes a memory 620, a processor 610, and a computer program stored in the memory 620 and executable by the processor 610; the computer program is stored in a space 630 for program code in the memory 620, and the computer program 631, when executed by the processor 610, implements any one of the steps of the method according to the invention.
An embodiment of the present application further provides a computer-readable storage medium. Referring to Fig. 7, the computer-readable storage medium includes a storage unit for program code, the storage unit being provided with a program 631' for executing the steps of the method according to the invention, the program being executed by a processor.
An embodiment of the present application further provides a computer program product comprising instructions which, when the computer program product is run on a computer, cause the computer to execute the steps of the method according to the invention.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer loads and executes the computer instructions, the processes or functions described in the embodiments of the present application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another by wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. a solid-state disk (SSD)).
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations shall not be considered as going beyond the scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the above method embodiments can be completed by a program instructing a processor; the program can be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random-access memory, read-only memory, flash memory, hard disk, solid-state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The above is only a preferred specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can easily be conceived by any person skilled in the art within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A visual question-answering method based on multiple attention, comprising:
obtaining picture information in a pending visual question-answering task and question information corresponding to the picture information;
extracting picture feature data from the picture information;
inputting the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism, and obtaining, via assigned attention weights, picture content information corresponding to the picture feature data;
making a second long short-term memory network, which performs semantic analysis, communicate with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the pending visual question-answering task.
2. The method according to claim 1, wherein extracting the picture feature data from the picture information comprises:
extracting at least one feature region from the picture information using a region convolutional neural network (R-CNN);
extracting at least one piece of region feature information within each feature region using ResNet;
for each feature region, screening the region feature information of the feature region according to an overlap degree (IoU), average-pooling the screened region feature information, and thereby obtaining the region feature data of each feature region.
3. The method according to claim 2, wherein, before extracting the picture feature data from the picture information, the method further comprises:
pre-training the R-CNN and/or ResNet on a preset dataset;
fusing each region feature of a picture contained in the preset dataset with a vector representing its true class;
passing the fused vector to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
4. The method according to claim 2, wherein inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism and obtaining, via the assigned attention weights, the picture content information corresponding to the picture feature data comprises:
inputting the picture feature data and the question information simultaneously into the first long short-term memory network based on the attention mechanism, wherein, at each timestamp of the first long short-term memory network, the input comprises the output of the second long short-term memory network at the previous timestamp, each region feature datum, and the output of the first long short-term memory network at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
obtaining, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
5. The method according to claim 4, wherein making the second long short-term memory network performing semantic analysis communicate with the first long short-term memory network, so as to reason over and combine the picture content information and the question information, and obtaining and outputting the answer to the pending visual question-answering task comprises:
inputting the question information and the content information corresponding to each region feature datum simultaneously into the second long short-term memory network performing semantic analysis, wherein the input at each timestamp of the second long short-term memory network comprises the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
outputting, based on the second long short-term memory network, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer to the pending visual question-answering task.
6. A visual question-answering system based on multiple attention, comprising:
an information acquisition module, configured to obtain picture information in a pending visual question-answering task and question information corresponding to the picture information;
a picture feature extraction module, configured to extract picture feature data from the picture information;
a picture content acquisition module, configured to input the picture feature data and the question information simultaneously into a first long short-term memory network based on an attention mechanism and to obtain, via assigned attention weights, picture content information corresponding to the picture feature data;
an answer output module, configured to make a second long short-term memory network, which performs semantic analysis, communicate with the first long short-term memory network so as to reason over and combine the picture content information and the question information, and to obtain and output the answer to the pending visual question-answering task.
7. The system according to claim 6, wherein the picture feature extraction module is configured to:
extract at least one feature region from the picture information using a region convolutional neural network (R-CNN);
extract at least one piece of region feature information within each feature region using ResNet;
for each feature region, screen the region feature information of the feature region according to an overlap degree (IoU), average-pool the screened region feature information, and thereby obtain the region feature data of each feature region.
8. The system according to claim 6, further comprising:
a pre-training module, configured to pre-train the R-CNN and/or ResNet on a preset dataset, fuse each region feature of a picture contained in the preset dataset with a vector representing its true class, and pass the fused vector to the fully connected output layer of the R-CNN and/or ResNet for a softmax classification into attribute and non-attribute classes.
9. The system according to claim 6, wherein the picture content acquisition module is configured to:
input the picture feature data into the first long short-term memory network based on the attention mechanism, wherein, at each timestamp of the first long short-term memory network, the input comprises the output of the second long short-term memory network at the previous timestamp, each region feature datum, and the output of the first long short-term memory network at the previous timestamp, and the output comprises the attention weight assigned to each region feature datum;
obtain, from each region feature datum and its attention weight, the content information corresponding to each region feature datum.
10. The system according to claim 6, wherein the answer output module is configured to:
input the question information and the content information corresponding to each region feature datum simultaneously into the second long short-term memory network performing semantic analysis, wherein the input at each timestamp of the second long short-term memory network comprises the hidden-layer output of the first long short-term memory network, the output of the second long short-term memory network at the previous timestamp, and one word vector from the question information;
output, based on the second long short-term memory network, the answer to the question: the computed result is passed to a softmax layer, and the word vector with the highest probability is selected and output as the answer to the pending visual question-answering task.
CN201910770172.XA 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention Active CN110516791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770172.XA CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Publications (2)

Publication Number Publication Date
CN110516791A true CN110516791A (en) 2019-11-29
CN110516791B CN110516791B (en) 2022-04-22

Family

ID=68627077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770172.XA Active CN110516791B (en) 2019-08-20 2019-08-20 Visual question-answering method and system based on multiple attention

Country Status (1)

Country Link
CN (1) CN110516791B (en)



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
CN107076567A (en) * 2015-05-21 2017-08-18 百度(美国)有限责任公司 Multilingual image question and answer
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
US20190130206A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. Interpretable counting in visual question answering
CN108170816A (en) * 2017-12-31 2018-06-15 厦门大学 A kind of intelligent vision Question-Answering Model based on deep neural network
KR20190092043A (en) * 2018-01-30 2019-08-07 연세대학교 산학협력단 Visual Question Answering Apparatus for Explaining Reasoning Process and Method Thereof
CN108829756A (en) * 2018-05-25 2018-11-16 杭州知智能科技有限公司 A method of more wheel video question and answer are solved using layering attention context network
CN109766427A (en) * 2019-01-15 2019-05-17 重庆邮电大学 A kind of collaborative virtual learning environment intelligent answer method based on stacking Bi-LSTM network and collaboration attention
CN109857909A (en) * 2019-01-22 2019-06-07 杭州一知智能科技有限公司 The method that more granularity convolution solve video conversation task from attention context network
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 The method for solving video question-answering task using the progressive space-time attention network of knowledge base
CN109902164A (en) * 2019-03-06 2019-06-18 杭州一知智能科技有限公司 It is two-way from the method for noticing that network solves open long format video question and answer using convolution
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110110043A (en) * 2019-04-11 2019-08-09 中山大学 A kind of multi-hop visual problem inference pattern and its inference method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Z. YANG ET AL.: "Stacked Attention Networks for Image Question Answering", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
WANG Yilei et al.: "Question Answering Algorithm for Fragmented Image Information Based on Deep Neural Networks", Journal of Computer Research and Development *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032535A (en) * 2019-12-24 2021-06-25 中国移动通信集团浙江有限公司 Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN113590770A (en) * 2020-04-30 2021-11-02 北京京东乾石科技有限公司 Point cloud data-based response method, device, equipment and storage medium
CN113590770B (en) * 2020-04-30 2024-03-08 北京京东乾石科技有限公司 Response method, device, equipment and storage medium based on point cloud data
CN112463936A (en) * 2020-09-24 2021-03-09 北京影谱科技股份有限公司 Visual question answering method and system based on three-dimensional information
CN112463936B (en) * 2020-09-24 2024-06-07 北京影谱科技股份有限公司 Visual question-answering method and system based on three-dimensional information
CN112559877A (en) * 2020-12-24 2021-03-26 齐鲁工业大学 CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN113010712A (en) * 2021-03-04 2021-06-22 天津大学 Visual question answering method based on multi-graph fusion
CN113283246A (en) * 2021-06-15 2021-08-20 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113283246B (en) * 2021-06-15 2024-01-30 咪咕文化科技有限公司 Visual interaction method, device, equipment and storage medium
CN113590879A (en) * 2021-08-05 2021-11-02 哈尔滨理工大学 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881427A (en) * 2023-09-05 2023-10-13 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium
CN116881427B (en) * 2023-09-05 2023-12-01 腾讯科技(深圳)有限公司 Question-answering processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110516791B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN110516791A (en) A kind of vision answering method and system based on multiple attention
CN106982359B (en) Binocular video monitoring method and system and computer readable storage medium
Rohrbeck Trend scanning, scouting and foresight techniques
CN110032630A (en) Talk about art recommendation apparatus, method and model training equipment
CN110222171A (en) A kind of application of disaggregated model, disaggregated model training method and device
Wang et al. Learning performance prediction via convolutional GRU and explainable neural networks in e-learning environments
CN110139067A (en) A kind of wild animal monitoring data management information system
Shahaf et al. Information cartography
Gambo et al. An artificial neural network (ann)-based learning agent for classifying learning styles in self-regulated smart learning environment
Liu et al. Hybrid design for sports data visualization using AI and big data analytics
CN116741411A (en) Intelligent health science popularization recommendation method and system based on medical big data analysis
Crockett et al. A fuzzy model for predicting learning styles using behavioral cues in an conversational intelligent tutoring system
Wang Exploration of data mining algorithms of an online learning behaviour log based on cloud computing
CN108604313A (en) The predictive modeling of automation and frame
CN115840796A (en) Event integration method, device, equipment and computer readable storage medium
Thoring et al. Toward a unified model of design knowledge
Sullivan et al. The mathematical corporation: Where machine intelligence and human ingenuity achieve the impossible
CN110908919A (en) Response test system based on artificial intelligence and application thereof
Obaid et al. Data-mining based novel neural-networks-hierarchical attention structures for obtaining an optimal efficiency
Tanna Decision support system for admission in engineering colleges based on entrance exam marks
Chinnasami Sivaji et al. A Review on Weight Process Method and its Classification
CN113761337B (en) Event prediction method and device based on implicit event element and explicit connection
Cheregi et al. Branding Romania in the age of disruption: Technology as a soft power instrument
CN113821610A (en) Information matching method, device, equipment and storage medium
Kersten Leveling the field: Talking levels in cognitive science

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Visual Q&A Method and System Based on Multiple Attention

Effective date of registration: 20230713

Granted publication date: 20220422

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278