CN110516530A - Image captioning method based on non-aligned multi-view feature enhancement - Google Patents
Image captioning method based on non-aligned multi-view feature enhancement
- Publication number
- CN110516530A (application CN201910615360.5A)
- Authority
- CN
- China
- Prior art keywords
- feature
- text
- description
- image
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an image captioning method based on non-aligned multi-view feature enhancement. The method comprises the following steps: 1. preprocess the image data and its natural-language description text; 2. build an attention encoding module based on non-aligned multi-view feature enhancement, which reconstructs the object-level visual features of each view; 3. build a description decoder based on multi-head attention (MHA); 4. train the model, learning the neural-network parameters with the back-propagation algorithm. The invention proposes a deep neural network for image captioning. In particular, it models image-description text pairs in a unified way, reasons over the object features of each non-aligned view of the image, and reconstructs the visual feature of each object, so that the image can be described more accurately; the method achieves good results in the image captioning field.
Description
Technical field
The present invention relates to a deep neural network architecture for the image captioning (Image Captioning) task. In particular, it relates to a method that models image-description pairs in a unified way and infers the correlations among the object features in the multiple views of an image, so that the image can be described more accurately.
Background art
Recent progress in deep learning has advanced both computer vision and natural language processing. These advances have made it possible to connect vision and language, and have promoted multi-modal, cross-media learning tasks such as image-text matching (Image-Text Matching), visual question answering (Visual Question Answering, VQA), visual grounding (Visual Grounding), and image captioning (Image Captioning).
Image captioning aims to automatically describe the content of an image with natural-language sentences. The task is challenging because it requires identifying the key objects in an image and understanding the relationships between them. Research on image captioning can be divided into three classes: template-based methods, retrieval-based methods, and generation-based methods. Template-based methods use a two-stage strategy to solve the task: 1) align sentence fragments (e.g., subject, object, and verb) with labels predicted from the image; 2) generate a sentence from the fragments using predefined language templates. One line of work uses a conditional random field (CRF) model to predict labels from the detected objects, attributes, and prepositions, and then generates the description sentence by filling the template blanks with the most probable labels. Yang et al. use a Hidden Markov Model (HMM) to select the optimal objects and verbs for generating the description. To alleviate the diversity problem, retrieval-based methods were proposed, which search a large-scale description database for the descriptions most related to a given image in terms of cross-modal similarity. Karpathy et al. propose a deep fragment embedding method that matches image-description pairs according to the connections between visual fragments (detected objects) and description fragments (subjects, objects, and verbs). At test time, cross-modal retrieval over the entire description database is performed to generate the description of an image. However, when the description database is very large, retrieval efficiency becomes the bottleneck of these methods, and limiting the size of the database may reduce description diversity. Different from the template-based and retrieval-based models, generation-based models aim to learn a language model that can generate novel descriptions with more flexible syntactic structures. To this end, recent work explores this direction by introducing neural networks for image captioning.
Due to their flexibility and outstanding performance, generation-based models have become the mainstream of image captioning. The most successful image captioning methods use an encoder-decoder (Encoder-Decoder) framework, inspired by the sequence-to-sequence models of machine translation. The framework consists of an image encoder based on a convolutional neural network (CNN), which extracts region-based visual features from the input image, and a description decoder based on a recurrent neural network (RNN), which iteratively generates the output words from the visual features. Encoder-decoder models are usually trained end to end to minimize a cross-entropy loss. Based on this framework, much recent work has made further improvements to image captioning models. For example, attention mechanisms can be seamlessly inserted into the framework to establish fine-grained connections between the output words and their associated image regions. To better understand the objects in an image, region-based bottom-up attention (Bottom-up-attention) features can be extracted from a pre-trained object detector to replace the traditional CNN convolutional features. To address the exposure bias (Exposure Bias) of descriptions generated with the cross-entropy loss, algorithms based on reinforcement learning (Reinforcement Learning, RL) have been designed to directly optimize the non-differentiable evaluation metrics (e.g., BLEU and CIDEr).
Although existing methods have achieved success, they have the following three limitations: (1) the current attention mechanisms in image captioning only model the inter-modal interaction (i.e., object-to-word co-attention) and ignore self-attention, the intra-modal interaction (i.e., word-to-word and object-to-object); (2) current image captioning models are shallow and their computational dimensionality is relatively low, so they may not fully understand the complex relationships among visual objects; (3) the region-based visual features of a single view may not cover all the objects in an image, resulting in a visual representation insufficient to generate an accurate description.
Summary of the invention
The present invention provides an image captioning method based on non-aligned multi-view feature enhancement: a deep neural network architecture for the image captioning (Image Captioning) task. The technical solution adopted by the present invention to solve its technical problem includes the following steps:
Step (1), data preprocessing: extract features from the image and text data.
1-1. Image preprocessing:
Detect the object entities contained in the image with multiple deep object-detection models, and extract the visual features X.
1-2. Text preprocessing:
1-2-1. Count the lengths of the description texts and determine the maximum length L of the generated description.
1-2-2. Segment the words, keep the N most frequent words, build a description dictionary from these N words, and replace each word of the description with its index in the dictionary, thereby converting the description text into a vector, finally of size L.
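As an illustrative sketch (not the patent's actual code), the dictionary building and index conversion of step 1-2 could look like the following; the helper names `build_dictionary` and `text_to_indices`, the pad index 0, and the toy corpus are assumptions:

```python
from collections import Counter

def build_dictionary(descriptions, n_words):
    # Keep the N most frequent words; index 0 is reserved for padding.
    counts = Counter(w for d in descriptions for w in d.split())
    vocab = [w for w, _ in counts.most_common(n_words)]
    return {w: i + 1 for i, w in enumerate(vocab)}

def text_to_indices(description, dictionary, max_len):
    # Replace each word with its dictionary index, truncate/pad to max_len (L).
    idx = [dictionary.get(w, 0) for w in description.split()[:max_len]]
    return idx + [0] * (max_len - len(idx))

corpus = ["a man rides a horse", "a dog runs on grass"]
dictionary = build_dictionary(corpus, n_words=8)
vec = text_to_indices("a man runs", dictionary, max_len=16)
print(len(vec))  # 16: a fixed-size vector of length L
```

The fixed length makes every description batchable regardless of its original word count.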
Step (2), attention encoding module based on non-aligned multi-view feature enhancement.
Its structure is shown in Figure 1. For the M non-aligned view features of the input, each view feature contains the visual features of the object entities detected by the corresponding detector. Each view feature is first passed through its own linear layer to reduce its dimensionality, and the results are input into a multi-head attention module (Multi-Head Attention, MHA) to be reconstructed; the structure of the module is shown in Figure 2. The reconstructed visual feature V^A is added to the input and layer normalization (Layer Normalization) is applied to obtain F, as shown in Figure 1. F is input into a feed-forward network (Feed-Forward Network, FFN); the output is again added to F and layer-normalized, yielding the visual feature F_L that contains the non-aligned multi-view information.
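A minimal NumPy sketch of one encoder cycle of step (2), using single-head dot-product attention as a stand-in for the full multi-head module and omitting the learned attention projections and dropout; the toy dimension `d = 8` and object count `k = 5` are assumptions, not the patent's 512-dimensional, 100-object setting:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_cycle(v1, v2, w1, w2):
    # Reconstruct V1 against view V2, residual + LayerNorm,
    # then FFN with residual + LayerNorm, as described in step (2).
    va = attention(v1, v2, v2)
    f = layer_norm(va + v1)
    ffn = np.maximum(w1 @ f.T, 0.0)          # ReLU(W_F1 x^T)
    fl = layer_norm((w2 @ ffn).T + f)        # W_F2 (...) + residual
    return fl

rng = np.random.default_rng(0)
k, d = 5, 8                                  # 5 objects per view, toy dim 8
v1, v2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
w1, w2 = rng.normal(size=(4 * d, d)), rng.normal(size=(d, 4 * d))
print(encoder_cycle(v1, v2, w1, w2).shape)   # (5, 8): same shape as V1
```

The residual-plus-normalization pattern keeps the output in the same shape as the first view, so the cycle can be stacked (six times in the embodiment).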
Step (3), build the description decoder.
Its structure is shown in Figure 3. The description text is first converted into word vectors according to the GloVe vocabulary. Since text generation needs to predict the next word from the words generated so far, the generated word vectors are mapped to a high-dimensional space and passed through a Long Short-Term Memory network (Long Short-Term Memory, LSTM). Its output vector q is input into a masked multi-head attention module (Multi-Head Attention (Mask), MHA (Mask)) that learns the relationships within q and reconstructs it. The reconstructed text feature q^A is added to the original text feature q to obtain the description text feature q~ that contains the internal correlation information. The description text feature q~ and the visual feature F_L of the multi-view information obtained by the attention encoding module are then input together into another multi-head attention module (Multi-Head Attention, MHA), which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into a new feature F_q. Similar to the attention encoding module, F_q is added to the text feature q~ and layer normalization (Layer Normalization) is applied; the result is input into a feed-forward network (Feed-Forward Network, FFN), whose output is again added and layer-normalized to obtain the feature G that contains both the visual and the textual information. After passing G through a linear layer (Linear), a probability distribution is generated with the Softmax function, and this probability output is taken as the predicted output of the network.
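The cross-modal step of the decoder — text positions querying the multi-view visual feature F_L — can be sketched as follows, again with single-head dot-product attention as a stand-in for MHA and toy shapes as assumptions:

```python
import numpy as np

def cross_attention(q_text, f_visual):
    # Each of the l text positions queries the k visual objects,
    # producing a reconstructed feature F_q with the text's shape.
    scores = q_text @ f_visual.T / np.sqrt(q_text.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # rows sum to 1 over objects
    return w @ f_visual

rng = np.random.default_rng(1)
q_tilde = rng.normal(size=(16, 8))           # 16 text positions, toy dim 8
f_l = rng.normal(size=(100, 8))              # 100 visual objects
f_q = cross_attention(q_tilde, f_l)
print(f_q.shape)                             # (16, 8): text-shaped output
```

Because the attention weights form a convex combination over the visual objects, each output row stays within the range of the visual features it attends to.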
Step (4), model training.
According to the difference between the generated prediction and the actual description of the image, the model parameters of the neural networks in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Feature extraction is performed on image i with M existing deep neural network feature extractors, which respectively extract the M non-aligned multi-view features D, where D = {V_1, V_2, ..., V_M}; each view feature contains k objects of the image, with V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048.
1-2. For a given description text, first count the distinct words in the description corpus; embed the words of the vocabulary according to the GloVe word-vector matrix, converting each word into a fixed-length word vector; then input the vectors into an LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)   (formula 1)
where w_k is the word vector corresponding to the k-th word in the GloVe word-vector matrix, and l denotes the length of the description text.
The attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} are first passed through different linear layers for dimensionality reduction; the reduced dimension is d. The reduced features V_2, V_3, ..., V_M are then each input, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i)   (formula 2)
2-2. V_i^A is added to the visual feature V_1, and the output is layer-normalized to obtain F, with the following formula:
F = LayerNorm(V_i^A + V_1)   (formula 3)
2-3. F is input into the feed-forward network; the output is again added to F, and layer normalization yields the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)   (formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))   (formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d).
2-4. F_L is again taken as input to the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 such cycles in total, the final feature is obtained.
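Formula (5)'s feed-forward network can be sketched in NumPy as below; the expansion factor 4 (512 → 2048 → 512 in the embodiment) follows the text, while the weight initialisation and the identity behaviour of dropout at inference are assumptions:

```python
import numpy as np

def ffn(x, w1, w2, drop_p=0.0, rng=None):
    # FFN(x) = W_F2 · Dropout(ReLU(W_F1 · x^T))  (formula 5)
    h = np.maximum(w1 @ x.T, 0.0)            # ReLU(W_F1 x^T), shape (4d, n)
    if drop_p > 0.0 and rng is not None:     # inverted dropout (training only)
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    return (w2 @ h).T                        # back to shape (n, d)

d = 8
rng = np.random.default_rng(2)
w1 = rng.normal(size=(4 * d, d)) * 0.1       # W_F1: d -> 4d
w2 = rng.normal(size=(d, 4 * d)) * 0.1       # W_F2: 4d -> d
x = rng.normal(size=(5, d))
print(ffn(x, w1, w2).shape)                  # (5, 8)
```

With `drop_p=0.0` the function is deterministic, matching inference-time behaviour.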
Building the MHA-based description decoder in step (3) is as follows:
3-1. The description text feature q is input into the masked multi-head attention module to obtain the reconstructed description text feature q^A, with the following formula:
q^A = MHA_mask(q, q, q)   (formula 6)
3-2. The reconstructed text feature q^A is added to the original text feature q, and the output is layer-normalized to obtain q~, with the following formula:
q~ = LayerNorm(q^A + q)   (formula 7)
3-3. The visual feature F_L containing the non-aligned multi-view information obtained in step (2) and the q~ obtained in 3-2 are input into the second multi-head attention module, which associates q~ with F_L to infer the reconstructed feature F_q, with the following formula:
F_q = MHA(q~, F_L, F_L)   (formula 8)
3-4. The feature F_q is added to the feature q~, and the output is layer-normalized to obtain G~, with the following formula:
G~ = LayerNorm(F_q + q~)   (formula 9)
3-5. G~ is input into the feed-forward network; the output is again added to G~ and layer-normalized to obtain the feature G containing both visual and textual information, with the following formula:
G = LayerNorm(FFN(G~) + G~)   (formula 10)
3-6. G is again input into the next description decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. G_6 is passed through a linear layer, the Softmax function generates the probabilities, and this probability output is taken as the predicted vector of the network:
p″ = softmax(Linear(G_6))   (formula 11)
The model training in step (4) is as follows:
The actual description text answer is one-hot encoded (one-hot encoding) so that it can be compared with the predicted vector p″, and the loss is computed with the cross-entropy loss (Cross Entropy Loss) function. Let N be the size of the description dictionary, y the index of the actual description word, and p″ the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = −Σ_{i=1..N} 1[i = y] log p″_i = −log p″_y   (formula 12)
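With a one-hot target, the cross-entropy loss of step (4) reduces to the negative log-probability of the correct index. A minimal sketch (the tiny vocabulary of N = 4 and the toy logits are assumptions):

```python
import numpy as np

def cross_entropy(p_pred, y_index):
    # One-hot target: loss = -sum_i y_i * log(p_i) = -log(p[y]).
    return -np.log(p_pred[y_index])

logits = np.array([1.0, 3.0, 0.5, 2.0])
p = np.exp(logits - logits.max())
p /= p.sum()                                 # softmax over N = 4 "words"
loss = cross_entropy(p, y_index=1)
print(round(float(loss), 4))                 # smallest when y is the most probable word
```

Minimizing this loss pushes probability mass toward the actual description word at each position.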
The beneficial effects of the present invention are as follows:
The present invention models image-description pairs in a unified way, reasons over the object features of the image, and associates the visual and textual information, so that the image can be described more accurately. The present invention first introduces non-aligned multi-view features and uses multi-head attention modules to perform cooperative reasoning over the object features of the image; combined with existing visual description techniques, it can effectively improve both the accuracy and the language fluency of image captioning.
Effects of the present invention: 1. A deep encoder-decoder model stacked entirely from attention modules captures both the self-attention within each modality and the co-attention across different modalities, solving the first and second limitations. 2. Non-aligned multi-view image features with stronger expressive power are used, and the self-attention among the image features further supplements and improves their expressive power, solving the third limitation.
The present invention has a small number of parameters; it is lightweight and efficient, which facilitates more efficient distributed training and deployment on memory-limited specific hardware.
Detailed description of the drawings
Fig. 1: the deep image captioning model architecture based on non-aligned multi-view feature enhancement
Fig. 2: the multi-head attention module (Multi-Head Attention, MHA)
Fig. 3: the description decoder
Specific embodiments
The detailed parameters of the invention are further elaborated below.
As shown in Figure 1, the present invention provides a deep neural network framework for image captioning (Image Captioning).
The data preprocessing and feature extraction for images and text in step (1) is as follows:
1-1. For the image data, the present invention uses the MS-COCO dataset as training and test data, and uses an existing Faster R-CNN model based on ResNet-101 and a Faster R-CNN model based on ResNet-152 to extract the visual features of two non-aligned views. Specifically, the image data is input separately into the two Faster R-CNN networks; each Faster R-CNN model detects and outlines 100 objects in the image and extracts a 2048-dimensional visual feature for each object, yielding the visual features D, where D = {V_1, V_2}.
1-2. For the description texts, the distinct words in the description corpus are counted first; all words whose corpus frequency is higher than 5 and that appear in the dictionary provided by GloVe — 9347 words in total — are recorded in the dictionary.
1-3. Only the first 16 words of each description sentence are taken; if a sentence has fewer than 16 words, it is padded with null characters. Each word is replaced by its index in the word dictionary generated in 1-2, completing the conversion from strings to numerical values, so that each description is converted into a vector of 16 word indices.
1-4. For the 16-dimensional index vector generated in 1-3, word embedding converts each word index into the corresponding word vector of the GloVe dictionary matrix; the word-vector size used is 300. Each description text therefore becomes a 16 × 300 matrix. The word vector of each time step is then used as the input of the LSTM, a recurrent neural network structure, whose output is the vector q of size 16 × 512.
The attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The input non-aligned multi-view visual features D are first reduced in dimensionality: the dimension of each object in each view drops from 2048 to 512, so each reduced view feature has size 100 × 512.
2-2. The features V_1 and V_2 of the multi-view feature D are input into the multi-head attention module for association, and reconstruction by formula (2) yields the 512-dimensional V^A.
2-3. The reconstructed 100 × 512 V^A is added to the 100 × 512 V_1; the output is layer-normalized to obtain the feature F of size 100 × 512.
2-4. F is passed through the feed-forward neural network: inside it, a first linear layer maps F from 512 to 2048 dimensions, and a second linear layer maps it back; after adding the result to F, layer normalization yields the visual feature F_L containing the multi-view information, whose size is still 100 × 512.
2-5. The F_L obtained in 2-4 is input, together with V_2, into the next attention encoding module; after 6 cycles in total, the final visual feature is obtained.
Building the description decoder in step (3) is as follows:
3-1. The 16 × 512 description text feature q obtained in 1-4 is input into the masked multi-head attention module for association, and reconstruction by formula (6) yields the 16 × 512 q^A.
3-2. The reconstructed 16 × 512 q^A is added to the 16 × 512 q; the output is layer-normalized to obtain the feature q~ of size 16 × 512.
3-3. The 100 × 512 visual feature containing the multi-view information obtained in 2-5 and the 16 × 512 description text feature q~ are input into the second multi-head attention module, which associates q~ with the visual feature to infer the 16 × 512 reconstructed feature F_q.
3-4. The reconstructed 16 × 512 F_q is added to the 16 × 512 q~; the output is layer-normalized to obtain the feature G~ of size 16 × 512.
3-5. The feature G~ is passed through the feed-forward neural network: inside it, a first linear layer maps G~ to 2048 dimensions, and a second linear layer maps it back to 512 dimensions; after adding the result to G~, layer normalization yields the feature G containing both visual and textual information, whose size is still 16 × 512.
3-6. G is again input into the next description decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. The G_6 generated above is passed through a linear layer and a softmax operation in turn; the final output is a 9347-dimensional word-prediction vector, in which each element represents the predicted probability that the word with that index is the correct word.
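Reading the predicted word out of the 9347-dimensional softmax output of step 3-7 amounts to an argmax over the probability vector; a toy sketch (the 4-word vocabulary and the scores are assumptions standing in for the real 9347-word dictionary):

```python
import numpy as np

def predict_word(logits, index_to_word):
    # Softmax turns the linear-layer scores into probabilities;
    # the most probable index names the predicted word.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    i = int(np.argmax(p))
    return index_to_word[i], float(p[i])

vocab = {0: "a", 1: "man", 2: "horse", 3: "rides"}
word, prob = predict_word(np.array([0.2, 2.5, 0.1, 0.4]), vocab)
print(word)  # "man": the highest-scoring index
```

At generation time this lookup is repeated position by position until the sentence is complete.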
The model training in step (4) is as follows:
The 9347-dimensional prediction vector generated in step (3) is compared with the correct word of the description; the defined loss function computes the difference between the predicted value and the actual correct value to form the loss, and according to this loss the back-propagation (BP) algorithm adjusts the parameter values of the whole network, so that the gap between the predictions generated by the network and the actual values gradually decreases, until the network converges.
Claims (5)
1. An image captioning method based on non-aligned multi-view feature enhancement, characterized by comprising the following steps:
Step (1), data preprocessing: extract features from the image and text data;
Image preprocessing: detect the object entities contained in the image with multiple deep object-detection models and extract the visual features X;
Text preprocessing:
First, count the lengths of the description texts and determine the maximum length L of the generated description;
Second, segment the words, keep the N most frequent words, build a description dictionary from these N words, and replace each word of the description with its index in the dictionary, thereby converting the description text into a vector, finally of size L;
Step (2), attention encoding module based on non-aligned multi-view feature enhancement:
For the M non-aligned view features of the input, each view feature contains the visual features of the object entities detected by the corresponding detector; each view feature is passed through its own linear layer to reduce its dimensionality, and the results are input into a multi-head attention module to be reconstructed; the reconstructed visual feature V^A is added to the input and normalized to obtain F; F is input into a feed-forward network, and the output is again added to F and normalized to obtain the visual feature F_L containing the non-aligned multi-view information;
Step (3), build the description decoder:
The description text is first converted into word vectors according to the GloVe vocabulary; since text generation needs to predict the next word from the words generated so far, the generated word vectors are mapped to a high-dimensional space and passed through a Long Short-Term Memory network; its output vector q is input into a masked multi-head attention module that learns the relationships within q and reconstructs it; the reconstructed text feature q^A is added to the original vector q to obtain the description text feature q~ containing the internal correlation information; the description text feature q~ and the visual feature F_L obtained by the attention encoding module are input together into another multi-head attention module, which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into a new feature F_q; the feature F_q is then added to the description text feature q~ and normalized, the result is input into a feed-forward network, and the output is again added and normalized to obtain the feature G containing both visual and textual information; after passing G through a linear layer, the Softmax function generates the probabilities, and this probability output is taken as the predicted output of the network;
Step (4), model training:
According to the difference between the generated prediction and the actual description of the image, the model parameters of the neural networks in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image captioning method based on non-aligned multi-view feature enhancement according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Image preprocessing: feature extraction is performed on image i with M existing deep neural network feature extractors, which respectively extract the M view features D, where D = {V_1, V_2, ..., V_M}; each view feature contains k objects of the image, with V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048;
1-2. Text preprocessing for the given data: first count the distinct words of the description corpus; embed the words of the vocabulary according to the GloVe word-vector matrix, converting each word into a fixed-length word vector; then input the vectors into an LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)   (formula 1)
where w_k is the word vector corresponding to the k-th word in the GloVe word-vector matrix, and l denotes the length of the description text.
3. The image captioning method based on non-aligned multi-view feature enhancement according to claim 2, characterized in that the attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} are first passed through different linear layers for dimensionality reduction; the reduced dimension is d; the reduced features V_2, V_3, ..., V_M are then each input, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i)   (formula 2)
2-2. V_i^A is added to the visual feature V_1, and the output is layer-normalized to obtain F, with the following formula:
F = LayerNorm(V_i^A + V_1)   (formula 3)
2-3. F is input into the feed-forward network; the output is again added to F, and layer normalization yields the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)   (formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))   (formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d);
2-4. The newly obtained F_L is again taken as input to the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 cycles in total, the final feature is obtained.
4. a kind of image answering method based on the enhancing of non-alignment multiple view feature according to claim 3, feature exist
The dramatic decoder of power module is paid attention to described in step (3) based on multiterminal, specific as follows:
3-1. by describe text feature q input tape mask multiterminal attention resume module reconstructed after description text featureSpecific formula is as follows:
3-2. is by the text Expressive Features q after reconstructAIt is added with original text description vectors q, output is by layer normalization
Reason obtainsItself specific formula is as follows:
3-3. The visual feature F_L containing the non-aligned multi-view information obtained in step (2) and the q̄ obtained in step 3-2 are input into a second multi-head attention module, which associates q̄ with the feature F_L to obtain the reconstructed feature F_q. The specific formula is as follows:
F_q = MultiHead(q̄, F_L, F_L) (formula 8)
3-4. The feature F_q is added to the feature q̄, and the output is passed through layer normalization to obtain F̄_q. The specific formula is as follows:
F̄_q = LayerNorm(F_q + q̄) (formula 9)
3-5. F̄_q is input into a feed-forward network, the resulting output is added back to F̄_q, and the sum is passed through layer normalization to obtain the feature G containing both visual and text information. The specific formula is as follows:
G = LayerNorm(FFN(F̄_q) + F̄_q) (formula 10)
3-6. G is then input into the next decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. G_6 is passed through one more linear layer, and the Softmax function is used to generate probabilities, which are output by the network as the predicted vector:
p″ = softmax(Linear(G_6)) (formula 11).
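Step 3-7 is a standard classification head: a linear projection of the final decoder feature followed by Softmax (formula 11). A minimal sketch with hypothetical sizes (feature dimension 8, a 20-word dictionary) and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 8, 20  # hypothetical feature dimension and description-dictionary size N

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

G6 = rng.normal(size=d)                      # final decoder feature from step 3-6
W, b = rng.normal(size=(d, N)), np.zeros(N)  # the linear layer of step 3-7
p = softmax(G6 @ W + b)                      # formula 11: p'' = softmax(Linear(G6))
word_index = int(p.argmax())                 # index of the predicted word
print(p.sum())
```

At inference time, the argmax over p selects the next description word from the dictionary.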
5. The image description method based on non-aligned multi-view feature enhancement according to claim 4, characterized in that the model training described in step (4) is specified as follows:
The actual description-text answer is converted by one-hot encoding so that it can be compared with the predicted vector p″, and the loss is then computed with the cross-entropy loss function. Let N be the size of the description-text dictionary, y the index corresponding to the actual description word, and p″ the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = −log p″_y (formula 12)
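Because the target is one-hot, the cross-entropy sum over the dictionary collapses to the negative log-probability assigned to the actual word index y. A toy numerical check with a hypothetical 3-word dictionary:

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])  # hypothetical predicted vector p'' over an N = 3 dictionary
y = 1                          # index of the actual description word
one_hot = np.eye(3)[y]         # one-hot encoding of the answer

# the full cross-entropy sum and its one-hot shortcut agree
loss_full = -float((one_hot * np.log(p)).sum())
loss_short = -float(np.log(p[y]))
print(round(loss_short, 4))  # 0.3567
```

Training minimizes this quantity averaged over all words of the generated description.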
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910615360.5A CN110516530A (en) | 2019-07-09 | 2019-07-09 | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516530A true CN110516530A (en) | 2019-11-29 |
Family
ID=68622410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910615360.5A Withdrawn CN110516530A (en) | 2019-07-09 | 2019-07-09 | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516530A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108614815A (en) * | 2018-05-07 | 2018-10-02 | East China Normal University | Sentence exchange method and device |
CN109902166A (en) * | 2019-03-12 | 2019-06-18 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Visual question-answering model, electronic equipment and storage medium |
CN109906460A (en) * | 2016-11-04 | 2019-06-18 | Salesforce.com, Inc. | Dynamic coattention network for question answering |
Non-Patent Citations (3)
Title |
---|
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) * |
JAY ALAMMAR: "The Illustrated Transformer", blog post * |
PETER ANDERSON et al.: "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering", CVPR 2018 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832504A (en) * | 2020-07-20 | 2020-10-27 | Space Engineering University, PLA Strategic Support Force | Space information intelligent integrated generation method for satellite in-orbit application |
CN112200031A (en) * | 2020-09-27 | 2021-01-08 | Shanghai Eye Control Technology Co., Ltd. | Network model training method and equipment for generating image corresponding word description |
CN113139378A (en) * | 2021-03-18 | 2021-07-20 | Hangzhou Dianzi University | Image description method based on visual embedding and condition normalization |
CN113139378B (en) * | 2021-03-18 | 2022-02-18 | Hangzhou Dianzi University | Image description method based on visual embedding and condition normalization |
CN113283248A (en) * | 2021-04-29 | 2021-08-20 | Guilin University of Electronic Technology | Automatic natural language generation method and device for scatter diagram description |
CN113283248B (en) * | 2021-04-29 | 2022-06-21 | Guilin University of Electronic Technology | Automatic natural language generation method and device for scatter diagram description |
CN113191357A (en) * | 2021-05-18 | 2021-07-30 | China University of Petroleum (East China) | Multilevel image-text matching method based on graph attention network |
CN113657478B (en) * | 2021-08-10 | 2023-09-22 | Beihang University | Three-dimensional point cloud visual positioning method based on relational modeling |
CN113657478A (en) * | 2021-08-10 | 2021-11-16 | Beihang University | Three-dimensional point cloud visual positioning method based on relational modeling |
CN114693940A (en) * | 2022-03-22 | 2022-07-01 | University of Electronic Science and Technology of China | Image description method for enhancing feature mixing resolvability based on deep learning |
CN114693940B (en) * | 2022-03-22 | 2023-04-28 | University of Electronic Science and Technology of China | Image description method with enhanced feature mixing decomposability based on deep learning |
CN114913403A (en) * | 2022-07-18 | 2022-08-16 | Nanjing University of Information Science and Technology | Visual question-answering method based on metric learning |
CN115601553A (en) * | 2022-08-15 | 2023-01-13 | Hangzhou Lianhui Technology Co., Ltd. | Visual model pre-training method based on multi-level picture description data |
CN115601553B (en) * | 2022-08-15 | 2023-08-18 | Hangzhou Lianhui Technology Co., Ltd. | Visual model pre-training method based on multi-level picture description data |
CN116484905A (en) * | 2023-06-20 | 2023-07-25 | Hefei High-Dimensional Data Technology Co., Ltd. | Deep neural network model training method for non-aligned samples |
CN116484905B (en) * | 2023-06-20 | 2023-08-29 | Hefei High-Dimensional Data Technology Co., Ltd. | Deep neural network model training method for non-aligned samples |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516530A (en) | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature | |
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN108415977B (en) | Deep neural network and reinforcement learning-based generative machine reading understanding method | |
Zhou et al. | Deep semantic dictionary learning for multi-label image classification | |
Amritkar et al. | Image caption generation using deep learning technique | |
CN110222349A (en) | A kind of model and method, computer of the expression of depth dynamic context word | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN113657124A (en) | Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer | |
CN112115687B (en) | Method for generating problem by combining triplet and entity type in knowledge base | |
Bin et al. | Entity slot filling for visual captioning | |
CN115221846A (en) | Data processing method and related equipment | |
Huang et al. | C-Rnn: a fine-grained language model for image captioning | |
CN116415170A (en) | Prompt learning small sample classification method, system, equipment and medium based on pre-training language model | |
CN111062865B (en) | Image processing method, image processing device, computer equipment and storage medium | |
Suresh et al. | Image captioning encoder–decoder models using cnn-rnn architectures: A comparative study | |
Xue et al. | LCSNet: End-to-end lipreading with channel-aware feature selection | |
Deng et al. | A position-aware transformer for image captioning | |
CN113609326A (en) | Image description generation method based on external knowledge and target relation | |
Ludwig et al. | Deep embedding for spatial role labeling | |
Zhang et al. | Self-attention for incomplete utterance rewriting | |
CN117197569A (en) | Image auditing method, image auditing model training method, device and equipment | |
Abdelaziz et al. | Few-shot learning with saliency maps as additional visual information | |
US20230154221A1 (en) | Unified pretraining framework for document understanding | |
Li et al. | Image captioning with weakly-supervised attention penalty |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20191129 |