CN110516530A - An image captioning method based on non-aligned multi-view feature enhancement - Google Patents

An image captioning method based on non-aligned multi-view feature enhancement

Info

Publication number
CN110516530A
CN110516530A
Authority
CN
China
Prior art keywords
feature
text
description
image
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910615360.5A
Other languages
Chinese (zh)
Inventor
俞俊 (Yu Jun)
余宙 (Yu Zhou)
李敬 (Li Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201910615360.5A
Publication of CN110516530A
Legal status: Withdrawn

Classifications

    • G: Physics
        • G06: Computing; Calculating or Counting
            • G06F: Electric Digital Data Processing
                • G06F 18/00: Pattern recognition
                    • G06F 18/20: Analysing
                        • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
            • G06N: Computing Arrangements Based on Specific Computational Models
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/04: Architecture, e.g. interconnection topology
                            • G06N 3/045: Combinations of networks
                            • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
                        • G06N 3/08: Learning methods
                            • G06N 3/084: Backpropagation, e.g. using gradient descent
            • G06V: Image or Video Recognition or Understanding
                • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
                    • G06V 30/40: Document-oriented image-based pattern recognition
                        • G06V 30/41: Analysis of document content
                            • G06V 30/413: Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an image captioning method based on non-aligned multi-view feature enhancement. The method comprises the following steps: 1. data preprocessing of the images and of the natural-language texts that describe them; 2. an attention encoding model based on non-aligned multi-view feature enhancement, which reconstructs the object-level visual features within each view; 3. an attention decoder based on multi-head attention (MHA); 4. model training, in which the neural network parameters are trained with the back-propagation algorithm. The invention proposes a deep neural network for image captioning; in particular, it models image-description text data in a unified way, reasons over the object features in each non-aligned view of the image, and reconstructs the visual feature of each object, a method that describes images more accurately and achieves better results in the field of image captioning.

Description

An image captioning method based on non-aligned multi-view feature enhancement
Technical field
The present invention relates to a deep neural network architecture for the image captioning task, and in particular to a method that models image-description data in a unified way and derives the correlations among the object features in the multiple views of an image, so that the image can be described more accurately.
Background art
Recent progress in deep learning has driven great advances in both computer vision and natural language processing. These advances have made it possible to connect vision and language, promoting multi-modal, cross-media learning tasks such as image-text matching, visual question answering (Visual Question Answering, VQA), visual grounding, and image captioning.
Image captioning aims to describe the content of an image automatically with natural-language sentences. The task is challenging because it requires identifying the key objects in an image and understanding the relationships between them. Research on image captioning can be divided into three classes: template-based methods, retrieval-based methods, and generation-based methods. Template-based methods solve the task with a two-stage strategy: 1) align sentence fragments (e.g., subject, object, and verb) with labels predicted from the image; 2) generate the sentence from the fragments using predefined language templates. One line of work uses a conditional random field (CRF) model to predict labels from the detected objects, attributes, and prepositions, and then generates the description sentence by filling the template blanks with the most probable labels. Yang et al. use a hidden Markov model (HMM) to select the optimal objects and verbs for generating descriptions. To alleviate the diversity problem, retrieval-based methods were proposed, which search a large-scale description database for the descriptions most relevant to a given image in terms of cross-modal similarity. Karpathy et al. propose a deep fragment embedding method that matches image-description pairs according to the connections between visual fragments (detected objects) and description fragments (subjects, objects, and verbs). At test time, a cross-modal retrieval over the entire description database is executed to produce the description of an image. However, when the description database is very large, retrieval efficiency becomes the bottleneck of these methods, while limiting the size of the database may reduce description diversity. Different from the template-based and retrieval-based models, generation-based models aim to learn a language model that generates novel descriptions with more flexible syntactic structures. To this end, recent work explores this direction by introducing neural networks for image captioning.
Owing to their flexibility and outstanding performance, generation-based models have become the mainstream of image captioning. The most successful image captioning methods adopt the encoder-decoder (Encoder-Decoder) framework, inspired by the sequence-to-sequence model in machine translation. The framework consists of an image encoder based on a convolutional neural network (CNN), which extracts region-based visual features from the input image, and a description decoder based on a recurrent neural network (RNN), which iteratively generates the output words from the visual features. Encoder-decoder models are usually trained end-to-end to minimize a cross-entropy loss. On top of this framework, much recent work has introduced improvements that further raise captioning performance. For example, attention mechanisms can be seamlessly inserted into the framework to establish fine-grained connections between the generated words and their associated image regions. To better understand the objects in an image, region-based bottom-up-attention features can be extracted from a pre-trained object detector to replace traditional CNN convolutional features. To resolve the exposure bias of descriptions generated under the cross-entropy loss, algorithms based on reinforcement learning (Reinforcement Learning, RL) were designed to directly optimize non-differentiable evaluation metrics (e.g., BLEU and CIDEr).
Although existing methods have achieved success, they have the following three limitations: (1) the attention mechanisms in current image captioning only model the co-attention between modalities (i.e., object-to-word) and ignore self-attention, the interaction within a modality (i.e., word-to-word and object-to-object); (2) current image captioning models are shallow and of relatively low computational dimensionality, so they may not fully capture the complex relationships among visual objects; (3) the region-based visual features of a single view may not cover all the objects in an image, leading to visual representations insufficient for generating accurate descriptions.
Summary of the invention
The present invention provides an image captioning method based on non-aligned multi-view feature enhancement: a deep neural network framework for the image captioning task. The technical solution adopted by the present invention to solve its technical problem comprises the following steps:
Step (1): data preprocessing, extracting features from the image and text data:
1-1. Image preprocessing:
Detect the object entities contained in the image with multiple deep object-detection models and extract the visual features X.
1-2. Text preprocessing:
1-2-1. Count the lengths of the description texts and determine the maximum length L of the generated descriptions.
1-2-2. Tokenize the texts, keep the N most frequent words, and build the description text dictionary from these N words. Replace each word of a description with its index in the dictionary, thereby converting the description text into a vector; finally, each description becomes a vector of size L.
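By way of illustration, here is a minimal Python sketch of steps 1-2-1 and 1-2-2 under stated assumptions: a plain whitespace tokenizer, index 0 reserved for padding, and toy captions; the embodiment below uses N = 9347 and L = 16.

```python
from collections import Counter

def build_vocab(captions, n_words):
    # Keep the n_words most frequent tokens; index 0 is reserved for padding.
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(n_words))}

def encode(caption, vocab, max_len):
    # Replace each word by its dictionary index, then truncate/pad to max_len.
    idx = [vocab.get(tok, 0) for tok in caption.lower().split()][:max_len]
    return idx + [0] * (max_len - len(idx))

captions = ["a man riding a horse on the beach",
            "two dogs playing with a ball"]
vocab = build_vocab(captions, n_words=9347)    # N in step 1-2-2
print(encode(captions[0], vocab, max_len=16))  # L = 16 in the embodiment
```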
Step (2): the attention encoding module based on non-aligned multi-view feature enhancement
Its structure is shown in Fig. 1, and it operates on the M non-aligned view features of the input. Each non-aligned view feature contains the visual features of the multiple object entities detected by its own detector. Each view feature is first passed through its own linear layer to reduce its dimensionality, and the results are fed into the multi-head attention module (Multi-Head Attention, MHA) for reconstruction; its structure is shown in Fig. 2 (a sketch of this standard module follows below). The reconstructed visual feature V^A is added to the input feature and layer normalization (Layer Normalization) is applied to obtain F, as shown in Fig. 1. F is fed into the feed-forward network (Feed-Forward Network, FFN), and the resulting output is added to F and layer-normalized again, giving the visual feature F_L that contains the non-aligned multi-view information.
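Fig. 2 is not reproduced here. As a reference point, the sketch below implements the standard scaled dot-product multi-head attention of the cited paper "Attention Is All You Need", which is what the MHA module denotes; the head count and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # MHA(Q, K, V): reconstruct the query features from the key/value features.
    def __init__(self, d=512, heads=8):
        super().__init__()
        assert d % heads == 0
        self.h, self.dk = heads, d // heads
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(d, d) for _ in range(4))

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project, then split into heads: (batch, heads, length, d_k).
        q, k, v = [w(x).view(b, -1, self.h, self.dk).transpose(1, 2)
                   for w, x in ((self.wq, q), (self.wk, k), (self.wv, v))]
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5  # scaled dot product
        if mask is not None:
            scores = scores.masked_fill(mask, float('-inf'))
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, -1, self.h * self.dk)
        return self.wo(out)

mha = MultiHeadAttention()
v1, v2 = torch.randn(1, 100, 512), torch.randn(1, 100, 512)  # two views, 100 objects each
v_a = mha(v1, v2, v2)  # reconstruct view 1 from view 2
```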
Step (3): constructing the attention decoder
Its structure is shown in Fig. 3. First, the description text is converted into word vectors according to the GloVe vocabulary. Since text generation must predict the next word from the words generated so far, the word vectors of the generated text are mapped to a high-dimensional space and passed through a long short-term memory network (Long Short-Term Memory, LSTM). Its output vector q is fed into a masked multi-head attention module (Multi-Head Attention (Mask), MHA (Mask)), which learns the relationships within q and reconstructs it. The reconstructed text feature q^A is added to the original text feature q to obtain the description text feature q̃, which contains the internal correlation information. The description text feature q̃ and the visual feature F_L obtained from the attention encoding module are then fed together into another multi-head attention module (Multi-Head Attention, MHA), which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into the feature F_q. As in the attention encoding module, F_q is added to the text feature q̃ and layer-normalized (Layer Normalization); the result is fed into the feed-forward network (Feed-Forward Network, FFN), and the output is again added to its input and layer-normalized (Layer Normalization) to obtain the feature G, which contains both visual and textual information. After G passes through a linear layer (Linear), a Softmax function produces the word probabilities, which serve as the predicted output of the network.
Step (4): model training
Compute the difference between the generated prediction and the actual description of the image, and train the model parameters of the neural networks in steps (2) and (3) with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Extract features from image i: using M existing deep neural network feature extractors, extract the M non-aligned multi-view features D, where D = {V_1, V_2, ..., V_M}. Each view feature contains the k objects of the image, i.e. V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048.
1-2. For the given description texts, first count the distinct words in the descriptions of the dataset, embed each word of the word list according to the GloVe word-vector matrix (thereby converting each word into a fixed-length word vector), and then feed the sequence into the LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)  (Formula 1)
where w_k is the word vector of the k-th word in the GloVe word-vector matrix and l denotes the length of the description text.
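A minimal PyTorch sketch of Formula 1, with a random matrix standing in for the real GloVe weights; the sizes (300-dimensional embeddings, 512-dimensional hidden state, 16-word captions) follow the embodiment described later.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, max_len = 9347, 300, 512, 16

glove = torch.randn(vocab_size + 1, emb_dim)  # stand-in for the GloVe matrix
embed = nn.Embedding.from_pretrained(glove, freeze=False, padding_idx=0)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

word_idx = torch.randint(1, vocab_size + 1, (2, max_len))  # two encoded captions
q, _ = lstm(embed(word_idx))  # Formula 1: q = LSTM(w_1, ..., w_l)
print(q.shape)                # torch.Size([2, 16, 512])
```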
The attention encoding module based on non-aligned multi-view feature enhancement described in step (2) is as follows:
2-1. First pass the M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} through separate linear layers for dimensionality reduction, the reduced dimension being d; then feed each of the reduced V_2, V_3, ..., V_M, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i),  i = 2, ..., M  (Formula 2)
2-2. Add V_i^A to the visual feature V_1 and apply layer normalization to the output to obtain F, with the following formula:
F = LayerNorm(V^A + V_1)  (Formula 3)
where V^A denotes the feature reconstructed in 2-1.
2-3. Feed F into the feed-forward network, add the resulting output to F again, and apply layer normalization to obtain the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)  (Formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))  (Formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d) are learned weight matrices.
2-4. Feed F_L as input into the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 such iterations in total, the final feature F_L^6 is obtained.
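The following sketch assembles steps 2-1 to 2-4 (Formulas 2 to 5) into one encoder block, using PyTorch's built-in multi-head attention for brevity; two views, d = 512, an FFN hidden size of 4d, and 8 heads are assumptions consistent with the embodiment below.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One enhancement block: cross-view MHA, residual + LayerNorm, FFN, residual + LayerNorm.
    def __init__(self, d=512, heads=8, p=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Dropout(p), nn.Linear(4 * d, d))

    def forward(self, v1, v2):
        v_a, _ = self.mha(v1, v2, v2)     # Formula 2: V^A = MHA(V_1, V_2, V_2)
        f = self.ln1(v_a + v1)            # Formula 3: residual + LayerNorm
        return self.ln2(self.ffn(f) + f)  # Formulas 4-5: FFN + residual + LayerNorm

blocks = nn.ModuleList(EncoderBlock() for _ in range(6))  # step 2-4: six iterations
v1, v2 = torch.randn(1, 100, 512), torch.randn(1, 100, 512)
f_l = v1
for blk in blocks:
    f_l = blk(f_l, v2)  # F_L feeds the next block (paired with V_2, per the embodiment)
```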
The attention decoder based on the multi-head attention module described in step (3) is constructed as follows:
3-1. Feed the description text feature q into the masked multi-head attention module, which reconstructs it into the description text feature q^A, with the following formula:
q^A = MHA_mask(q, q, q)  (Formula 6)
3-2. Add the reconstructed text description feature q^A to the original text description feature q and apply layer normalization to the output to obtain q̃, with the following formula:
q̃ = LayerNorm(q^A + q)  (Formula 7)
3-3. Feed the visual feature F_L containing the non-aligned multi-view information obtained in step (2), together with the q̃ obtained in 3-2, into the second multi-head attention module, which correlates q̃ with F_L to derive the reconstructed feature F_q, with the following formula:
F_q = MHA(q̃, F_L, F_L)  (Formula 8)
3-4. Add the feature F_q to the feature q̃ and apply layer normalization to the output to obtain F̃, with the following formula:
F̃ = LayerNorm(F_q + q̃)  (Formula 9)
3-5. Feed F̃ into the feed-forward network, add the resulting output to F̃, and apply layer normalization to obtain the feature G containing both visual and textual information, with the following formula:
G = LayerNorm(FFN(F̃) + F̃)  (Formula 10)
3-6. Feed G into the next attention decoder; after 6 such iterations in total, the final feature G_6 is obtained.
3-7. Pass G_6 through a linear layer and generate probabilities with the Softmax function; this probability output is the predicted output vector of the network:
p'' = Softmax(Linear(G_6))  (Formula 11)
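Analogously, a sketch of one attention decoder block covering steps 3-1 to 3-7 (Formulas 6 to 11), with a standard causal mask standing in for the mask of MHA (Mask); the sizes (16-word captions, 512 dimensions, a 9,347-word output layer) follow the embodiment, and 8 heads is an assumption.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Masked self-attention over the text, cross-attention over F_L, then an FFN.
    def __init__(self, d=512, heads=8, p=0.1):
        super().__init__()
        self.self_mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Dropout(p), nn.Linear(4 * d, d))

    def forward(self, q, f_l):
        t = q.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        q_a, _ = self.self_mha(q, q, q, attn_mask=causal)  # Formula 6
        q_t = self.ln1(q_a + q)                            # Formula 7
        f_q, _ = self.cross_mha(q_t, f_l, f_l)             # Formula 8
        f_t = self.ln2(f_q + q_t)                          # Formula 9
        return self.ln3(self.ffn(f_t) + f_t)               # Formula 10

decoder = nn.ModuleList(DecoderBlock() for _ in range(6))  # step 3-6: six iterations
g, f_l = torch.randn(1, 16, 512), torch.randn(1, 100, 512)
for blk in decoder:
    g = blk(g, f_l)
p = torch.softmax(nn.Linear(512, 9347)(g), dim=-1)         # Formula 11 (step 3-7)
```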
The model training described in step (4) is as follows:
Apply one-hot encoding (one-hot encoding) to the actual description text answer so that it can be compared with the predicted vector p'', then compute the loss with the cross-entropy loss function (Cross Entropy Loss). Let N be the size of the description text dictionary, let y denote the index of the actual description word, and let p'' denote the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = -Σ_{i=1}^{N} 1[i = y] log p''_i = -log p''_y  (Formula 12)
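A sketch of Formula 12 in PyTorch; note that F.cross_entropy consumes the ground-truth index y directly (equivalent to the one-hot target) and fuses the softmax of step 3-7 into the loss, so it is applied to the pre-softmax scores.

```python
import torch
import torch.nn.functional as F

N = 9347                                         # description dictionary size
logits = torch.randn(16, N, requires_grad=True)  # pre-softmax scores, one per word slot
y = torch.randint(0, N, (16,))                   # ground-truth word indices

loss = F.cross_entropy(logits, y)  # Formula 12: mean over positions of -log p''_y
loss.backward()                    # step (4): back-propagation
```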
The beneficial effects of the present invention are as follows:
The present invention models image-description data in a unified way, reasons over the object features in the image, and associates visual and textual information so that the image can be described more accurately. It first introduces non-aligned multi-view features and uses the multi-head attention module to perform cooperative reasoning over the object features of the image; combined with existing visual description techniques, this effectively improves both the accuracy of the image descriptions and the fluency of the description language.
Effects of the present invention: 1. A deep encoder-decoder model built by stacking modules that rely entirely on attention captures both the self-attention within each modality and the co-attention across modalities, resolving the first and second limitations. 2. Non-aligned multi-view image features with stronger expressive power are used, and their expressive power is further improved through the self-attention among the image features, resolving the third limitation.
The present invention has a small parameter count and is lightweight and efficient, which facilitates more efficient distributed training and deployment on special-purpose hardware with limited memory.
Brief description of the drawings
Fig. 1: architecture of the deep image captioning model based on non-aligned multi-view feature enhancement
Fig. 2: the multi-head attention module (Multi-Head Attention, MHA)
Fig. 3: the attention decoder
Specific embodiments
The detailed parameters of the present invention are further elaborated below.
As shown in Fig. 1, the present invention provides a deep neural network framework for the image captioning task.
The data preprocessing and feature extraction on the image and text described in step (1) proceed as follows:
1-1. For image feature extraction, the present invention uses the MS-COCO dataset as training and test data and extracts the visual features of two non-aligned views with an existing Faster R-CNN model based on ResNet-101 and a Faster R-CNN model based on ResNet-152. Specifically, the image data is fed into the two Faster R-CNN networks separately; each Faster R-CNN model detects and delineates 100 objects in the image, and a 2048-dimensional visual feature is extracted for the image region of each object, giving the visual features D, where D = {V_1, V_2}.
1-2. For the description texts, first count the distinct words appearing in them; all words whose frequency in the texts is higher than 5 and that appear in the vocabulary provided by GloVe are recorded in the dictionary, giving 9,347 words.
1-3. Keep only the first 16 words of each description sentence, padding with null characters if a sentence has fewer than 16 words. Each word is then replaced by its index in the word dictionary built in 1-2, completing the conversion from strings to numerical values, so that each description is converted into a vector of 16 word indices.
1-4. The 16-dimensional index vectors produced in 1-3 are converted, via word embedding, into the corresponding word vectors of the GloVe dictionary matrix; the word-vector size used is 300, so each description text becomes a 16 × 300 matrix. The word vector of each time step is then used as the input of the LSTM, a recurrent neural network structure, whose output is set to a 16 × 512-dimensional vector q.
The attention encoding module of step (2), based on non-aligned multi-view feature enhancement, proceeds as follows:
2-1. First reduce the dimensionality of the input non-aligned multi-view visual features D: the dimension of each object in each view is reduced from 2048 to 512, so each view feature after reduction has size 100 × 512.
2-2. The features V_1 and V_2 of the multi-view features D are fed into the multi-head attention module, which correlates them and reconstructs, via Formula (2), the 512-dimensional V^A.
2-3. The reconstructed 100 × 512-dimensional V^A is added to the 100 × 512-dimensional V_1, and the output passes through layer normalization to give the feature F of size 100 × 512.
2-4. The feature F passes through the feed-forward neural network: a first linear layer maps F from 512 to 2048 dimensions, a second linear layer maps it back to 512 dimensions, and the result is added to F and layer-normalized, giving the visual feature F_L containing the multi-view information, whose size is still 100 × 512.
2-5. The F_L obtained in 2-4 and V_2 are fed together into the next attention encoding module; after 6 iterations in total, the final F_L^6 is obtained.
The construction of the attention decoder described in step (3) proceeds as follows:
3-1. The 16 × 512-dimensional description text feature q obtained in 1-4 is fed into the multi-head attention module, which correlates it and reconstructs, via Formula (6), the 16 × 512-dimensional q^A.
3-2. The reconstructed 16 × 512-dimensional q^A is added to the 16 × 512-dimensional q, and the output passes through layer normalization to give the feature q̃ of size 16 × 512.
3-3. The 100 × 512-dimensional visual feature F_L^6 containing the multi-view information obtained in 2-5 and the 16 × 512-dimensional description text feature q̃ are fed into the second multi-head attention module, which correlates q̃ with F_L^6 to derive the 16 × 512-dimensional reconstructed feature F_q.
3-4. The reconstructed 16 × 512-dimensional F_q is added to the 16 × 512-dimensional q̃, and the output passes through layer normalization to give the feature F̃ of size 16 × 512.
3-5. The feature F̃ passes through the feed-forward neural network: a first linear layer maps F̃ to 2048 dimensions, a second linear layer maps it back to 512 dimensions, and the result is added to F̃ and layer-normalized, giving the feature G containing both visual and textual information, whose size is still 16 × 512.
3-6. G is fed into the next attention decoder; after 6 iterations in total, the final feature G_6 is obtained.
3-7. The G_6 generated above passes through a linear layer and a softmax operation in turn, finally outputting a 9,347-dimensional word prediction vector in which each element represents the probability that the word indexed by that element is the correct word.
The model training described in step (4) proceeds as follows:
The 9,347-dimensional prediction vector generated in step (3) is compared with the correct word of the description: the defined loss function computes the difference between the predicted value and the actual ground truth to form the loss value, and the parameters of the whole network are adjusted with the back-propagation (BP) algorithm according to this loss value, so that the gap between the network's predictions and the actual values gradually shrinks until the network converges.

Claims (5)

1. An image captioning method based on non-aligned multi-view feature enhancement, characterized by comprising the following steps:
Step (1): data preprocessing, extracting features from the image and text data;
Image preprocessing: detect the object entities contained in the image with multiple deep object-detection models and extract the visual features X;
Text preprocessing:
First, count the lengths of the description texts and determine the maximum length L of the generated descriptions;
Second, tokenize the texts, keep the N most frequent words, build the description text dictionary from these N words, and replace each word of a description with its index in the dictionary, thereby converting the description text into a vector; finally, each description becomes a vector of size L;
Step (2): attention encoding module based on non-aligned multi-view feature enhancement
For the M non-aligned view features of the input: each view feature contains the visual features of the multiple object entities detected by its own detector; each view feature is passed through its own linear layer to reduce its dimensionality and is then fed into the multi-head attention module for reconstruction; the reconstructed visual feature V^A is added to the input and normalized to obtain F; F is fed into the feed-forward network, and the resulting output is added to F and normalized again, giving the visual feature F_L containing the non-aligned multi-view information;
Step (3): constructing the attention decoder
First convert the description text into word vectors according to the GloVe vocabulary; since text generation must predict the next word from the words generated so far, the word vectors of the generated text are mapped to a high-dimensional space and passed through a long short-term memory network, whose output vector q is fed into a masked multi-head attention module that learns the relationships within q and reconstructs it; the reconstructed text feature q^A is added to the original vector q to obtain the description text feature q̃ containing the internal correlation information; the description text feature q̃ and the visual feature F_L of the multi-view information obtained by the attention encoding module are fed together into another multi-head attention module, which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into the feature F_q; the feature F_q is then added to the description text feature q̃ and normalized to obtain a feature output, which is fed into the feed-forward network; the resulting output is again added to its input and normalized to obtain the feature G containing both visual and textual information; after G passes through a linear layer, a Softmax function produces the probabilities, which serve as the predicted output of the network;
Step (4): model training
Compute the difference between the generated prediction and the actual description of the image, and train the model parameters of the neural networks in steps (2) and (3) with the back-propagation algorithm until the whole network model converges.
2. The image captioning method based on non-aligned multi-view feature enhancement according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Image preprocessing: extract features from image i using M existing deep neural network feature extractors to obtain the M view features D, where D = {V_1, V_2, ..., V_M}; each view feature contains the k objects of the image, i.e. V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048;
1-2. Preprocess the given text data:
First count the distinct words in the descriptions of the dataset, embed each word of the word list according to the GloVe word-vector matrix (thereby converting each word into a fixed-length word vector), and then feed the sequence into the LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)  (Formula 1)
where w_k is the word vector of the k-th word in the GloVe word-vector matrix and l denotes the length of the description text.
3. The image captioning method based on non-aligned multi-view feature enhancement according to claim 2, characterized in that the attention encoding module based on non-aligned multi-view feature enhancement described in step (2) is as follows:
2-1. First pass the M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} through separate linear layers for dimensionality reduction, the reduced dimension being d; then feed each of the reduced V_2, V_3, ..., V_M, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i),  i = 2, ..., M  (Formula 2)
2-2. Add V_i^A to V_1 and apply layer normalization to the output to obtain F, with the following formula:
F = LayerNorm(V^A + V_1)  (Formula 3)
2-3. Feed F into the feed-forward network, add the resulting output to F, and apply layer normalization to obtain the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)  (Formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))  (Formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d) are learned weight matrices;
2-4. Feed the newly obtained F_L as input into the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 iterations in total, the final feature F_L^6 is obtained.
4. The image captioning method based on non-aligned multi-view feature enhancement according to claim 3, characterized in that the attention decoder based on the multi-head attention module described in step (3) is as follows:
3-1. Feed the description text feature q into the masked multi-head attention module, which reconstructs it into the description text feature q^A, with the following formula:
q^A = MHA_mask(q, q, q)  (Formula 6)
3-2. Add the reconstructed text description feature q^A to the original text description vector q and apply layer normalization to the output to obtain q̃, with the following formula:
q̃ = LayerNorm(q^A + q)  (Formula 7)
3-3. Feed the visual feature F_L containing the non-aligned multi-view information obtained in step (2), together with the q̃ obtained in step 3-2, into the second multi-head attention module, which correlates q̃ with F_L to obtain the reconstructed feature F_q, with the following formula:
F_q = MHA(q̃, F_L, F_L)  (Formula 8)
3-4. Add the feature F_q to the feature q̃ and apply layer normalization to the output to obtain F̃, with the following formula:
F̃ = LayerNorm(F_q + q̃)  (Formula 9)
3-5. Feed F̃ into the feed-forward network, add the resulting output to F̃, and apply layer normalization to obtain the feature G containing both visual and textual information, with the following formula:
G = LayerNorm(FFN(F̃) + F̃)  (Formula 10)
3-6. Feed G into the next attention decoder; after 6 iterations in total, the final feature G_6 is obtained;
3-7. Pass G_6 through a linear layer and generate probabilities with the Softmax function, this probability output being the predicted output vector of the network;
p'' = Softmax(Linear(G_6))  (Formula 11).
5. The image captioning method based on non-aligned multi-view feature enhancement according to claim 4, characterized in that the model training described in step (4) is as follows:
Apply one-hot encoding to the actual description text answer so that it can be compared with the predicted vector p'', then compute the loss with the cross-entropy loss function; let N be the size of the description text dictionary, let y denote the index of the actual description word, and let p'' denote the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = -Σ_{i=1}^{N} 1[i = y] log p''_i = -log p''_y  (Formula 12)
CN201910615360.5A 2019-07-09 2019-07-09 An image captioning method based on non-aligned multi-view feature enhancement Withdrawn CN110516530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910615360.5A CN110516530A (en) 2019-07-09 2019-07-09 An image captioning method based on non-aligned multi-view feature enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910615360.5A CN110516530A (en) 2019-07-09 2019-07-09 An image captioning method based on non-aligned multi-view feature enhancement

Publications (1)

Publication Number Publication Date
CN110516530A (en) 2019-11-29

Family

ID=68622410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910615360.5A Withdrawn CN110516530A (en) An image captioning method based on non-aligned multi-view feature enhancement

Country Status (1)

Country Link
CN (1) CN110516530A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109906460A (en) * 2016-11-04 2019-06-18 易享信息技术有限公司 Dynamic cooperation attention network for question and answer
CN108614815A (en) * 2018-05-07 2018-10-02 华东师范大学 Sentence exchange method and device
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) *
JAY ALAMMAR: "The Illustrated Transformer", blog post *
PETER ANDERSON et al.: "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering", CVPR 2018 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832504A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Space information intelligent integrated generation method for satellite in-orbit application
CN112200031A (en) * 2020-09-27 2021-01-08 上海眼控科技股份有限公司 Network model training method and equipment for generating image corresponding word description
CN113139378A (en) * 2021-03-18 2021-07-20 杭州电子科技大学 Image description method based on visual embedding and condition normalization
CN113139378B (en) * 2021-03-18 2022-02-18 杭州电子科技大学 Image description method based on visual embedding and condition normalization
CN113283248A (en) * 2021-04-29 2021-08-20 桂林电子科技大学 Automatic natural language generation method and device for scatter diagram description
CN113283248B (en) * 2021-04-29 2022-06-21 桂林电子科技大学 Automatic natural language generation method and device for scatter diagram description
CN113191357A (en) * 2021-05-18 2021-07-30 中国石油大学(华东) Multilevel image-text matching method based on graph attention network
CN113657478B (en) * 2021-08-10 2023-09-22 北京航空航天大学 Three-dimensional point cloud visual positioning method based on relational modeling
CN113657478A (en) * 2021-08-10 2021-11-16 北京航空航天大学 Three-dimensional point cloud visual positioning method based on relational modeling
CN114693940A (en) * 2022-03-22 2022-07-01 电子科技大学 Image description method for enhancing feature mixing resolvability based on deep learning
CN114693940B (en) * 2022-03-22 2023-04-28 电子科技大学 Image description method with enhanced feature mixing decomposability based on deep learning
CN114913403A (en) * 2022-07-18 2022-08-16 南京信息工程大学 Visual question-answering method based on metric learning
CN115601553A (en) * 2022-08-15 2023-01-13 杭州联汇科技股份有限公司(Cn) Visual model pre-training method based on multi-level picture description data
CN115601553B * 2022-08-15 2023-08-18 Hangzhou Lianhui Technology Co., Ltd. Visual model pre-training method based on multi-level picture description data
CN116484905A (en) * 2023-06-20 2023-07-25 合肥高维数据技术有限公司 Deep neural network model training method for non-aligned samples
CN116484905B (en) * 2023-06-20 2023-08-29 合肥高维数据技术有限公司 Deep neural network model training method for non-aligned samples


Legal Events

  • PB01: Publication
  • SE01: Entry into force of request for substantive examination
  • WW01: Invention patent application withdrawn after publication (application publication date: 2019-11-29)