CN110516530A - Image captioning method based on non-aligned multi-view feature enhancement - Google Patents
Image captioning method based on non-aligned multi-view feature enhancement
- Publication number
- CN110516530A (application CN201910615360.5A)
- Authority
- CN
- China
- Prior art keywords
- feature
- text
- description
- image
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an image captioning method based on non-aligned multi-view feature enhancement. The method comprises the following steps: 1. preprocess the image data and its natural-language description text; 2. build an attention encoding module based on non-aligned multi-view feature enhancement, which reconstructs the object-level visual features of each view; 3. build a description decoder based on multi-head attention (MHA); 4. train the model, learning the neural-network parameters with the back-propagation algorithm. The invention proposes a deep neural network for image captioning. In particular, it models image-description text pairs in a unified way, reasons over the object features of each non-aligned view of the image, and reconstructs the visual feature of each object, so that the image can be described more accurately; the method achieves good results in the image captioning field.
Description
Technical field
The present invention relates to a deep neural network architecture for the image captioning (Image Captioning) task. In particular, it relates to a method that models image-description pairs in a unified way and infers the correlations among the object features in the multiple views of an image, so that the image can be described more accurately.
Background art
Recent progress in deep learning has advanced both computer vision and natural language processing. These advances have made it possible to connect vision and language, and have promoted multi-modal, cross-media learning tasks such as image-text matching (Image-Text Matching), visual question answering (Visual Question Answering, VQA), visual grounding (Visual Grounding), and image captioning (Image Captioning).
Image captioning aims to automatically describe the content of an image with natural-language sentences. The task is challenging because it requires identifying the key objects in an image and understanding the relationships between them. Research on image captioning can be divided into three classes: template-based methods, retrieval-based methods, and generation-based methods. Template-based methods use a two-stage strategy to solve the task: 1) align sentence fragments (e.g., subject, object, and verb) with labels predicted from the image; 2) generate a sentence from the fragments using predefined language templates. One line of work uses a conditional random field (CRF) model to predict labels from the detected objects, attributes, and prepositions, and then generates the description sentence by filling the template blanks with the most probable labels. Yang et al. use a Hidden Markov Model (HMM) to select the optimal objects and verbs for generating the description. To alleviate the diversity problem, retrieval-based methods were proposed, which search a large-scale description database for the descriptions most related to a given image in terms of cross-modal similarity. Karpathy et al. propose a deep fragment embedding method that matches image-description pairs according to the connections between visual fragments (detected objects) and description fragments (subjects, objects, and verbs). At test time, cross-modal retrieval over the entire description database is performed to generate the description of an image. However, when the description database is very large, retrieval efficiency becomes the bottleneck of these methods, and limiting the size of the database may reduce description diversity. Different from the template-based and retrieval-based models, generation-based models aim to learn a language model that can generate novel descriptions with more flexible syntactic structures. To this end, recent work explores this direction by introducing neural networks for image captioning.
Due to their flexibility and outstanding performance, generation-based models have become the mainstream of image captioning. The most successful image captioning methods use an encoder-decoder (Encoder-Decoder) framework, inspired by the sequence-to-sequence models of machine translation. The framework consists of an image encoder based on a convolutional neural network (CNN), which extracts region-based visual features from the input image, and a description decoder based on a recurrent neural network (RNN), which iteratively generates the output words from the visual features. Encoder-decoder models are usually trained end to end to minimize a cross-entropy loss. Based on this framework, much recent work has made further improvements to image captioning models. For example, attention mechanisms can be seamlessly inserted into the framework to establish fine-grained connections between the output words and their associated image regions. To better understand the objects in an image, region-based bottom-up attention (Bottom-up-attention) features can be extracted from a pre-trained object detector to replace the traditional CNN convolutional features. To address the exposure bias (Exposure Bias) of descriptions generated with the cross-entropy loss, algorithms based on reinforcement learning (Reinforcement Learning, RL) have been designed to directly optimize the non-differentiable evaluation metrics (e.g., BLEU and CIDEr).
Although existing methods have achieved success, they have the following three limitations: (1) the current attention mechanisms in image captioning only model the inter-modal interaction (i.e., object-to-word co-attention) and ignore self-attention, the intra-modal interaction (i.e., word-to-word and object-to-object); (2) current image captioning models are shallow and their computational dimensionality is relatively low, so they may not fully understand the complex relationships among visual objects; (3) the region-based visual features of a single view may not cover all the objects in an image, resulting in a visual representation insufficient to generate an accurate description.
Summary of the invention
The present invention provides an image captioning method based on non-aligned multi-view feature enhancement: a deep neural network architecture for the image captioning (Image Captioning) task. The technical solution adopted by the present invention to solve its technical problem includes the following steps:
Step (1), data preprocessing: extract features from the image and text data.
1-1. Image preprocessing:
Detect the object entities contained in the image with multiple deep object-detection models, and extract the visual features X.
1-2. Text preprocessing:
1-2-1. Count the lengths of the description texts and determine the maximum length L of the generated description.
1-2-2. Segment the words, keep the N most frequent words, build a description dictionary from these N words, and replace each word of the description with its index in the dictionary, thereby converting the description text into a vector, finally of size L.
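As an illustrative sketch (not the patent's actual code), the dictionary building and index conversion of step 1-2 could look like the following; the helper names `build_dictionary` and `text_to_indices`, the pad index 0, and the toy corpus are assumptions:

```python
from collections import Counter

def build_dictionary(descriptions, n_words):
    # Keep the N most frequent words; index 0 is reserved for padding.
    counts = Counter(w for d in descriptions for w in d.split())
    vocab = [w for w, _ in counts.most_common(n_words)]
    return {w: i + 1 for i, w in enumerate(vocab)}

def text_to_indices(description, dictionary, max_len):
    # Replace each word with its dictionary index, truncate/pad to max_len (L).
    idx = [dictionary.get(w, 0) for w in description.split()[:max_len]]
    return idx + [0] * (max_len - len(idx))

corpus = ["a man rides a horse", "a dog runs on grass"]
dictionary = build_dictionary(corpus, n_words=8)
vec = text_to_indices("a man runs", dictionary, max_len=16)
print(len(vec))  # 16: a fixed-size vector of length L
```

The fixed length makes every description batchable regardless of its original word count.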
Step (2), attention encoding module based on non-aligned multi-view feature enhancement.
Its structure is shown in Figure 1. For the M non-aligned view features of the input, each view feature contains the visual features of the object entities detected by the corresponding detector. Each view feature is first passed through its own linear layer to reduce its dimensionality, and the results are input into a multi-head attention module (Multi-Head Attention, MHA) to be reconstructed; the structure of the module is shown in Figure 2. The reconstructed visual feature V^A is added to the input and layer normalization (Layer Normalization) is applied to obtain F, as shown in Figure 1. F is input into a feed-forward network (Feed-Forward Network, FFN); the output is again added to F and layer-normalized, yielding the visual feature F_L that contains the non-aligned multi-view information.
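A minimal NumPy sketch of one encoder cycle of step (2), using single-head dot-product attention as a stand-in for the full multi-head module and omitting the learned attention projections and dropout; the toy dimension `d = 8` and object count `k = 5` are assumptions, not the patent's 512-dimensional, 100-object setting:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_cycle(v1, v2, w1, w2):
    # Reconstruct V1 against view V2, residual + LayerNorm,
    # then FFN with residual + LayerNorm, as described in step (2).
    va = attention(v1, v2, v2)
    f = layer_norm(va + v1)
    ffn = np.maximum(w1 @ f.T, 0.0)          # ReLU(W_F1 x^T)
    fl = layer_norm((w2 @ ffn).T + f)        # W_F2 (...) + residual
    return fl

rng = np.random.default_rng(0)
k, d = 5, 8                                  # 5 objects per view, toy dim 8
v1, v2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
w1, w2 = rng.normal(size=(4 * d, d)), rng.normal(size=(d, 4 * d))
print(encoder_cycle(v1, v2, w1, w2).shape)   # (5, 8): same shape as V1
```

The residual-plus-normalization pattern keeps the output in the same shape as the first view, so the cycle can be stacked (six times in the embodiment).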
Step (3), build the description decoder.
Its structure is shown in Figure 3. The description text is first converted into word vectors according to the GloVe vocabulary. Since text generation needs to predict the next word from the words generated so far, the generated word vectors are mapped to a high-dimensional space and passed through a Long Short-Term Memory network (Long Short-Term Memory, LSTM). Its output vector q is input into a masked multi-head attention module (Multi-Head Attention (Mask), MHA (Mask)) that learns the relationships within q and reconstructs it. The reconstructed text feature q^A is added to the original text feature q to obtain the description text feature q~ that contains the internal correlation information. The description text feature q~ and the visual feature F_L of the multi-view information obtained by the attention encoding module are then input together into another multi-head attention module (Multi-Head Attention, MHA), which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into a new feature F_q. Similar to the attention encoding module, F_q is added to the text feature q~ and layer normalization (Layer Normalization) is applied; the result is input into a feed-forward network (Feed-Forward Network, FFN), whose output is again added and layer-normalized to obtain the feature G that contains both the visual and the textual information. After passing G through a linear layer (Linear), a probability distribution is generated with the Softmax function, and this probability output is taken as the predicted output of the network.
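The cross-modal step of the decoder — text positions querying the multi-view visual feature F_L — can be sketched as follows, again with single-head dot-product attention as a stand-in for MHA and toy shapes as assumptions:

```python
import numpy as np

def cross_attention(q_text, f_visual):
    # Each of the l text positions queries the k visual objects,
    # producing a reconstructed feature F_q with the text's shape.
    scores = q_text @ f_visual.T / np.sqrt(q_text.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # rows sum to 1 over objects
    return w @ f_visual

rng = np.random.default_rng(1)
q_tilde = rng.normal(size=(16, 8))           # 16 text positions, toy dim 8
f_l = rng.normal(size=(100, 8))              # 100 visual objects
f_q = cross_attention(q_tilde, f_l)
print(f_q.shape)                             # (16, 8): text-shaped output
```

Because the attention weights form a convex combination over the visual objects, each output row stays within the range of the visual features it attends to.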
Step (4), model training.
According to the difference between the generated prediction and the actual description of the image, the model parameters of the neural networks in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
Step (1) is implemented as follows:
1-1. Feature extraction is performed on image i with M existing deep neural network feature extractors, which respectively extract the M non-aligned multi-view features D, where D = {V_1, V_2, ..., V_M}; each view feature contains k objects of the image, with V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048.
1-2. For a given description text, first count the distinct words in the description corpus; embed the words of the vocabulary according to the GloVe word-vector matrix, converting each word into a fixed-length word vector; then input the vectors into an LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)   (formula 1)
where w_k is the word vector corresponding to the k-th word in the GloVe word-vector matrix, and l denotes the length of the description text.
The attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} are first passed through different linear layers for dimensionality reduction; the reduced dimension is d. The reduced features V_2, V_3, ..., V_M are then each input, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i)   (formula 2)
2-2. V_i^A is added to the visual feature V_1, and the output is layer-normalized to obtain F, with the following formula:
F = LayerNorm(V_i^A + V_1)   (formula 3)
2-3. F is input into the feed-forward network; the output is again added to F, and layer normalization yields the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)   (formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))   (formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d).
2-4. F_L is again taken as input to the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 such cycles in total, the final feature is obtained.
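Formula (5)'s feed-forward network can be sketched in NumPy as below; the expansion factor 4 (512 → 2048 → 512 in the embodiment) follows the text, while the weight initialisation and the identity behaviour of dropout at inference are assumptions:

```python
import numpy as np

def ffn(x, w1, w2, drop_p=0.0, rng=None):
    # FFN(x) = W_F2 · Dropout(ReLU(W_F1 · x^T))  (formula 5)
    h = np.maximum(w1 @ x.T, 0.0)            # ReLU(W_F1 x^T), shape (4d, n)
    if drop_p > 0.0 and rng is not None:     # inverted dropout (training only)
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    return (w2 @ h).T                        # back to shape (n, d)

d = 8
rng = np.random.default_rng(2)
w1 = rng.normal(size=(4 * d, d)) * 0.1       # W_F1: d -> 4d
w2 = rng.normal(size=(d, 4 * d)) * 0.1       # W_F2: 4d -> d
x = rng.normal(size=(5, d))
print(ffn(x, w1, w2).shape)                  # (5, 8)
```

With `drop_p=0.0` the function is deterministic, matching inference-time behaviour.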
Building the MHA-based description decoder in step (3) is as follows:
3-1. The description text feature q is input into the masked multi-head attention module to obtain the reconstructed description text feature q^A, with the following formula:
q^A = MHA_mask(q, q, q)   (formula 6)
3-2. The reconstructed text feature q^A is added to the original text feature q, and the output is layer-normalized to obtain q~, with the following formula:
q~ = LayerNorm(q^A + q)   (formula 7)
3-3. The visual feature F_L containing the non-aligned multi-view information obtained in step (2) and the q~ obtained in 3-2 are input into the second multi-head attention module, which associates q~ with F_L to infer the reconstructed feature F_q, with the following formula:
F_q = MHA(q~, F_L, F_L)   (formula 8)
3-4. The feature F_q is added to the feature q~, and the output is layer-normalized to obtain G~, with the following formula:
G~ = LayerNorm(F_q + q~)   (formula 9)
3-5. G~ is input into the feed-forward network; the output is again added to G~ and layer-normalized to obtain the feature G containing both visual and textual information, with the following formula:
G = LayerNorm(FFN(G~) + G~)   (formula 10)
3-6. G is again input into the next description decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. G_6 is passed through a linear layer, the Softmax function generates the probabilities, and this probability output is taken as the predicted vector of the network:
p″ = softmax(Linear(G_6))   (formula 11)
The model training in step (4) is as follows:
The actual description text answer is one-hot encoded (one-hot encoding) so that it can be compared with the predicted vector p″, and the loss is computed with the cross-entropy loss (Cross Entropy Loss) function. Let N be the size of the description dictionary, y the index of the actual description word, and p″ the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = −Σ_{i=1..N} 1[i = y] log p″_i = −log p″_y   (formula 12)
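With a one-hot target, the cross-entropy loss of step (4) reduces to the negative log-probability of the correct index. A minimal sketch (the tiny vocabulary of N = 4 and the toy logits are assumptions):

```python
import numpy as np

def cross_entropy(p_pred, y_index):
    # One-hot target: loss = -sum_i y_i * log(p_i) = -log(p[y]).
    return -np.log(p_pred[y_index])

logits = np.array([1.0, 3.0, 0.5, 2.0])
p = np.exp(logits - logits.max())
p /= p.sum()                                 # softmax over N = 4 "words"
loss = cross_entropy(p, y_index=1)
print(round(float(loss), 4))                 # smallest when y is the most probable word
```

Minimizing this loss pushes probability mass toward the actual description word at each position.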
The beneficial effects of the present invention are as follows:
The present invention models image-description pairs in a unified way, reasons over the object features of the image, and associates the visual and textual information, so that the image can be described more accurately. The present invention first introduces non-aligned multi-view features and uses multi-head attention modules to perform cooperative reasoning over the object features of the image; combined with existing visual description techniques, it can effectively improve both the accuracy and the language fluency of image captioning.
Effects of the present invention: 1. A deep encoder-decoder model stacked entirely from attention modules captures both the self-attention within each modality and the co-attention across different modalities, solving the first and second limitations. 2. Non-aligned multi-view image features with stronger expressive power are used, and the self-attention among the image features further supplements and improves their expressive power, solving the third limitation.
The present invention has a small number of parameters; it is lightweight and efficient, which facilitates more efficient distributed training and deployment on memory-limited specific hardware.
Detailed description of the drawings
Fig. 1: the deep image captioning model architecture based on non-aligned multi-view feature enhancement
Fig. 2: the multi-head attention module (Multi-Head Attention, MHA)
Fig. 3: the description decoder
Specific embodiments
The detailed parameters of the invention are further elaborated below.
As shown in Figure 1, the present invention provides a deep neural network framework for image captioning (Image Captioning).
The data preprocessing and feature extraction for images and text in step (1) is as follows:
1-1. For the image data, the present invention uses the MS-COCO dataset as training and test data, and uses an existing Faster R-CNN model based on ResNet-101 and a Faster R-CNN model based on ResNet-152 to extract the visual features of two non-aligned views. Specifically, the image data is input separately into the two Faster R-CNN networks; each Faster R-CNN model detects and outlines 100 objects in the image and extracts a 2048-dimensional visual feature for each object, yielding the visual features D, where D = {V_1, V_2}.
1-2. For the description texts, the distinct words in the description corpus are counted first; all words whose corpus frequency is higher than 5 and that appear in the dictionary provided by GloVe — 9347 words in total — are recorded in the dictionary.
1-3. Only the first 16 words of each description sentence are taken; if a sentence has fewer than 16 words, it is padded with null characters. Each word is replaced by its index in the word dictionary generated in 1-2, completing the conversion from strings to numerical values, so that each description is converted into a vector of 16 word indices.
1-4. For the 16-dimensional index vector generated in 1-3, word embedding converts each word index into the corresponding word vector of the GloVe dictionary matrix; the word-vector size used is 300. Each description text therefore becomes a 16 × 300 matrix. The word vector of each time step is then used as the input of the LSTM, a recurrent neural network structure, whose output is the vector q of size 16 × 512.
The attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The input non-aligned multi-view visual features D are first reduced in dimensionality: the dimension of each object in each view drops from 2048 to 512, so each reduced view feature has size 100 × 512.
2-2. The features V_1 and V_2 of the multi-view feature D are input into the multi-head attention module for association, and reconstruction by formula (2) yields the 512-dimensional V^A.
2-3. The reconstructed 100 × 512 V^A is added to the 100 × 512 V_1; the output is layer-normalized to obtain the feature F of size 100 × 512.
2-4. F is passed through the feed-forward neural network: inside it, a first linear layer maps F from 512 to 2048 dimensions, and a second linear layer maps it back; after adding the result to F, layer normalization yields the visual feature F_L containing the multi-view information, whose size is still 100 × 512.
2-5. The F_L obtained in 2-4 is input, together with V_2, into the next attention encoding module; after 6 cycles in total, the final visual feature is obtained.
Building the description decoder in step (3) is as follows:
3-1. The 16 × 512 description text feature q obtained in 1-4 is input into the masked multi-head attention module for association, and reconstruction by formula (6) yields the 16 × 512 q^A.
3-2. The reconstructed 16 × 512 q^A is added to the 16 × 512 q; the output is layer-normalized to obtain the feature q~ of size 16 × 512.
3-3. The 100 × 512 visual feature containing the multi-view information obtained in 2-5 and the 16 × 512 description text feature q~ are input into the second multi-head attention module, which associates q~ with the visual feature to infer the 16 × 512 reconstructed feature F_q.
3-4. The reconstructed 16 × 512 F_q is added to the 16 × 512 q~; the output is layer-normalized to obtain the feature G~ of size 16 × 512.
3-5. The feature G~ is passed through the feed-forward neural network: inside it, a first linear layer maps G~ to 2048 dimensions, and a second linear layer maps it back to 512 dimensions; after adding the result to G~, layer normalization yields the feature G containing both visual and textual information, whose size is still 16 × 512.
3-6. G is again input into the next description decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. The G_6 generated above is passed through a linear layer and a softmax operation in turn; the final output is a 9347-dimensional word-prediction vector, in which each element represents the predicted probability that the word with that index is the correct word.
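Reading the predicted word out of the 9347-dimensional softmax output of step 3-7 amounts to an argmax over the probability vector; a toy sketch (the 4-word vocabulary and the scores are assumptions standing in for the real 9347-word dictionary):

```python
import numpy as np

def predict_word(logits, index_to_word):
    # Softmax turns the linear-layer scores into probabilities;
    # the most probable index names the predicted word.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    i = int(np.argmax(p))
    return index_to_word[i], float(p[i])

vocab = {0: "a", 1: "man", 2: "horse", 3: "rides"}
word, prob = predict_word(np.array([0.2, 2.5, 0.1, 0.4]), vocab)
print(word)  # "man": the highest-scoring index
```

At generation time this lookup is repeated position by position until the sentence is complete.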
The model training in step (4) is as follows:
The 9347-dimensional prediction vector generated in step (3) is compared with the correct word of the description; the defined loss function computes the difference between the predicted value and the actual correct value to form the loss, and according to this loss the back-propagation (BP) algorithm adjusts the parameter values of the whole network, so that the gap between the predictions generated by the network and the actual values gradually decreases, until the network converges.
Claims (5)
1. An image captioning method based on non-aligned multi-view feature enhancement, characterized by comprising the following steps:
Step (1), data preprocessing: extract features from the image and text data;
Image preprocessing: detect the object entities contained in the image with multiple deep object-detection models and extract the visual features X;
Text preprocessing:
First, count the lengths of the description texts and determine the maximum length L of the generated description;
Second, segment the words, keep the N most frequent words, build a description dictionary from these N words, and replace each word of the description with its index in the dictionary, thereby converting the description text into a vector, finally of size L;
Step (2), attention encoding module based on non-aligned multi-view feature enhancement:
For the M non-aligned view features of the input, each view feature contains the visual features of the object entities detected by the corresponding detector; each view feature is passed through its own linear layer to reduce its dimensionality, and the results are input into a multi-head attention module to be reconstructed; the reconstructed visual feature V^A is added to the input and normalized to obtain F; F is input into a feed-forward network, and the output is again added to F and normalized to obtain the visual feature F_L containing the non-aligned multi-view information;
Step (3), build the description decoder:
The description text is first converted into word vectors according to the GloVe vocabulary; since text generation needs to predict the next word from the words generated so far, the generated word vectors are mapped to a high-dimensional space and passed through a Long Short-Term Memory network; its output vector q is input into a masked multi-head attention module that learns the relationships within q and reconstructs it; the reconstructed text feature q^A is added to the original vector q to obtain the description text feature q~ containing the internal correlation information; the description text feature q~ and the visual feature F_L obtained by the attention encoding module are input together into another multi-head attention module, which learns the correspondence between the text feature and the visual feature and, according to this correspondence, reconstructs the multi-view visual feature F_L into a new feature F_q; the feature F_q is then added to the description text feature q~ and normalized, the result is input into a feed-forward network, and the output is again added and normalized to obtain the feature G containing both visual and textual information; after passing G through a linear layer, the Softmax function generates the probabilities, and this probability output is taken as the predicted output of the network;
Step (4), model training:
According to the difference between the generated prediction and the actual description of the image, the model parameters of the neural networks in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image captioning method based on non-aligned multi-view feature enhancement according to claim 1, characterized in that step (1) is implemented as follows:
1-1. Image preprocessing: feature extraction is performed on image i with M existing deep neural network feature extractors, which respectively extract the M view features D, where D = {V_1, V_2, ..., V_M}; each view feature contains k objects of the image, with V_i = {v_1, v_2, ..., v_k}, and the visual vector of a single object is v_j ∈ R^2048;
1-2. Text preprocessing for the given data: first count the distinct words of the description corpus; embed the words of the vocabulary according to the GloVe word-vector matrix, converting each word into a fixed-length word vector; then input the vectors into an LSTM model to obtain the description text feature, with the following formula:
q = LSTM(w_1, w_2, ..., w_l)   (formula 1)
where w_k is the word vector corresponding to the k-th word in the GloVe word-vector matrix, and l denotes the length of the description text.
3. The image captioning method based on non-aligned multi-view feature enhancement according to claim 2, characterized in that the attention encoding module based on non-aligned multi-view feature enhancement in step (2) is as follows:
2-1. The M input non-aligned multi-view features D = {V_1, V_2, ..., V_M} are first passed through different linear layers for dimensionality reduction; the reduced dimension is d; the reduced features V_2, V_3, ..., V_M are then each input, together with V_1, into the multi-head attention module to obtain V_i^A, with the following formula:
V_i^A = MHA(V_1, V_i, V_i)   (formula 2)
2-2. V_i^A is added to the visual feature V_1, and the output is layer-normalized to obtain F, with the following formula:
F = LayerNorm(V_i^A + V_1)   (formula 3)
2-3. F is input into the feed-forward network; the output is again added to F, and layer normalization yields the visual feature F_L containing the multi-view information, with the following formulas:
F_L = LayerNorm(FFN(F) + F)   (formula 4)
FFN(x) = W_F2 Dropout(ReLU(W_F1 x^T))   (formula 5)
where W_F1 ∈ R^(4d×d) and W_F2 ∈ R^(d×4d);
2-4. The newly obtained F_L is again taken as input to the next attention encoding module based on non-aligned multi-view feature enhancement; after 6 cycles in total, the final feature is obtained.
4. a kind of image answering method based on the enhancing of non-alignment multiple view feature according to claim 3, feature exist
The dramatic decoder of power module is paid attention to described in step (3) based on multiterminal, specific as follows:
3-1. by describe text feature q input tape mask multiterminal attention resume module reconstructed after description text featureSpecific formula is as follows:
3-2. is by the text Expressive Features q after reconstructAIt is added with original text description vectors q, output is by layer normalization
Reason obtainsItself specific formula is as follows:
3-3. The visual feature F_L containing the non-aligned multi-view information obtained in step (2) and the q̄ obtained in step 3-2 are input into a second multi-head attention module, which associates q̄ with the feature F_L to obtain the reconstructed feature F_q. The specific formula is as follows:
F_q = MultiHead(q̄, F_L, F_L) (formula 8)
3-4. The feature F_q is added to the feature q̄, and the output is passed through layer normalization to obtain F̄_q. The specific formula is as follows:
F̄_q = LayerNorm(F_q + q̄) (formula 9)
3-5. F̄_q is input into a feed-forward network, the resulting output is added back to F̄_q, and the sum is passed through layer normalization to obtain the feature G containing both visual and text information. The specific formula is as follows:
G = LayerNorm(FFN(F̄_q) + F̄_q) (formula 10)
3-6. G is then input into the next decoder; after 6 cycles in total, the final feature G_6 is obtained.
3-7. G_6 is passed through one more linear layer, and the Softmax function is used to generate probabilities, which are output by the network as the predicted vector:
p″ = softmax(Linear(G_6)) (formula 11).
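Step 3-7 is a standard classification head: a linear projection of the final decoder feature followed by Softmax (formula 11). A minimal sketch with hypothetical sizes (feature dimension 8, a 20-word dictionary) and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 8, 20  # hypothetical feature dimension and description-dictionary size N

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

G6 = rng.normal(size=d)                      # final decoder feature from step 3-6
W, b = rng.normal(size=(d, N)), np.zeros(N)  # the linear layer of step 3-7
p = softmax(G6 @ W + b)                      # formula 11: p'' = softmax(Linear(G6))
word_index = int(p.argmax())                 # index of the predicted word
print(p.sum())
```

At inference time, the argmax over p selects the next description word from the dictionary.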
5. The image description method based on non-aligned multi-view feature enhancement according to claim 4, characterized in that the model training described in step (4) is specified as follows:
The actual description-text answer is converted by one-hot encoding so that it can be compared with the predicted vector p″, and the loss is then computed with the cross-entropy loss function. Let N be the size of the description-text dictionary, y the index corresponding to the actual description word, and p″ the predicted vector; the cross-entropy loss function is then defined as follows:
Loss = −log p″_y (formula 12)
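Because the target is one-hot, the cross-entropy sum over the dictionary collapses to the negative log-probability assigned to the actual word index y. A toy numerical check with a hypothetical 3-word dictionary:

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])  # hypothetical predicted vector p'' over an N = 3 dictionary
y = 1                          # index of the actual description word
one_hot = np.eye(3)[y]         # one-hot encoding of the answer

# the full cross-entropy sum and its one-hot shortcut agree
loss_full = -float((one_hot * np.log(p)).sum())
loss_short = -float(np.log(p[y]))
print(round(loss_short, 4))  # 0.3567
```

Training minimizes this quantity averaged over all words of the generated description.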
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910615360.5A CN110516530A (en) | 2019-07-09 | 2019-07-09 | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516530A true CN110516530A (en) | 2019-11-29 |
Family
ID=68622410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910615360.5A Withdrawn CN110516530A (en) | 2019-07-09 | 2019-07-09 | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516530A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108614815A (en) * | 2018-05-07 | 2018-10-02 | East China Normal University | Sentence exchange method and device |
CN109902166A (en) * | 2019-03-12 | 2019-06-18 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Visual question-answering model, electronic equipment and storage medium |
CN109906460A (en) * | 2016-11-04 | 2019-06-18 | Salesforce.com, Inc. | Dynamic coattention network for question answering |
Non-Patent Citations (3)
Title |
---|
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) * |
JAY ALAMMAR: "The Illustrated Transformer", blog post * |
PETER ANDERSON et al.: "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering", CVPR 2018 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832504A (en) * | 2020-07-20 | 2020-10-27 | Space Engineering University, PLA Strategic Support Force | Space information intelligent integrated generation method for satellite in-orbit application |
CN112200031A (en) * | 2020-09-27 | 2021-01-08 | Shanghai Eye Control Technology Co., Ltd. | Network model training method and equipment for generating image corresponding word description |
CN113139378A (en) * | 2021-03-18 | 2021-07-20 | Hangzhou Dianzi University | Image description method based on visual embedding and condition normalization |
CN113139378B (en) * | 2021-03-18 | 2022-02-18 | Hangzhou Dianzi University | Image description method based on visual embedding and condition normalization |
CN113283248A (en) * | 2021-04-29 | 2021-08-20 | Guilin University of Electronic Technology | Automatic natural language generation method and device for scatter diagram description |
CN113283248B (en) * | 2021-04-29 | 2022-06-21 | Guilin University of Electronic Technology | Automatic natural language generation method and device for scatter diagram description |
CN113191357A (en) * | 2021-05-18 | 2021-07-30 | China University of Petroleum (East China) | Multilevel image-text matching method based on graph attention network |
CN113657478B (en) * | 2021-08-10 | 2023-09-22 | Beihang University | Three-dimensional point cloud visual positioning method based on relational modeling |
CN113657478A (en) * | 2021-08-10 | 2021-11-16 | Beihang University | Three-dimensional point cloud visual positioning method based on relational modeling |
CN114693940A (en) * | 2022-03-22 | 2022-07-01 | University of Electronic Science and Technology of China | Image description method for enhancing feature mixing resolvability based on deep learning |
CN114693940B (en) * | 2022-03-22 | 2023-04-28 | University of Electronic Science and Technology of China | Image description method with enhanced feature mixing decomposability based on deep learning |
CN114913403A (en) * | 2022-07-18 | 2022-08-16 | Nanjing University of Information Science and Technology | Visual question-answering method based on metric learning |
CN115601553A (en) * | 2022-08-15 | 2023-01-13 | Hangzhou Lianhui Technology Co., Ltd. | Visual model pre-training method based on multi-level picture description data |
CN115601553B (en) * | 2022-08-15 | 2023-08-18 | Hangzhou Lianhui Technology Co., Ltd. | Visual model pre-training method based on multi-level picture description data |
CN116484905A (en) * | 2023-06-20 | 2023-07-25 | Hefei High-Dimensional Data Technology Co., Ltd. | Deep neural network model training method for non-aligned samples |
CN116484905B (en) * | 2023-06-20 | 2023-08-29 | Hefei High-Dimensional Data Technology Co., Ltd. | Deep neural network model training method for non-aligned samples |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516530A (en) | A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature | |
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN108415977B (en) | Deep neural network and reinforcement learning-based generative machine reading understanding method | |
Zhou et al. | Deep semantic dictionary learning for multi-label image classification | |
Amritkar et al. | Image caption generation using deep learning technique | |
CN110222349A (en) | A kind of model and method, computer of the expression of depth dynamic context word | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN113657124A (en) | Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer | |
CN112115687B (en) | Method for generating problem by combining triplet and entity type in knowledge base | |
Bin et al. | Entity slot filling for visual captioning | |
CN115221846A (en) | Data processing method and related equipment | |
Huang et al. | C-Rnn: a fine-grained language model for image captioning | |
CN116415170A (en) | Prompt learning small sample classification method, system, equipment and medium based on pre-training language model | |
CN111062865B (en) | Image processing method, image processing device, computer equipment and storage medium | |
Suresh et al. | Image captioning encoder–decoder models using cnn-rnn architectures: A comparative study | |
Xue et al. | LCSNet: End-to-end lipreading with channel-aware feature selection | |
Deng et al. | A position-aware transformer for image captioning | |
CN113609326A (en) | Image description generation method based on external knowledge and target relation | |
Ludwig et al. | Deep embedding for spatial role labeling | |
Zhang et al. | Self-attention for incomplete utterance rewriting | |
CN117197569A (en) | Image auditing method, image auditing model training method, device and equipment | |
Abdelaziz et al. | Few-shot learning with saliency maps as additional visual information | |
US20230154221A1 (en) | Unified pretraining framework for document understanding | |
Li et al. | Image captioning with weakly-supervised attention penalty |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20191129 |