CN110390363A - An image description method - Google Patents

An image description method

Info

Publication number
CN110390363A
CN110390363A
Authority
CN
China
Prior art keywords
image
feature
noun
indicate
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910688842.3A
Other languages
Chinese (zh)
Inventor
吕诗奇
刘晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201910688842.3A
Publication of CN110390363A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

An image description method: global image features are extracted from the picture with a VGG convolutional neural network; local image features are extracted with a Faster R-CNN network; the global image features and local image features are merged by a global-local feature fusion algorithm to obtain fused image features; a bidirectional long short-term memory network with an attention mechanism processes the fused image features to generate a preliminary image description sentence; the object information obtained during local image feature extraction and the nouns in the preliminary image description sentence are then compared by WordNet-based word-vector similarity, the image description sentence is corrected, and the final image description sentence is generated. The present invention reduces the influence of irrelevant information, strengthens the representation of key information, enhances the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentences.

Description

An image description method
Technical field
The present invention relates to the field of image recognition and processing, and in particular to a method for detecting objects in images and generating description sentences based on encoder-decoder processing and multi-feature fusion.
Background art
With the rapid development of technology, smartphones have become increasingly common, and selfies and casual snapshots have gradually become a mainstream way of socializing, so the number of images grows at an exponential rate. By 2014, Facebook alone held more than 250 billion pictures. Conventional image retrieval methods, such as manually annotating images and writing brief image descriptions, cannot bear data of this magnitude, and handling it entirely by hand has become all but impossible; what has emerged instead is automatic annotation and image description by machine.
Image description has developed vigorously against the background of the rapid progress of machine learning and deep learning, and its applications are extremely broad, including human-computer interaction, image processing, object extraction, and video question answering. Simply put, image description uses a computer to carry out the process by which a person, using the visual system, analyzes and describes each object and the background in an image. A description that is relatively easy for a human is considerably difficult for a computer, because the computer must not only find the objects and background in the picture but also understand the relationships between them, which is a far more complicated matter.
Most existing image description methods use only the global features of the image. Their results show low accuracy in describing the relationships between objects, and the described objects themselves are sometimes wrong. Current image description methods have no effective correction method for these problems.
Summary of the invention
The present invention provides an image description method that reduces the influence of irrelevant information, strengthens the representation of key information, enhances the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentences.
To achieve the above object, the present invention provides an image description method comprising the following steps:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global image features and the local image features with a global-local feature fusion algorithm to obtain fused image features;
processing the fused image features with a bidirectional long short-term memory network with attention mechanism to generate a preliminary image description sentence;
computing the WordNet-based word-vector similarity between the object information obtained during local image feature extraction and the nouns in the preliminary image description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
The image convolution formula of the VGG convolutional neural network is:

x_j^l = f( Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l )

where x_j^l denotes the j-th feature map of layer l, M_j denotes the set of input windows, x_i^{l-1} denotes the i-th unit of the input layer l-1, k_{ij}^l denotes the kernel of the i-th unit for the j-th convolutional map in layer l, b_j^l denotes the j-th bias in layer l, and f denotes an activation function;

the VGG convolutional neural network comprises 5 convolutional layers.
The method of extracting the local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, the original image is converted into a set of feature maps by several convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes; the ROI Pooling layer obtains the object category from the candidate region boxes and the feature maps and regresses the final precise position of the detection box; after the object regions are extracted, the object regions whose proportion P of the picture exceeds a threshold are retained, and convolutional feature extraction is applied to the screened object regions with the VGG network, yielding an N*N matrix like the global features;

P = S_object / S_picture

where P denotes the proportion of the object picture in the whole picture, S_object denotes the area of the object picture, and S_picture denotes the area of the whole picture.
The global-local feature fusion algorithm solves an optimization problem over projections of the two feature sets, where Gf, Lf, and Mf denote the global feature, the local feature, and the fused feature, respectively; the two terms of the objective function keep heterogeneous data as far apart as possible after projection while drawing homogeneous data as close together as possible; the constant K is a balance factor whose value is positive; and a constraint normalizes the projection matrix.
The method by which the bidirectional long short-term memory network with attention mechanism processes the fused image features comprises:

f_att(h_i, s_j) = tanh(W_1·h_i + W_2·s_j)
δ_i = softmax(f_att(h_i, s_j))
C_i = Σ_j a_ij·s_j

where C_i denotes the context (environment) vector, h_i denotes the current hidden state, s_j denotes a preceding hidden state, a_ij denotes the attention probability matrix, δ_i is the weight attached to the current state, i.e. the attention weight, and f_att(h_i, s_j) computes the unnormalized alignment score between h_i and s_j in a fully connected manner;
Using index t = 1, ..., N to denote the words in the sentence, the bidirectional long short-term memory unit is represented as:

x_t = W_ω·θ_t
e_t = f(W_e·x_t + b_e)
h_t^f = f(e_t + W_f·h_{t-1}^f + b_f)
h_t^b = f(e_t + W_b·h_{t+1}^b + b_b)
s_t = f(W_d·(h_t^f + h_t^b) + b_d)

where θ_t is an indicator column vector, the index vector of the word at position t; the weight parameter W_ω is a word embedding matrix; the bidirectional long short-term memory unit has two independent workflows, a left-to-right memory unit h_t^f and a right-to-left memory unit h_t^b; s_t, an h-dimensional vector, is obtained through the mapping function f from the position of the t-th word and its surrounding words in the sentence; and b denotes a bias.
The method of correcting the image description sentence comprises:
obtaining target nouns from the local image features with a softmax function;
parsing the preliminary image description sentence to obtain description nouns;
computing the similarity between the target nouns and the description nouns with WordNet, and replacing a description noun with the target noun when the similarity is below 1.
The method of obtaining target nouns from the local image features with a softmax function comprises:
assuming the input of the softmax function is a C-dimensional vector z, softmax is a normalized exponential function whose output is also a C-dimensional vector y with values between 0 and 1:

y_c = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

the denominator acts as a normalizing term, so that:

Σ_{c=1}^{C} y_c = 1

as the output layer of a neural network, the softmax values can be represented by C neurons;
given an input z, the probability of each class c, for c = 1, ..., C, is expressed as:

P(t = c | z) = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

where P(t = c | z) denotes the probability that, given input z, the sample belongs to class c.
The method of parsing the preliminary image description sentence to obtain description nouns comprises:
first segmenting the picture description sentence into words, then analyzing the part of speech of each word; with a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided into noun and non-noun.
The method of computing the similarity between a target noun and a description noun with WordNet comprises:
using WordNet, candidate synonyms are extracted from the WordNet synonym sets; the words are vectorized through WordNet to obtain word features, and the feature set SW is computed:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects and {W_C} denotes the word vectors of the nouns in the description;
the lexical similarity Similarity(W_i, W_j) is computed as a weighted combination of the synonym feature and the generic feature, where IDF(w_i) denotes the inverse of the number of documents, built from WordNet by training, in which the word w_i occurs, K_S denotes the weight of the synonym feature, and K_C denotes the weight of the generic feature.
The main advantages of the present invention are:
1. The degree of association between the global and local features of the image is increased: the VGG network extracts the global features, the Faster R-CNN network extracts the local features, and the global-local feature fusion algorithm yields the fused feature used as the output of the encoder, which reduces the influence of irrelevant information and strengthens the representation of key information.
2. The attention mechanism and the bidirectional LSTM network are trained as the decoder, which increases the attention paid to important information in the features and improves the fault tolerance and generalization ability of the trained model.
3. For obvious noun errors in the description sentence, the object information obtained during local feature extraction and the nouns in the description sentence are compared by WordNet-based word-vector similarity and the description is corrected, which increases the accuracy of the description sentences.
Description of the drawings
Fig. 1 is a flowchart of the image description method provided by the present invention.
Fig. 2 is a schematic diagram of extracting the global image features with the VGG convolutional neural network.
Fig. 3 is a schematic diagram of extracting the local image features with the Faster R-CNN network.
Fig. 4 is a flowchart of the description correction.
Fig. 5 is a schematic diagram of the finally generated image description sentence.
Specific embodiments
The preferred embodiments of the present invention are described below with reference to Figs. 1 to 5.
As shown in Fig. 1, the present invention provides an image description method comprising the following steps:
Step 1: adjust the size of the picture, scaling input pictures of different sizes to a uniform size.
Step 2: extract the global image features from the picture with the VGG convolutional neural network;
Step 3: extract the local image features from the picture with the Faster R-CNN network;
Step 4: merge the global image features and the local image features with the global-local feature fusion algorithm to obtain the fused image features;
Step 5: process the fused image features with the bidirectional long short-term memory network with attention mechanism to generate the preliminary image description sentence;
Step 6: compute the WordNet-based word-vector similarity between the object information obtained during local image feature extraction and the nouns in the preliminary image description sentence, correct the preliminarily generated image description sentence, and generate the final image description sentence.
In step 2, the present invention extracts the global picture features with a VGG16 network. VGG16 is simply the VGG network with 16 layers. VGG16 convolutional neural networks have a strong feature-learning ability: visual features extracted by convolutional neural network models have been successfully applied to many visual recognition tasks with high recognition accuracy. VGG16 stacks several small 3x3 convolution kernels in a row. For a given receptive field, that is, the area of the original image to which each pixel of a layer's output maps, stacking consecutive nonlinear layers increases the network depth and lets the network learn more complex patterns. Although VGG has more parameters and deeper layers, it needs only a few iterations to converge and trains very well. The image is convolved over the original image according to a predetermined window size; the convolution formula is:

x_j^l = f( Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l )

where x_j^l denotes the j-th feature map of layer l, M_j denotes the set of input windows, x_i^{l-1} denotes the i-th unit of the input layer l-1, k_{ij}^l denotes the kernel of the i-th unit for the j-th convolutional map in layer l, b_j^l denotes the j-th bias in layer l, and f denotes an activation function.
For the needs of image feature extraction, the present invention slightly modifies the VGG16 network. Because the image category does not need to be identified, the fully connected layers used for final class prediction in the VGG16 structure are removed, which reduces the number of layers and parameters to train and speeds up training. The VGG16 network in the present invention consists mainly of 5 convolutional layers. As shown in Fig. 2, the first convolutional layer uses two 3*3*64 kernels; the second uses two 3*3*128 kernels; the third uses two 3*3*256 kernels and one 1*1*256 kernel; the fourth uses two 3*3*512 kernels and one 1*1*512 kernel; the fifth uses two 3*3*512 kernels and one 1*1*512 kernel. After the last convolutional layer, the feature maps are output as a set of N*N matrices, defined as Gf. This set of matrices is the global feature required by the present invention; it has learned holistic attributes of the image such as color, texture, and shape features.
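For illustration, this truncated feature extractor can be sketched as follows; a minimal sketch assuming PyTorch and torchvision, in which the function name extract_global_features and the 224x224 input size are illustrative choices, not specified by the patent:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained VGG16 with only the convolutional part kept; the fully
# connected classification head is dropped, as described above.
vgg16 = models.vgg16(pretrained=True)
conv_features = vgg16.features.eval()

# Step 1 of the method: scale inputs of different sizes to a uniform size.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_global_features(image_path: str) -> torch.Tensor:
    """Return the feature maps Gf from the last convolutional layer."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        gf = conv_features(img)  # shape (1, 512, 7, 7) for a 224x224 input
    return gf.squeeze(0)
```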
In step 3, the present invention extracts the local features with a network model based on Faster R-CNN, as shown in Fig. 3. In the Faster R-CNN network, several convolutional layers convert the original image into a set of feature maps. The feature maps feed the subsequent RPN (Region Proposal Network) layer and the ROI Pooling (Region of Interest pooling) layer. The RPN network is trained to generate candidate region boxes; combining these candidate boxes with the preceding feature information, the ROI Pooling layer obtains the object category and regresses the final precise position of the detection box.
The loss function for training the whole RPN network is:

L({p_i}, {t_i}) = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)

where i indexes the i-th anchor box in the feature map (each point predicts k preselected anchor boxes; these boxes live on the M*N image and are equivalent to pre-selected ROIs in the original image; they are all centered on points of the feature map, with sizes and aspect ratios fixed in advance); p_i is the predicted foreground probability of the anchor (the value computed by the network); p_i* is the ground truth of the anchor; t_i represents the predicted box values; and t_i* represents the ground-truth box (GT box) corresponding to a foreground anchor. When the anchor is a positive sample, p_i* = 1; when the anchor is a negative sample, p_i* = 0. t_i* denotes the coordinates of the ground-truth box associated with a positive anchor (each positive anchor corresponds to only one ground-truth box: either this anchor has the largest IoU with that ground-truth box among all anchors, or its IoU with it is greater than 0.7).
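The two-term classification-plus-regression structure of this loss can be sketched as follows, assuming PyTorch; the function name rpn_loss, the label convention, and the default balance weight are illustrative assumptions rather than the patent's implementation:

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_preds, labels, box_targets, lam=1.0):
    """L = L_cls / N_cls + lam * L_reg / N_reg over the sampled anchors.

    cls_logits:  (N, 2) foreground/background scores per anchor
    box_preds:   (N, 4) predicted box deltas t_i
    labels:      (N,)   p_i* in {1: positive, 0: negative, -1: ignored}
    box_targets: (N, 4) ground-truth deltas t_i* for positive anchors
    """
    valid = labels >= 0
    cls_loss = F.cross_entropy(cls_logits[valid], labels[valid])

    pos = labels == 1  # regression counts only for positive anchors (p_i* = 1)
    reg_loss = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum")
    reg_loss = reg_loss / max(int(pos.sum()), 1)

    return cls_loss + lam * reg_loss
```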
Since Faster R-CNN yields a set of object position and category information, fusing it with the global information requires converting this data into a set of N*N matrices like the global features. Therefore, after extracting the objects, the present invention applies convolutional feature extraction to each object with the VGG network; the concrete operation is the same as in step 2.
Because a picture often contains multiple objects, extracting all of them would let obviously unimportant object information appear in the local features as interference, so the objects need a preliminary screening that selects those people mainly attend to. Research shows that people focus more on objects occupying a larger proportion of the picture. The present invention therefore evaluates each object by the proportion it occupies of the whole picture:

P = S_object / S_picture

where P denotes the proportion of the object picture in the whole picture, S_object denotes the area of the object picture, and S_picture denotes the area of the whole picture; the threshold of P is set to 0.3 in the present invention. That is, after Faster R-CNN extracts all object regions from the image, the present invention retains only the regions and objects whose share of the original image exceeds 30%, and finally extracts the image information of the screened regions with VGG16.
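A sketch of this detection-and-screening step, assuming torchvision's pretrained Faster R-CNN; the 0.3 area-ratio threshold comes from the text above, while screen_regions and the cropping logic are illustrative:

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

def screen_regions(image: torch.Tensor, p_threshold: float = 0.3):
    """Keep detections whose area ratio P = S_object / S_picture exceeds the threshold."""
    _, h, w = image.shape            # image: float tensor (C, H, W) in [0, 1]
    s_picture = float(h * w)
    with torch.no_grad():
        out = detector([image])[0]   # boxes (x1, y1, x2, y2), labels, scores
    kept = []
    for box, label in zip(out["boxes"], out["labels"]):
        x1, y1, x2, y2 = box.tolist()
        s_object = (x2 - x1) * (y2 - y1)
        if s_object / s_picture > p_threshold:
            crop = image[:, int(y1):int(y2), int(x1):int(x2)]
            kept.append((crop, int(label)))  # each crop then goes through VGG16
    return kept
```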
In step 4, the present invention fuses the global features extracted in step 2 with the local features extracted in step 3. The fusion algorithm optimizes an objective over projections of the two feature sets, where Gf, Lf, and Mf denote the global feature, the local feature, and the fused feature, respectively. The two terms of the objective function keep heterogeneous data as far apart as possible after projection while drawing homogeneous data as close together as possible. The constant K is a balance factor whose value is positive; K reflects the degree of influence of the global and local features on the final result during feature extraction. A constraint normalizes the projection matrix.
By fusing the global image features with the local image features in step 4, the present invention obtains a fused image feature vector. Compared with the plain global features, the fused feature vector contains more key information, including the image information that emphasizes the objects to be described and the relation information between objects, and can therefore improve the accuracy of the description sentences.
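A minimal sketch of such a projection-based fusion, assuming PyTorch; since the patent gives the objective only in outline, the pairing scheme, the averaged fused output, and all names below are assumptions:

```python
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """Learned projections that pull same-image (homogeneous) global/local
    features together and push different-image (heterogeneous) pairs apart,
    balanced by the positive constant K, as the text above describes."""
    def __init__(self, g_dim: int, l_dim: int, out_dim: int, k: float = 0.5):
        super().__init__()
        self.wg = nn.Linear(g_dim, out_dim, bias=False)  # projection for Gf
        self.wl = nn.Linear(l_dim, out_dim, bias=False)  # projection for Lf
        self.k = k                                       # balance factor K > 0

    def loss(self, gf: torch.Tensor, lf: torch.Tensor) -> torch.Tensor:
        pg, pl = self.wg(gf), self.wl(lf)                          # (batch, out_dim)
        same = (pg - pl).pow(2).sum(dim=1).mean()                  # homogeneous: close
        diff = (pg - pl.roll(1, dims=0)).pow(2).sum(dim=1).mean()  # heterogeneous: far
        # The patent's normalization constraint on the projection matrix is
        # approximated in practice by weight decay or explicit re-normalization.
        return same - self.k * diff

    def forward(self, gf: torch.Tensor, lf: torch.Tensor) -> torch.Tensor:
        return (self.wg(gf) + self.wl(lf)) / 2                     # fused feature Mf
```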
In step 5, the present invention builds a bidirectional LSTM (long short-term memory) network with an attention mechanism. A bidirectional LSTM can obtain more feature information from both the forward and backward order relations between words, so its effect is better than that of a unidirectional LSTM, and it is therefore widely used in natural language processing tasks. At the same time, considering the limitation of the bidirectional LSTM when computing the hidden layers, the attention mechanism is used to increase the weights of strongly associated words and decrease the weights of weakly associated words.
The attention model imitates the attention of the human brain; its basic idea is that at a particular moment attention can concentrate on one specific place while allocating very little to the other parts.
The attention mechanism can improve the computational efficiency of processing large-scale input data, and it reduces the dimensionality of the input by selecting a subset of it. The attention mechanism also focuses on useful information, letting the model concentrate on the most salient parts of the input during training, thereby improving the training results. The attention mechanism model was proposed to help the encoder-decoder framework and to remedy some defects in the design of the encoder-decoder structure.
After the attention mechanism is added, the present invention computes over the fused image features obtained in step 4 as follows:

f_att(h_i, s_j) = tanh(W_1·h_i + W_2·s_j)
δ_i = softmax(f_att(h_i, s_j))
C_i = Σ_j a_ij·s_j

where C_i denotes the context (environment) vector, h_i denotes the current hidden state, and s_j denotes a preceding hidden state; a_ij denotes the attention probability matrix, and the context vectors can be predicted together with the current hidden state h_i. C_i is obtained by averaging over the preceding positions, where δ_i is the weight attached to the current state, i.e., the attention weight, and f_att(h_i, s_j) computes the unnormalized alignment score between h_i and s_j in a fully connected manner.
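This additive attention can be sketched as follows, assuming PyTorch; the extra scoring vector v, which reduces tanh(W_1·h_i + W_2·s_j) to a scalar score, is a standard assumption the patent does not spell out:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores f_att(h_i, s_j) = tanh(W1 h_i + W2 s_j), softmax-normalizes them,
    and forms the context vector C_i as the weighted sum of the states s_j."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)  # scalar alignment score

    def forward(self, h_i: torch.Tensor, s: torch.Tensor):
        # h_i: (batch, hidden) current state; s: (batch, T, hidden) preceding states
        scores = self.v(torch.tanh(self.w1(h_i).unsqueeze(1) + self.w2(s)))
        weights = F.softmax(scores, dim=1)   # attention probabilities a_ij
        context = (weights * s).sum(dim=1)   # C_i = sum_j a_ij * s_j
        return context, weights.squeeze(-1)
```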
The bidirectional long short-term memory network converts a sequence of N words into N corresponding M-dimensional vectors, and the Bi-LSTM network units then compute the contextual relations of the words. Using index t = 1, ..., N to denote the words in the sentence, the bidirectional long short-term memory unit is represented as:

x_t = W_ω·θ_t
e_t = f(W_e·x_t + b_e)
h_t^f = f(e_t + W_f·h_{t-1}^f + b_f)
h_t^b = f(e_t + W_b·h_{t+1}^b + b_b)
s_t = f(W_d·(h_t^f + h_t^b) + b_d)

where θ_t is an indicator column vector, the index vector of the word at position t; the weight parameter W_ω is a word embedding matrix; the bidirectional long short-term memory unit has two independent workflows, a left-to-right memory unit h_t^f and a right-to-left memory unit h_t^b; s_t, an h-dimensional vector, is obtained through the mapping function f from the position of the t-th word and its surrounding words in the sentence; and b denotes a bias.
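A compact sketch of the bidirectional decoder, assuming PyTorch; the vocabulary size, embedding and hidden dimensions, and the class name BiLSTMCaptioner are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMCaptioner(nn.Module):
    """Embeds word indices (x_t = W_w * theta_t), runs a bidirectional LSTM
    (h_t^f and h_t^b), and maps both directions to per-position word scores s_t."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(word_ids)  # (batch, N, embed_dim)
        h, _ = self.lstm(x)       # (batch, N, 2 * hidden_dim)
        return self.out(h)        # per-position word scores
```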
In step 6, the present invention corrects the preliminary image description obtained in step 5 with the image local features extracted in step 3; the description correction procedure of the present invention is shown in Fig. 4.
In step 3, Faster R-CNN was used to extract the position information of the objects in the picture, and it also predicted the categories of the detected objects. The predicted categories are exactly the target nouns extracted from the local image features. Because various different objects appear in a picture, the present invention uses multinomial logistic regression, also called the softmax function, which can solve multi-class problems.
Assume the input data of the softmax function is a C-dimensional vector z; softmax is a normalized exponential function whose output is also a C-dimensional vector y, with values between 0 and 1, defined as:

y_c = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

The denominator acts as a normalizing term, which makes:

Σ_{c=1}^{C} y_c = 1

As the output layer of a neural network, the softmax values can be represented by C neurons. For a given input z, we can obtain the probability of each class c, for c = 1, ..., C:

P(t = c | z) = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

where P(t = c | z) denotes the probability that, given input z, the sample belongs to class c.
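The class-probability computation can be checked with a few lines of NumPy; the scores below are made-up values for illustration:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """P(t = c | z) for each class c; the denominator normalizes the outputs to sum to 1."""
    e = np.exp(z - z.max())  # subtracting the max adds numerical stability
    return e / e.sum()

scores = np.array([2.1, 0.3, -1.0])  # illustrative class scores for c = 1..3
probs = softmax(scores)
print(probs, probs.sum())            # probabilities in (0, 1), summing to 1
```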
For the picture description sentence generated in step 5, the present invention first segments the sentence into words and then analyzes the part of speech of each word; using a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and here the present invention divides the parts of speech into noun and non-noun.
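A sketch of this segmentation-and-tagging step, in which NLTK's tokenizer and part-of-speech tagger stand in for the unnamed parser and noun corpus of the patent:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def description_nouns(sentence: str) -> list:
    """Segment the sentence, tag each word's part of speech, and keep the nouns."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # (word, part-of-speech) pairs
    return [w for w, pos in tagged if pos.startswith("NN")]  # noun vs. non-noun

print(description_nouns("a dog is playing with a ball on the grass"))
# ['dog', 'ball', 'grass']
```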
After obtaining the nouns from object extraction and the nouns from the sentence description, a relation is needed to connect the two sets of nouns; the present invention uses WordNet to solve this problem. WordNet is a special English dictionary containing rich semantic and part-of-speech information that ordinary dictionaries lack. WordNet usually groups entries by their different meanings: a synonym set, i.e. a synset, represents a group of words with the same meaning. WordNet gives each synset a concise gloss and links the synsets according to part of speech and semantics. WordNet is a very complete knowledge network from which the part-of-speech and semantic relations between words can be seen, and it also carries the structural information of a part-of-speech classification. Therefore, the extracted nouns can be converted into a set of word vectors through WordNet, and the similarity between the nouns obtained by object extraction and the nouns in the description can be obtained by computing the similarity between the word vectors. If the similarity is large, the content of the description is relatively accurate; if the similarity is low, the description is in error, and the noun in the description needs to be replaced by the target noun.
Using WordNet, candidate synonyms are extracted from the WordNet synonym sets; the words are vectorized through WordNet to obtain word features, and the feature set SW is computed:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects and {W_C} denotes the word vectors of the nouns in the description.
According to the above definition of lexical features, the distance between two words can serve as the criterion of their similarity: the smaller the distance between two words, the greater their similarity. From the lexical similarity values, the similarity between two words in WordNet is easily obtained; the lexical similarity Similarity(W_i, W_j) is computed as a weighted combination of the synonym feature and the generic feature, where IDF(w_i) denotes the inverse of the number of documents, built from WordNet by training, in which the word w_i occurs, K_S denotes the weight of the synonym feature, and K_C denotes the weight of the generic feature. If the value of Similarity(W_i, W_j) is below 1, the two words are considered dissimilar.
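A sketch of the noun-correction step with NLTK's WordNet interface; the patent's IDF-weighted similarity is not reproduced here, so Wu-Palmer similarity and a 0.8 threshold stand in as illustrative substitutes:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def noun_similarity(target: str, described: str) -> float:
    """Best Wu-Palmer similarity (0..1) between any noun synsets of the two words."""
    best = 0.0
    for s1 in wn.synsets(target, pos=wn.NOUN):
        for s2 in wn.synsets(described, pos=wn.NOUN):
            best = max(best, s1.wup_similarity(s2) or 0.0)
    return best

def correct_description(caption_nouns, target_nouns, threshold=0.8):
    """Replace a caption noun with the closest detected noun when they diverge."""
    corrected = []
    for noun in caption_nouns:
        match = max(target_nouns, key=lambda t: noun_similarity(t, noun))
        keep = noun_similarity(match, noun) >= threshold
        corrected.append(noun if keep else match)
    return corrected

print(correct_description(["cat", "ball"], ["dog", "ball"]))
```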
Step 6 uses the image local information extracted in step 3 to perform a targeted correction of the described objects appearing in the image description sentence, preventing object description errors.
Fig. 5 illustrates an image description result generated by the present invention.
Although the contents of the present invention have been described in detail through the above preferred embodiments, it should be understood that the above description should not be considered a limitation of the present invention. After those skilled in the art have read the above, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention should be limited by the appended claims.

Claims (9)

1. An image description method, characterized in that it comprises the following steps:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global image features and the local image features with a global-local feature fusion algorithm to obtain fused image features;
processing the fused image features with a bidirectional long short-term memory network with attention mechanism to generate a preliminary image description sentence;
computing the WordNet-based word-vector similarity between the object information obtained during local image feature extraction and the nouns in the preliminary image description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
2. The image description method according to claim 1, characterized in that the image convolution formula of the VGG convolutional neural network is:

x_j^l = f( Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l )

where x_j^l denotes the j-th feature map of layer l, M_j denotes the set of input windows, x_i^{l-1} denotes the i-th unit of the input layer l-1, k_{ij}^l denotes the kernel of the i-th unit for the j-th convolutional map in layer l, b_j^l denotes the j-th bias in layer l, and f denotes an activation function;
the VGG convolutional neural network comprises 5 convolutional layers.
3. The image description method according to claim 1, characterized in that the method of extracting the local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, the original image is converted into a set of feature maps by several convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes; the ROI Pooling layer obtains the object category from the candidate region boxes and the feature maps and regresses the final precise position of the detection box; after the object regions are extracted, the object regions whose proportion P of the picture exceeds a threshold are retained, and convolutional feature extraction is applied to the screened object regions with the VGG network, yielding an N*N matrix like the global features;

P = S_object / S_picture

where P denotes the proportion of the object picture in the whole picture, S_object denotes the area of the object picture, and S_picture denotes the area of the whole picture.
4. The image description method according to claim 1, characterized in that the global-local feature fusion algorithm solves an optimization problem over projections of the two feature sets, where Gf, Lf, and Mf denote the global feature, the local feature, and the fused feature, respectively; the two terms of the objective function keep heterogeneous data as far apart as possible after projection and homogeneous data as close as possible; the constant K is a balance factor whose value is positive; and a constraint normalizes the projection matrix.
5. The image description method according to claim 1, characterized in that the method by which the bidirectional long short-term memory network with attention mechanism processes the fused image features comprises:

f_att(h_i, s_j) = tanh(W_1·h_i + W_2·s_j)
δ_i = softmax(f_att(h_i, s_j))
C_i = Σ_j a_ij·s_j

where C_i denotes the context (environment) vector, h_i denotes the current hidden state, s_j denotes a preceding hidden state, a_ij denotes the attention probability matrix, δ_i is the weight attached to the current state, i.e. the attention weight, and f_att(h_i, s_j) computes the unnormalized alignment score between h_i and s_j in a fully connected manner;
using index t = 1, ..., N to denote the words in the sentence, the bidirectional long short-term memory unit is represented as:

x_t = W_ω·θ_t
e_t = f(W_e·x_t + b_e)
h_t^f = f(e_t + W_f·h_{t-1}^f + b_f)
h_t^b = f(e_t + W_b·h_{t+1}^b + b_b)
s_t = f(W_d·(h_t^f + h_t^b) + b_d)

where θ_t is an indicator column vector, the index vector of the word at position t; the weight parameter W_ω is a word embedding matrix; the bidirectional long short-term memory unit has two independent workflows, a left-to-right memory unit h_t^f and a right-to-left memory unit h_t^b; s_t, an h-dimensional vector, is obtained through the mapping function f from the position of the t-th word and its surrounding words in the sentence; and b denotes a bias.
6. The image description method according to claim 1, characterized in that the method of correcting the image description sentence comprises:
obtaining target nouns from the local image features with a softmax function;
parsing the preliminary image description sentence to obtain description nouns;
computing the similarity between the target nouns and the description nouns with WordNet, and replacing a description noun with the target noun when the similarity is below 1.
7. The image description method according to claim 6, characterized in that the method of obtaining target nouns from the local image features with a softmax function comprises:
assuming the input of the softmax function is a C-dimensional vector z, softmax is a normalized exponential function whose output is also a C-dimensional vector y with values between 0 and 1:

y_c = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

the denominator acts as a normalizing term, so that:

Σ_{c=1}^{C} y_c = 1

as the output layer of a neural network, the softmax values can be represented by C neurons;
given an input z, the probability of each class c, for c = 1, ..., C, is expressed as:

P(t = c | z) = exp(z_c) / Σ_{d=1}^{C} exp(z_d)

where P(t = c | z) denotes the probability that, given input z, the sample belongs to class c.
8. The image description method according to claim 6, characterized in that the method of parsing the preliminary image description sentence to obtain description nouns comprises:
first segmenting the picture description sentence into words, then analyzing the part of speech of each word; with a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided into noun and non-noun.
9. The image description method according to claim 6, characterized in that the method of computing the similarity between the target nouns and the description nouns with WordNet comprises:
using WordNet, candidate synonyms are extracted from the WordNet synonym sets; the words are vectorized through WordNet to obtain word features, and the feature set SW is computed:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects and {W_C} denotes the word vectors of the nouns in the description;
the lexical similarity Similarity(W_i, W_j) is computed as a weighted combination of the synonym feature and the generic feature, where IDF(w_i) denotes the inverse of the number of documents, built from WordNet by training, in which the word w_i occurs, K_S denotes the weight of the synonym feature, and K_C denotes the weight of the generic feature.
CN201910688842.3A 2019-07-29 2019-07-29 An image description method Pending CN110390363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910688842.3A CN110390363A (en) 2019-07-29 2019-07-29 An image description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910688842.3A CN110390363A (en) 2019-07-29 2019-07-29 An image description method

Publications (1)

Publication Number Publication Date
CN110390363A (en) 2019-10-29

Family

ID=68287863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910688842.3A Pending CN110390363A (en) 2019-07-29 2019-07-29 A kind of Image Description Methods

Country Status (1)

Country Link
CN (1) CN110390363A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106447658A (en) * 2016-09-26 2017-02-22 西北工业大学 Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network)
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909736A (en) * 2019-11-12 2020-03-24 北京工业大学 Image description method based on long-short term memory model and target detection algorithm
CN111079658A (en) * 2019-12-19 2020-04-28 夸氪思维(南京)智能技术有限公司 Video-based multi-target continuous behavior analysis method, system and device
CN111079658B (en) * 2019-12-19 2023-10-31 北京海国华创云科技有限公司 Multi-target continuous behavior analysis method, system and device based on video
CN111325323B (en) * 2020-02-19 2023-07-14 山东大学 Automatic power transmission and transformation scene description generation method integrating global information and local information
CN111325323A (en) * 2020-02-19 2020-06-23 山东大学 Power transmission and transformation scene description automatic generation method fusing global information and local information
CN111553371A (en) * 2020-04-17 2020-08-18 中国矿业大学 Image semantic description method and system based on multi-feature extraction
CN111626968A (en) * 2020-04-29 2020-09-04 杭州火烧云科技有限公司 Pixel enhancement design method based on global information and local information
CN111310867A (en) * 2020-05-11 2020-06-19 北京金山数字娱乐科技有限公司 Text generation method and device based on picture
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN111860235A (en) * 2020-07-06 2020-10-30 中国科学院空天信息创新研究院 Method and system for generating high-low-level feature fused attention remote sensing image description
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112257759A (en) * 2020-09-27 2021-01-22 华为技术有限公司 Image processing method and device
CN112232300A (en) * 2020-11-11 2021-01-15 汇纳科技股份有限公司 Global-occlusion adaptive pedestrian training/identification method, system, device, and medium
CN112232300B (en) * 2020-11-11 2024-01-19 汇纳科技股份有限公司 Global occlusion self-adaptive pedestrian training/identifying method, system, equipment and medium
CN112528989B (en) * 2020-12-01 2022-10-18 重庆邮电大学 Description generation method for semantic fine granularity of image
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
CN114049501A (en) * 2021-11-22 2022-02-15 江苏科技大学 Image description generation method, system, medium and device fusing cluster search
CN114333804A (en) * 2021-12-27 2022-04-12 北京达佳互联信息技术有限公司 Audio classification identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110390363A (en) An image description method
CN107330100B (en) Image-text bidirectional retrieval method based on multi-view joint embedding space
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN109344288A (en) Combined video description method based on multi-modal features and a multilayer attention mechanism
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN111598183A (en) Multi-feature fusion image description method
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN108509521A (en) Image search method that automatically generates a text index
CN111949824A (en) Visual question answering method and system based on semantic alignment and storage medium
Li et al. Multi-modal gated recurrent units for image description
Cheng et al. Stack-VS: Stacked visual-semantic attention for image caption generation
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN116977844A (en) Lightweight underwater target real-time detection method
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
US11494431B2 (en) Generating accurate and natural captions for figures
Zheng et al. Weakly-supervised image captioning based on rich contextual information
Pu et al. Adaptive feature abstraction for translating video to language
CN117009570A (en) Image-text retrieval method and device based on position information and confidence perception
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191029)