CN110390363A - An Image Description Method - Google Patents
An Image Description Method
- Publication number: CN110390363A (application CN201910688842.3A)
- Authority: CN (China)
- Prior art keywords: image, feature, noun, indicate, local
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance versus false rejection rate
- G06F18/253 — Fusion techniques of extracted features
- G06F40/247 — Natural language analysis; lexical tools; thesauruses, synonyms
- G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
Abstract
An image description method: global image features are extracted from the picture with a VGG convolutional neural network, and local image features with a Faster R-CNN network. The global and local image features are merged by a global-local feature fusion algorithm to obtain an image fusion feature, which is processed by a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence. WordNet-based word-vector similarity is then computed between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, the sentence is corrected, and the final image description sentence is generated. The invention reduces the influence of irrelevant information, strengthens the expression of key information, improves the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentence.
Description
Technical field
The present invention relates to the field of image recognition and processing, and in particular to a method for object detection and description-sentence generation in images based on encoder-decoder processing and multi-feature fusion.
Background technique
With the rapid development of technology, smart phones have become ubiquitous, and taking and sharing photos has become a mainstream form of social interaction, so the number of images is growing exponentially. By 2014, Facebook alone held more than 250 billion pictures. Traditional image retrieval methods, such as manually annotating images and writing brief image descriptions, cannot cope with data of this magnitude, and fully manual processing has become impossible; what has arisen instead is the use of machines for automatic annotation and image description.
Image description has developed vigorously against the background of the rapid progress of machine learning and deep learning, and its applications are extremely wide, including human-computer interaction, image processing, object extraction, and video question answering. Put simply, image description uses a computer to emulate the process by which the human visual system analyzes and describes each object and the background in an image. While this is relatively easy for humans, it is considerably difficult for a computer, because the computer must not only find the objects and background in the picture but also understand the relationships between them, which is far more complex.
Most existing image description methods use only global image features. Their results show low accuracy in describing the relationships between objects, and objects are sometimes described incorrectly. Current image description methods offer no effective correction mechanism for these problems.
Summary of the invention
The present invention provides an image description method that reduces the influence of irrelevant information, strengthens the expression of key information, improves the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentence.
To achieve the above object, the present invention provides an image description method comprising the steps of:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global and local image features by a global-local feature fusion algorithm to obtain an image fusion feature;
processing the image fusion feature with a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence;
performing WordNet-based word-vector similarity calculation between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
The image convolution formula of the VGG convolutional neural network is:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where $x_j^{l}$ denotes the j-th feature map in layer $l$, $M_j$ denotes the set of input windows, $x_i^{l-1}$ denotes the i-th unit of input layer $l-1$, $k_{ij}^{l}$ denotes the kernel connecting the i-th input to the j-th convolutional map in layer $l$, $b_j^{l}$ denotes the j-th bias in layer $l$, and $f$ denotes an activation function;
the VGG convolutional neural network comprises 5 convolutional layers.
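To make the convolution formula concrete, the sketch below implements it for a single output map in plain NumPy. The kernel size, single-input-map setup, and ReLU activation are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def conv_map(inputs, kernels, bias, f=lambda v: np.maximum(v, 0)):
    """Compute one output map x_j^l = f(sum_{i in M_j} x_i^{l-1} * k_ij^l + b_j^l).

    inputs  : list of 2-D input maps x_i^{l-1} (the window set M_j)
    kernels : matching list of 2-D kernels k_ij^l
    bias    : scalar b_j^l
    f       : activation function (ReLU here, as an assumption)
    """
    kh, kw = kernels[0].shape
    H, W = inputs[0].shape
    out = np.full((H - kh + 1, W - kw + 1), bias, dtype=float)
    for x, k in zip(inputs, kernels):          # sum over i in M_j
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                out[r, c] += np.sum(x[r:r + kh, c:c + kw] * k)
    return f(out)

# toy example: one 4x4 input map, one 3x3 averaging kernel
x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0
y = conv_map([x], [k], bias=0.0)
print(y.shape)  # (2, 2)
```

With the averaging kernel, each output value is the mean of the 3x3 window it covers, which makes the sliding-window sum in the formula easy to verify by hand.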
The method of extracting local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, converting the original image into a set of feature maps through multiple convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes, and the ROI Pooling layer obtains the object category from the candidate region boxes and feature maps and regresses the final precise position of the detection box; after the object regions are extracted, the regions whose proportion of the picture exceeds P are screened, and convolutional feature extraction is performed on the screened regions with the VGG network, yielding, like the global features, an N*N-dimensional matrix;

$$P = \frac{S_{object}}{S_{picture}}$$

where P denotes the proportion of the object region in the whole picture, $S_{object}$ denotes the area of the object region, and $S_{picture}$ denotes the area of the whole picture.
The global-local feature fusion algorithm [formula not reproduced in the source] optimizes an objective in which, after projection, heterogeneous data are pushed as far apart as possible and homogeneous data are drawn as close as possible, where $G_f$, $L_f$, $M_f$ denote the global feature, local feature, and fusion feature respectively; the constant k is a balance factor whose value is a positive number; and the constraint condition normalizes the projection matrix.
The method of processing the image fusion feature with the bidirectional long short-term memory network with the attention mechanism comprises:

$$c_i = \sum_j a_{ij} s_j$$

$$\delta_i = \mathrm{softmax}(f_{att}(h_i, s_j))$$

$$f_{att}(h_i, s_j) = \tanh(W_1 h_i + W_2 s_j)$$

where $c_i$ denotes the context (environment) vector, $h_i$ denotes the current hidden state, $s_j$ denotes a previous hidden state, $a_{ij}$ denotes the attention probability matrix, $\delta_i$ is the weight assigned to the current state, i.e. the attention weight, and the attention function $f_{att}(h_i, s_j)$ computes the unnormalized alignment score between $h_i$ and $s_j$ through a fully connected layer;
the words in the sentence are indexed by t = 1, ..., N, and the bidirectional long short-term memory unit is expressed as:

$$x_t = W_\omega \theta_t$$

$$e_t = f(W_e x_t + b_e)$$

where $\theta_t$ is a one-hot column vector indicating the index of the word at position t, and the weight parameter $W_\omega$ is a word embedding matrix; the bidirectional LSTM has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; $s_t$, an h-dimensional vector obtained through the mapping function f, encodes the position in the sentence of the t-th word and its surrounding words, and b denotes the bias.
The method of correcting the image description sentence comprises:
obtaining object nouns from the local image features using the softmax function;
parsing the preliminary image description sentence to obtain description nouns;
computing the similarity between each object noun and description noun using WordNet, and replacing any description noun whose similarity is lower than 1 with the corresponding object noun.
The method of obtaining object nouns from the local image features using the softmax function comprises:
assume the input of the softmax function is a c-dimensional vector z; softmax is a normalized exponential function whose output is also a c-dimensional vector y with values between 0 and 1, defined as:

$$y_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}}$$

the denominator acts as a regularization term so that:

$$\sum_{c=1}^{C} y_c = 1$$

as the output layer of a neural network, the softmax values can be represented by c neurons; given an input z, the probability of each category c = 1, ..., C is expressed as:

$$P(t = c \mid z) = y_c$$

where $P(t = c \mid z)$ denotes the probability that the sample belongs to category c given input z.
The method of parsing the preliminary image description sentence to obtain description nouns comprises:
first segmenting the description sentence into words, then analyzing the part of speech of each word; using a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided into noun and non-noun.
The method of computing the similarity between object nouns and description nouns using WordNet comprises:
extracting candidate synonyms from the synonym sets of WordNet, vectorizing the words through WordNet to obtain word features, and computing the feature set SW:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects, and {W_C} denotes the word vectors of the nouns in the description;
the lexical similarity [formula not reproduced in the source] is computed from these features, where IDF(w_i) denotes the inverse of the number of documents in the WordNet-built corpus in which the word w_i occurs, $K_S$ denotes the weight of the synonym feature, and $K_C$ denotes the weight of the generic feature.
The main advantages of the present invention are:
1. The correlation between global and local features in the image is increased: global features are extracted with the VGG network, local features with the Faster R-CNN network, and the fusion feature obtained by the global-local feature fusion algorithm serves as the output of the encoder, thereby reducing the influence of irrelevant information and strengthening the expression of key information.
2. The attention mechanism and a bidirectional LSTM network are trained as the decoder, increasing the attention paid to important information in the features and improving the fault tolerance and generalization ability of the trained model.
3. For obvious noun errors in the description sentence, WordNet-based word-vector similarity is computed between the object information obtained during local feature extraction and the nouns in the description sentence to correct the sentence, increasing the accuracy of the description.
Detailed description of the invention
Fig. 1 is a flow chart of an image description method provided by the invention.
Fig. 2 is a schematic diagram of extracting global image features with the VGG convolutional neural network.
Fig. 3 is a schematic diagram of extracting local image features with the Faster R-CNN network.
Fig. 4 is a flow chart of the image description correction.
Fig. 5 is a schematic diagram of the finally generated image description sentence.
Specific embodiment
Preferred embodiments of the present invention are described below with reference to Figs. 1 to 5.
As shown in Fig. 1, the present invention provides an image description method comprising the steps of:
Step 1: adjusting the picture size, scaling input pictures of different sizes to a uniform size.
Step 2: extracting global image features from the picture with a VGG convolutional neural network.
Step 3: extracting local image features from the picture with a Faster R-CNN network.
Step 4: merging the global and local image features by the global-local feature fusion algorithm to obtain an image fusion feature.
Step 5: processing the image fusion feature with a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence.
Step 6: performing WordNet-based word-vector similarity calculation between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
In step 2, the present invention extracts the global features of the picture with the VGG16 network. VGG16 is the VGG network with 16 layers. VGG16 has a powerful feature-learning ability: visual features extracted by the convolutional neural network model have been applied successfully to many visual recognition tasks with high recognition accuracy. VGG16 uses stacks of small 3x3 convolution kernels. For a given receptive field, i.e. the region of the original image to which each pixel of a layer's output maps, increasing network depth with successive nonlinear layers guarantees that the network learns more complex patterns. Although VGG has more parameters and deeper layers, it converges after relatively few iterations and trains well. The image is convolved over the original picture with a predetermined window size; the convolution formula is:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where $x_j^{l}$ denotes the j-th feature map in layer $l$, $M_j$ denotes the set of input windows, $x_i^{l-1}$ denotes the i-th unit of input layer $l-1$, $k_{ij}^{l}$ denotes the kernel connecting the i-th input to the j-th convolutional map in layer $l$, $b_j^{l}$ denotes the j-th bias in layer $l$, and f denotes an activation function.
For the needs of image feature extraction, the present invention modifies VGG16 slightly: since the category of the image does not need to be identified, the fully connected layers used for final class prediction in the VGG16 structure are removed, which reduces the number of trained layers and parameters and accelerates training. The VGG16 network in the present invention consists mainly of 5 convolutional layers. As shown in Fig. 2, the first convolutional layer uses two 3*3*64 kernels; the second uses two 3*3*128 kernels; the third uses two 3*3*256 kernels and one 1*1*256 kernel; the fourth uses two 3*3*512 kernels and one 1*1*512 kernel; the fifth uses two 3*3*512 kernels and one 1*1*512 kernel. After the last convolutional layer, the feature maps form a set of N*N-dimensional matrices, defined as Gf. This set of matrices is the required global feature; it has learned holistic attributes of the image such as color, texture, and shape.
In step 3, the present invention extracts local features with a Faster R-CNN network model, as shown in Fig. 3. In the Faster R-CNN network, multiple convolutional layers convert the original image into a set of feature maps. These feature maps feed the subsequent RPN (Region Proposal Network) layer and the ROI Pooling (region-of-interest pooling) layer. The RPN network is trained to generate candidate region boxes; the ROI Pooling layer combines these candidate boxes with the earlier feature information to obtain the object category and regress the final precise position of the detection box.
The loss function for training the whole RPN network [formula not reproduced in the source] is defined over the anchors, where i indexes the i-th anchor box in the feature map (each point predicts k preselected anchor boxes; laid over the M*N image, these boxes are equivalent to preselected ROIs in the original image; each box is centered on a point of the feature map, with its sizes and aspect ratios fixed in advance), $p_i$ is the predicted foreground probability of the anchor (the value computed by the network), $p_i^*$ is the ground truth of the anchor, $t_i$ represents the predicted box values, and $t_i^*$ represents the ground-truth box (GT box) corresponding to a foreground anchor. When the anchor is a positive sample, $p_i^* = 1$; when the anchor is a negative sample, $p_i^* = 0$. $t_i^*$ denotes the coordinates of the ground-truth box associated with a positive anchor (each positive anchor corresponds to only one ground-truth box: an anchor is positive if its IOU with some ground-truth box is either the maximum among all anchors or greater than 0.7).
Since Faster R-CNN outputs a set of object positions and category information, this data must be converted into a set of N*N-dimensional matrices, like the global features, before it can be fused with the global information. Therefore, after extracting the objects, the present invention performs convolutional feature extraction on them with the VGG network, in the same way as in step 2.
Since a picture often contains multiple objects, extracting all of them would let obviously unimportant object information appear as interference in the local features. The objects therefore need a preliminary screening that selects the objects people mainly attend to. Research shows that people focus more on objects occupying a larger proportion of the picture, so the present invention evaluates objects by their proportion of the whole picture:

$$P = \frac{S_{object}}{S_{picture}}$$

where P denotes the proportion of the object region in the whole picture, $S_{object}$ denotes the area of the object region, and $S_{picture}$ denotes the area of the whole picture; the threshold of P is set to 0.3 in the present invention. That is, after Faster R-CNN extracts all object regions in the image, the present invention retains only the regions whose share of the original picture exceeds 30%, and finally extracts the image information of the screened regions with VGG16.
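The screening rule above can be sketched directly. The box format (x1, y1, x2, y2) and the helper names are assumptions made for illustration; the 0.3 threshold comes from the patent.

```python
def region_share(box, img_w, img_h):
    """P = S_object / S_picture for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / float(img_w * img_h)

def screen_regions(boxes, img_w, img_h, p_threshold=0.3):
    """Keep only detected regions covering more than p_threshold of the picture."""
    return [b for b in boxes if region_share(b, img_w, img_h) > p_threshold]

boxes = [(0, 0, 80, 60), (10, 10, 20, 20)]   # hypothetical detections
kept = screen_regions(boxes, img_w=100, img_h=100)
print(kept)  # [(0, 0, 80, 60)] — 48% of the picture survives, the 1% box is dropped
```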
In step 4, the present invention fuses the global features extracted in step 2 with the local features extracted in step 3. The optimization expression of the fusion algorithm [formula not reproduced in the source] is an objective in which, after projection, heterogeneous data are pushed as far apart as possible and homogeneous data are drawn as close as possible, where Gf, Lf, Mf denote the global feature, local feature, and fusion feature respectively; the constant k is a positive balance factor reflecting the influence of the global and local features on the final result; and the constraint condition normalizes the projection matrix.
By fusing the global and local image features in step 4, the present invention obtains an image fusion feature vector. Compared with the plain global feature, the fusion feature vector contains more key information, including the image information of the emphasized objects and the relation information between objects, and can therefore improve the accuracy of the description sentence.
In step 5, the present invention constructs a bidirectional LSTM (long short-term memory) network with an attention mechanism. A bidirectional LSTM considers the sequential relations between words in both directions and thus captures more feature information, so its effect is better than a unidirectional LSTM, and it is widely used in natural language processing tasks. Considering the limitation of the bidirectional LSTM in computing hidden layers, the attention mechanism is used to increase the weights of strongly associated words and reduce the weights of weakly associated words.
The attention model simulates human attention: at a particular moment, attention concentrates on one specific place and allocates little to other parts. The attention mechanism improves the efficiency of processing large-scale input data and reduces the input dimensionality by selecting a subset of the input. It also focuses on useful information, letting the model concentrate on the most salient parts of the input during training and thereby improving the training result. The attention mechanism model was proposed as an aid to the encoder-decoder framework, to remedy some defects in the design of the encoder-decoder structure.
After the attention mechanism is added, the calculation for the image fusion feature obtained in step 4 is as follows:

$$c_i = \sum_j a_{ij} s_j$$

$$\delta_i = \mathrm{softmax}(f_{att}(h_i, s_j))$$

$$f_{att}(h_i, s_j) = \tanh(W_1 h_i + W_2 s_j)$$

where $c_i$ denotes the context (environment) vector, $h_i$ the current hidden state, $s_j$ a previous hidden state, and $a_{ij}$ the attention probability matrix; the context vectors are predicted together with the current hidden state $h_i$, and $c_i$ is obtained by averaging over previous positions; $\delta_i$ is the weight assigned to the current state, i.e. the attention weight, and the attention function $f_{att}(h_i, s_j)$ computes the unnormalized alignment score between $h_i$ and $s_j$ through a fully connected layer.
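The attention step above can be sketched in NumPy. Since the patent does not print the vector that reduces the tanh output to a scalar score, collapsing it with a sum stands in for that scoring vector; all dimensions and weights are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_i, S, W1, W2):
    """Score each previous state s_j with f_att = tanh(W1 h_i + W2 s_j),
    normalize the scores with softmax, and average the states into c_i."""
    scores = np.array([np.sum(np.tanh(W1 @ h_i + W2 @ s_j)) for s_j in S])
    delta = softmax(scores)              # attention weights over previous states
    c_i = delta @ S                      # context (environment) vector
    return c_i, delta

rng = np.random.default_rng(1)
d = 8
h_i = rng.standard_normal(d)             # current hidden state
S = rng.standard_normal((5, d))          # previous hidden states s_1..s_5
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
c_i, delta = attention_context(h_i, S, W1, W2)
print(c_i.shape, round(float(delta.sum()), 6))  # (8,) 1.0
```

The weights delta form a probability distribution over the previous states, so the context vector is a convex combination of them — the "averaging over previous positions" the text describes.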
The bidirectional long short-term memory network converts a sequence of N words into N corresponding M-dimensional vectors, and the Bi-LSTM units then compute the contextual relations of each word. Indexing the words in the sentence by t = 1, ..., N, the bidirectional long short-term memory unit is expressed as:

$$x_t = W_\omega \theta_t$$

$$e_t = f(W_e x_t + b_e)$$

where $\theta_t$ is a one-hot column vector indicating the index of the word at position t, and the weight parameter $W_\omega$ is a word embedding matrix; the bidirectional LSTM has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; $s_t$, an h-dimensional vector obtained through the mapping function f, encodes the position in the sentence of the t-th word and its surrounding words, and b denotes the bias.
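The one-hot lookup x_t = W_ω θ_t and the two opposite-direction workflows can be sketched minimally. The simple tanh update below stands in for the full LSTM cell equations, which the patent does not print; vocabulary size and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
V, M, H = 10, 6, 4                       # vocab size, embedding dim, hidden dim
W_omega = rng.standard_normal((M, V))    # word embedding matrix W_omega
W_e = rng.standard_normal((H, M))
b_e = rng.standard_normal(H)

def embed(word_index):
    """x_t = W_omega @ theta_t, with theta_t a one-hot column vector."""
    theta = np.zeros(V)
    theta[word_index] = 1.0
    return W_omega @ theta

def run_direction(xs):
    """One directional pass e_t = f(W_e x_t + b_e); tanh is an assumed f."""
    return [np.tanh(W_e @ x + b_e) for x in xs]

sentence = [3, 1, 4, 1, 5]               # hypothetical word indices, t = 1..N
xs = [embed(t) for t in sentence]
forward = run_direction(xs)              # left-to-right workflow
backward = run_direction(xs[::-1])[::-1] # right-to-left workflow
states = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(len(states), states[0].shape)      # 5 (8,)
```

Concatenating the forward and backward states at each position is the usual way the two independent workflows are combined into one representation per word.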
In step 6, the present invention corrects the preliminary image description obtained in step 5 using the local image features extracted in step 3; the correction process is shown in Fig. 4.
In step 3, Faster R-CNN extracted the position information of the objects in the picture and also predicted the category of each detected object. These predicted categories are the object nouns extracted from the local image features. Since pictures contain objects of many different kinds, the present invention uses multinomial logistic regression, also known as the softmax function, which solves multi-class classification problems.
Assume the input of the softmax function is a c-dimensional vector z; softmax is a normalized exponential function whose output is also a c-dimensional vector y with values between 0 and 1, defined as:

$$y_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}}$$

the denominator acts as a regularization term, so that:

$$\sum_{c=1}^{C} y_c = 1$$

As the output layer of a neural network, the softmax values can be represented by c neurons. For a given input z, the probability of each category c = 1, ..., C is:

$$P(t = c \mid z) = y_c$$

where $P(t = c \mid z)$ denotes the probability that the sample belongs to category c given input z.
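The softmax definition above is a few lines of NumPy; the class scores are hypothetical, and the max-shift is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """y_c = exp(z_c) / sum_d exp(z_d); shift by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# hypothetical class scores for c = 1..4 object categories
z = np.array([2.0, 1.0, 0.1, -1.0])
y = softmax(z)
predicted = int(np.argmax(y))            # index of the predicted object category
print(round(float(y.sum()), 6), predicted)  # 1.0 0
```

Taking the argmax of y gives the object noun's category, which is what the correction step consumes.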
For the picture description sentence generated in step 5, the present invention first segments the sentence into words and then analyzes the part of speech of each word; using a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided here into noun and non-noun.
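The noun-extraction step can be sketched with a tiny hand-rolled lexicon standing in for the part-of-speech parser and noun corpus; a real system would use a trained tagger, and every entry in the lexicon here is purely illustrative.

```python
# A tiny stand-in for the part-of-speech parser plus noun corpus: the pipeline
# shape (segment -> tag -> filter) matches the text, the lexicon does not.
NOUN_CORPUS = {"dog", "frisbee", "grass", "man", "cat"}   # hypothetical entries

def tag_words(sentence):
    """Segment the sentence, then emit (word, pos) pairs, pos in {'noun', 'non-noun'}."""
    words = sentence.lower().rstrip(".").split()
    return [(w, "noun" if w in NOUN_CORPUS else "non-noun") for w in words]

def description_nouns(sentence):
    """Keep only the words tagged as nouns."""
    return [w for w, pos in tag_words(sentence) if pos == "noun"]

caption = "A dog is catching a frisbee on the grass."
print(description_nouns(caption))  # ['dog', 'frisbee', 'grass']
```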
Having obtained the nouns from object extraction and the nouns from the sentence, a relation is needed to connect the two groups; the present invention uses WordNet to solve this problem. WordNet is a special English dictionary containing rich semantic and part-of-speech information that distinguishes it from a dictionary in the ordinary sense. WordNet groups entries by meaning: a synonym set, or synset, represents a group of words with the same meaning. WordNet gives each synset a concise gloss and links synsets by part of speech and semantics. WordNet is a well-developed knowledge network from which the part-of-speech and semantic relations between words, together with the structural classification of parts of speech, can be found. The extracted nouns can therefore be converted through WordNet into a set of word vectors, and the similarity between the object nouns and the description nouns can be obtained by similarity calculation between the vectors. If the similarity is large, the description content is relatively accurate; if the similarity is low, the description is in error, and the noun in the description needs to be replaced with the object noun.
Using WordNet, candidate synonyms are extracted from the WordNet synonym sets, the words are vectorized through WordNet to obtain word features, and the feature set SW is computed:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects, and {W_C} denotes the word vectors of the nouns in the description.
With the lexical features defined above, the distance between words can serve as the basis for judging their similarity: the smaller the distance between two words, the greater their similarity. The lexical similarity in WordNet [formula not reproduced in the source] is computed from these features, where IDF(w_i) denotes the inverse of the number of documents in the WordNet-built corpus in which the word w_i occurs, $K_S$ denotes the weight of the synonym feature, and $K_C$ denotes the weight of the generic feature. If the value of Similarity(W_i, W_j) is lower than 1, the similarity of the two words is considered low.
Step 6 performs targeted correction of the described objects in the image description sentence using the local image information extracted in step 3, preventing object description errors.
Fig. 5 illustrates an image description result generated by the invention.
Although the contents of the present invention have been described in detail through the preferred embodiments above, it should be appreciated that the above description is not to be considered a limitation of the present invention. After those skilled in the art have read the above content, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention shall be defined by the appended claims.
Claims (9)
1. An image description method, characterized by comprising the steps of:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global image features and local image features through a global-local feature fusion algorithm to obtain the fused image feature;
processing the fused image feature with a bidirectional long short-term memory network equipped with an attention mechanism to generate a preliminary image description sentence;
computing WordNet word-vector similarity between the nouns in the preliminary image description sentence and the image target information obtained during local feature extraction, and correcting the preliminarily generated sentence accordingly to generate the final image description sentence.
2. The image description method of claim 1, characterized in that the image convolution formula of the VGG convolutional neural network is:

x_j^l = f( Σ_{i ∈ M_j} x_i^{l-1} * k_{ij}^l + b_j^l )

where x_j^l denotes the j-th feature map in layer l, M_j the set of input windows, x_i^{l-1} the i-th unit of the input layer l-1, k_{ij}^l the kernel connecting the i-th input unit to the j-th convolutional map in layer l, b_j^l the j-th bias in layer l, and f an excitation (activation) function;
the VGG convolutional neural network comprises 5 convolutional layers.
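The per-map convolution in claim 2 can be sketched as follows. This is a minimal pure-Python illustration, not the VGG implementation itself; the choice of ReLU for the excitation function f and the valid-padding shapes are assumptions, since the claim only names "an excitation function".

```python
def conv2d_single(inputs, kernels, bias, f=lambda v: max(v, 0.0)):
    """Compute one output map x_j^l = f(sum_{i in M_j} x_i^{l-1} * k_{ij}^l + b_j^l).

    inputs:  list of 2D input maps x_i^{l-1}
    kernels: matching list of 2D kernels k_{ij}^l
    bias:    scalar b_j^l
    f:       excitation function (ReLU assumed here)
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh = len(inputs[0]) - kh + 1          # valid padding
    ow = len(inputs[0][0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            s = bias
            for x, k in zip(inputs, kernels):   # sum over i in M_j
                for u in range(kh):
                    for v in range(kw):
                        s += x[r + u][c + v] * k[u][v]
            out[r][c] = f(s)                    # apply excitation f
    return out
```

A real VGG stacks many such maps per layer and runs them as batched tensor operations; this loop form only mirrors the formula term by term.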
3. The image description method of claim 1, characterized in that the method of extracting the local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, the original image is converted into a set of feature maps by multiple convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes; the ROI pooling layer obtains the category of each candidate target from the region boxes and the feature maps, and regression yields the final refined position of each detection box; after the target regions are extracted, the regions whose share of the picture is greater than P are retained, and convolutional feature extraction with the VGG network is applied to the retained regions, yielding an N*N-dimensional feature matrix;
where P = Sobject / Spicture denotes the proportion of the target region in the whole picture, Sobject denotes the area of the target region, and Spicture denotes the area of the whole picture.
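The area-ratio screening in claim 3 can be sketched directly from P = Sobject / Spicture. The (x1, y1, x2, y2) box format and the 0.05 threshold below are illustrative assumptions; the patent does not fix a value for the threshold P.

```python
def region_ratio(box, img_w, img_h):
    """P = S_object / S_picture for an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / float(img_w * img_h)

def screen_regions(boxes, img_w, img_h, p_min=0.05):
    """Keep only detected regions whose share of the picture exceeds p_min."""
    return [b for b in boxes if region_ratio(b, img_w, img_h) > p_min]
```

In the pipeline, the surviving boxes would then be cropped and passed through the VGG feature extractor.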
4. The image description method of claim 1, characterized in that in the expression of the global-local feature fusion algorithm, Gf, Lf and Mf denote the global feature, the local feature and the fused feature respectively; in the objective function, heterogeneous data are pushed as far apart as possible after projection while homogeneous data are kept as close as possible; the constant K is a balance factor taking a positive value; and the constraint condition normalizes the projection matrix.
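The fusion objective's formula is not fully legible in this text, so the following is only a stand-in sketch: it projects the global feature Gf and the local feature Lf with hypothetical weight matrices, normalizes each projection (echoing the claim's normalization constraint), and concatenates them into a fused feature Mf. The actual patented objective jointly optimizes the projections and is not reproduced here.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def l2_normalize(v):
    """Scale v to unit L2 norm (identity for the zero vector)."""
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v] if n else v

def fuse(Gf, Lf, Wg, Wl):
    """Mf = [normalize(Wg Gf) ; normalize(Wl Lf)] - a concatenation stand-in."""
    return l2_normalize(matvec(Wg, Gf)) + l2_normalize(matvec(Wl, Lf))
```

In the described pipeline, Mf would then be fed to the encoder in place of either feature alone.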
5. The image description method of claim 1, characterized in that the method by which the bidirectional long short-term memory network with attention mechanism processes the fused image feature comprises:

δ_i = softmax(f_att(h_i, s_j))
f_att(h_i, s_j) = tanh(W_1 h_i + W_2 s_j)

where C_i denotes the environment (context) vector, h_i the current hidden state, s_j a previous hidden state, a_ij the attention probability matrix, and δ_i the weight attached to the current state, i.e. the attention weight; f_att(h_i, s_j) is the unnormalized allocation score between h_i and s_j computed by the attention function in a fully connected manner;

the index t = 1, ..., N ranges over the words of the sentence, and the bidirectional long short-term memory unit is expressed as:

x_t = W_ω θ_t
e_t = f(W_e x_t + b_e)

where θ_t is an indicator column vector denoting the index of the word at position t, and the weight parameter W_ω is a word embedding matrix; the bidirectional long short-term memory unit has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; s_t, an h-dimensional vector, is the position of the t-th word and its surrounding words in the sentence obtained through the mapping function f, and b denotes the bias.
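The additive attention of claim 5 can be sketched as follows. Scores f_att(h_i, s_j) = tanh(W1 h_i + W2 s_j) are softmax-normalized into attention weights, which then mix the past hidden states into a context vector C_i. Scalar states and scalar weights w1, w2 are simplifying assumptions to keep the sketch short, and the weighted-sum form of C_i is the standard attention construction rather than a formula given in this text.

```python
import math

def attention(h_i, states, w1=1.0, w2=1.0):
    """Additive attention over past states: returns (weights, context C_i)."""
    scores = [math.tanh(w1 * h_i + w2 * s) for s in states]  # f_att(h_i, s_j)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]                 # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]                          # attention probs a_ij
    context = sum(a * s for a, s in zip(weights, states))    # C_i (assumed weighted sum)
    return weights, context
```

With vector-valued states, w1 and w2 become the matrices W1 and W2 and an extra projection reduces the tanh output to a scalar score.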
6. The image description method of claim 1, characterized in that the method of correcting the image description sentence comprises:
obtaining target nouns from the local image features using a softmax function;
parsing description nouns from the preliminary image description sentence;
calculating the similarity between the target nouns and the description nouns using WordNet, and replacing a description noun with the target noun when their similarity is below 1.
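The correction step of claim 6 can be sketched as a noun-replacement pass over the draft caption. The `similarity` argument below is a stand-in for the WordNet-based measure of claim 9, and the token/flag input format is an illustrative assumption.

```python
def correct_caption(tokens, noun_flags, target_nouns, similarity, threshold=1.0):
    """Replace each description noun whose best similarity to the detected
    target nouns falls below the threshold with that best-matching target noun."""
    out = []
    for tok, is_noun in zip(tokens, noun_flags):
        if is_noun and target_nouns:
            best = max(target_nouns, key=lambda t: similarity(tok, t))
            if similarity(tok, best) < threshold:
                tok = best                       # targeted correction
        out.append(tok)
    return out
```

Non-noun tokens pass through untouched, so only the described objects are corrected, as the patent intends.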
7. The image description method of claim 6, characterized in that the method of obtaining target nouns from the local image features with the softmax function comprises:
the input of the softmax function, a normalized exponential function, is a c-dimensional vector z, and its output is a c-dimensional vector y whose entries lie between 0 and 1:

y_c = e^{z_c} / Σ_{d=1}^{C} e^{z_d}

the denominator in the formula acts as a normalizing term so that the entries of y sum to 1;
used as the output layer of a neural network, the softmax values are represented by C neurons, and given an input z, the probability of each class c = 1, ..., C is expressed as:

P(t = c | z) = y_c

where P(t = c | z) denotes the probability that, given the input z, the object belongs to class c.
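The normalized exponential of claim 7 in code (the max-shift is a standard numerical-stability trick, not part of the claim):

```python
import math

def softmax(z):
    """y_c = exp(z_c) / sum_d exp(z_d): entries in (0, 1) that sum to 1."""
    m = max(z)                        # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)                 # the normalizing denominator
    return [e / total for e in exps]
```

Applied to the detector's class logits, the largest entry of the output picks the target noun's class.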
8. The image description method of claim 6, characterized in that the method of parsing the description nouns from the preliminary image description sentence comprises:
first segmenting the picture description sentence into words, then analysing the part of speech of each word; using a part-of-speech parser and a part-of-speech corpus of nouns, a set of (word, part-of-speech) pairs is generated, with each part of speech classified as noun or non-noun.
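The parsing of claim 8 produces (word, part-of-speech) pairs with a noun/non-noun split. A real tagger (e.g. NLTK's `pos_tag`) needs downloaded model data, so this sketch substitutes a toy noun lexicon; only the output shape matches the claim.

```python
# Hypothetical stand-in for the noun part-of-speech corpus named in the claim.
NOUN_LEXICON = {"dog", "cat", "ball", "park"}

def tag_nouns(sentence):
    """Split the caption into words and tag each as noun / non-noun."""
    return [(w, "noun" if w in NOUN_LEXICON else "non-noun")
            for w in sentence.lower().split()]
```

The noun-tagged words are the "description nouns" handed to the similarity check of claim 9.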
9. The image description method of claim 6, characterized in that the method of calculating the similarity between a target noun and a description noun using WordNet comprises:
extracting candidate synonyms from the WordNet synonym sets, vectorizing the words through WordNet to obtain word features, and computing the feature set SW:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image targets and {W_C} the word vectors of the nouns in the description; the lexical similarity is computed by a formula in which IDF(w_i) denotes the inverse document frequency of the word w_i obtained by training on WordNet, K_S denotes the weight of the synonym feature, and K_C denotes the weight of the common-category feature.
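The exact similarity formula of claim 9 is not legible in this text. As a stand-in, the sketch below combines cosine similarities over the synonym-feature vectors {W_S} and category-feature vectors {W_C} with the weights K_S and K_C; the cosine choice, the weight values, and the omission of the IDF term are all assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors (0 for a zero vector)."""
    num = sum(a * b for a, b in zip(u, v))
    du = sum(a * a for a in u) ** 0.5
    dv = sum(b * b for b in v) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def word_similarity(ws_i, wc_i, ws_j, wc_j, k_s=0.6, k_c=0.4):
    """K_S-weighted synonym-feature similarity plus K_C-weighted
    category-feature similarity (IDF weighting omitted in this sketch)."""
    return k_s * cosine(ws_i, ws_j) + k_c * cosine(wc_i, wc_j)
```

Under claim 6's rule, a description noun whose score against every target noun stays below 1 would be replaced by its best-scoring target noun.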
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910688842.3A CN110390363A (en) | 2019-07-29 | 2019-07-29 | A kind of Image Description Methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110390363A true CN110390363A (en) | 2019-10-29 |
Family
ID=68287863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910688842.3A Pending CN110390363A (en) | 2019-07-29 | 2019-07-29 | A kind of Image Description Methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390363A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106447658A (en) * | 2016-09-26 | 2017-02-22 | 西北工业大学 | Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network) |
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN109711464A (en) * | 2018-12-25 | 2019-05-03 | 中山大学 | Image Description Methods based on the building of stratification Attributed Relational Graps |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909736A (en) * | 2019-11-12 | 2020-03-24 | 北京工业大学 | Image description method based on long-short term memory model and target detection algorithm |
CN111079658A (en) * | 2019-12-19 | 2020-04-28 | 夸氪思维(南京)智能技术有限公司 | Video-based multi-target continuous behavior analysis method, system and device |
CN111079658B (en) * | 2019-12-19 | 2023-10-31 | 北京海国华创云科技有限公司 | Multi-target continuous behavior analysis method, system and device based on video |
CN111325323B (en) * | 2020-02-19 | 2023-07-14 | 山东大学 | Automatic power transmission and transformation scene description generation method integrating global information and local information |
CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
CN111553371A (en) * | 2020-04-17 | 2020-08-18 | 中国矿业大学 | Image semantic description method and system based on multi-feature extraction |
CN111626968A (en) * | 2020-04-29 | 2020-09-04 | 杭州火烧云科技有限公司 | Pixel enhancement design method based on global information and local information |
CN111310867A (en) * | 2020-05-11 | 2020-06-19 | 北京金山数字娱乐科技有限公司 | Text generation method and device based on picture |
CN113743096A (en) * | 2020-05-27 | 2021-12-03 | 南京大学 | Crowdsourcing test report similarity detection method based on natural language processing |
CN111860235A (en) * | 2020-07-06 | 2020-10-30 | 中国科学院空天信息创新研究院 | Method and system for generating high-low-level feature fused attention remote sensing image description |
CN111916050A (en) * | 2020-08-03 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112257759A (en) * | 2020-09-27 | 2021-01-22 | 华为技术有限公司 | Image processing method and device |
CN112232300A (en) * | 2020-11-11 | 2021-01-15 | 汇纳科技股份有限公司 | Global-occlusion adaptive pedestrian training/identification method, system, device, and medium |
CN112232300B (en) * | 2020-11-11 | 2024-01-19 | 汇纳科技股份有限公司 | Global occlusion self-adaptive pedestrian training/identifying method, system, equipment and medium |
CN112528989B (en) * | 2020-12-01 | 2022-10-18 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN112528989A (en) * | 2020-12-01 | 2021-03-19 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN114049501A (en) * | 2021-11-22 | 2022-02-15 | 江苏科技大学 | Image description generation method, system, medium and device fusing cluster search |
CN114333804A (en) * | 2021-12-27 | 2022-04-12 | 北京达佳互联信息技术有限公司 | Audio classification identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110390363A (en) | A kind of Image Description Methods | |
CN107330100B (en) | Image-text bidirectional retrieval method based on multi-view joint embedding space | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN109344288A (en) | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism | |
Sharma et al. | A survey of methods, datasets and evaluation metrics for visual question answering | |
CN114936623B (en) | Aspect-level emotion analysis method integrating multi-mode data | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
CN108509521A (en) | A kind of image search method automatically generating text index | |
CN111949824A (en) | Visual question answering method and system based on semantic alignment and storage medium | |
Li et al. | Multi-modal gated recurrent units for image description | |
Cheng et al. | Stack-VS: Stacked visual-semantic attention for image caption generation | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN117033609A (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN116975615A (en) | Task prediction method and device based on video multi-mode information | |
CN116977844A (en) | Lightweight underwater target real-time detection method | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
Guo et al. | Matching visual features to hierarchical semantic topics for image paragraph captioning | |
Nam et al. | A survey on multimodal bidirectional machine learning translation of image and natural language processing | |
US11494431B2 (en) | Generating accurate and natural captions for figures | |
Zheng et al. | Weakly-supervised image captioning based on rich contextual information | |
Pu et al. | Adaptive feature abstraction for translating video to language | |
CN117009570A (en) | Image-text retrieval method and device based on position information and confidence perception | |
CN116932736A (en) | Patent recommendation method based on combination of user requirements and inverted list |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191029
RJ01 | Rejection of invention patent application after publication |