CN110390363A - An Image Description Method - Google Patents
An Image Description Method
- Publication number: CN110390363A (application CN201910688842.3A)
- Authority: CN (China)
- Prior art keywords: image, feature, noun, indicate, local
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance versus false rejection rate
- G06F18/253 — Fusion techniques of extracted features
- G06F40/247 — Natural language analysis; lexical tools; thesauruses, synonyms
- G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
Abstract
An image description method: global image features are extracted from the picture with a VGG convolutional neural network, and local image features with a Faster R-CNN network. The global and local image features are merged by a global-local feature fusion algorithm to obtain an image fusion feature, which is processed by a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence. WordNet-based word-vector similarity is then computed between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, the sentence is corrected, and the final image description sentence is generated. The invention reduces the influence of irrelevant information, strengthens the expression of key information, improves the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentence.
Description
Technical field
The present invention relates to the field of image recognition and processing, and in particular to a method for object detection and description-sentence generation in images based on encoder-decoder processing and multi-feature fusion.
Background technique
With the rapid development of technology, smart phones have become ubiquitous, and taking and sharing photos has become a mainstream form of social interaction, so the number of images is growing exponentially. By 2014, Facebook alone held more than 250 billion pictures. Traditional image retrieval methods, such as manually annotating images and writing brief image descriptions, cannot cope with data of this magnitude, and fully manual processing has become impossible; what has arisen instead is the use of machines for automatic annotation and image description.
Image description has developed vigorously against the background of the rapid progress of machine learning and deep learning, and its applications are extremely wide, including human-computer interaction, image processing, object extraction, and video question answering. Put simply, image description uses a computer to emulate the process by which the human visual system analyzes and describes each object and the background in an image. While this is relatively easy for humans, it is considerably difficult for a computer, because the computer must not only find the objects and background in the picture but also understand the relationships between them, which is far more complex.
Most existing image description methods use only global image features. Their results show low accuracy in describing the relationships between objects, and objects are sometimes described incorrectly. Current image description methods offer no effective correction mechanism for these problems.
Summary of the invention
The present invention provides an image description method that reduces the influence of irrelevant information, strengthens the expression of key information, improves the fault tolerance and generalization ability of the model, and increases the accuracy of the description sentence.
To achieve the above object, the present invention provides an image description method comprising the steps of:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global and local image features by a global-local feature fusion algorithm to obtain an image fusion feature;
processing the image fusion feature with a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence;
performing WordNet-based word-vector similarity calculation between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
The image convolution formula of the VGG convolutional neural network is:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where $x_j^{l}$ denotes the j-th feature map in layer $l$, $M_j$ denotes the set of input windows, $x_i^{l-1}$ denotes the i-th unit of input layer $l-1$, $k_{ij}^{l}$ denotes the kernel connecting the i-th input to the j-th convolutional map in layer $l$, $b_j^{l}$ denotes the j-th bias in layer $l$, and $f$ denotes an activation function;
the VGG convolutional neural network comprises 5 convolutional layers.
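To make the convolution formula concrete, the sketch below implements it for a single output map in plain NumPy. The kernel size, single-input-map setup, and ReLU activation are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def conv_map(inputs, kernels, bias, f=lambda v: np.maximum(v, 0)):
    """Compute one output map x_j^l = f(sum_{i in M_j} x_i^{l-1} * k_ij^l + b_j^l).

    inputs  : list of 2-D input maps x_i^{l-1} (the window set M_j)
    kernels : matching list of 2-D kernels k_ij^l
    bias    : scalar b_j^l
    f       : activation function (ReLU here, as an assumption)
    """
    kh, kw = kernels[0].shape
    H, W = inputs[0].shape
    out = np.full((H - kh + 1, W - kw + 1), bias, dtype=float)
    for x, k in zip(inputs, kernels):          # sum over i in M_j
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                out[r, c] += np.sum(x[r:r + kh, c:c + kw] * k)
    return f(out)

# toy example: one 4x4 input map, one 3x3 averaging kernel
x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0
y = conv_map([x], [k], bias=0.0)
print(y.shape)  # (2, 2)
```

With the averaging kernel, each output value is the mean of the 3x3 window it covers, which makes the sliding-window sum in the formula easy to verify by hand.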
The method of extracting local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, converting the original image into a set of feature maps through multiple convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes, and the ROI Pooling layer obtains the object category from the candidate region boxes and feature maps and regresses the final precise position of the detection box; after the object regions are extracted, the regions whose proportion of the picture exceeds P are screened, and convolutional feature extraction is performed on the screened regions with the VGG network, yielding, like the global features, an N*N-dimensional matrix;

$$P = \frac{S_{object}}{S_{picture}}$$

where P denotes the proportion of the object region in the whole picture, $S_{object}$ denotes the area of the object region, and $S_{picture}$ denotes the area of the whole picture.
The global-local feature fusion algorithm [formula not reproduced in the source] optimizes an objective in which, after projection, heterogeneous data are pushed as far apart as possible and homogeneous data are drawn as close as possible, where $G_f$, $L_f$, $M_f$ denote the global feature, local feature, and fusion feature respectively; the constant k is a balance factor whose value is a positive number; and the constraint condition normalizes the projection matrix.
The method of processing the image fusion feature with the bidirectional long short-term memory network with the attention mechanism comprises:

$$c_i = \sum_j a_{ij} s_j$$

$$\delta_i = \mathrm{softmax}(f_{att}(h_i, s_j))$$

$$f_{att}(h_i, s_j) = \tanh(W_1 h_i + W_2 s_j)$$

where $c_i$ denotes the context (environment) vector, $h_i$ denotes the current hidden state, $s_j$ denotes a previous hidden state, $a_{ij}$ denotes the attention probability matrix, $\delta_i$ is the weight assigned to the current state, i.e. the attention weight, and the attention function $f_{att}(h_i, s_j)$ computes the unnormalized alignment score between $h_i$ and $s_j$ through a fully connected layer;
the words in the sentence are indexed by t = 1, ..., N, and the bidirectional long short-term memory unit is expressed as:

$$x_t = W_\omega \theta_t$$

$$e_t = f(W_e x_t + b_e)$$

where $\theta_t$ is a one-hot column vector indicating the index of the word at position t, and the weight parameter $W_\omega$ is a word embedding matrix; the bidirectional LSTM has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; $s_t$, an h-dimensional vector obtained through the mapping function f, encodes the position in the sentence of the t-th word and its surrounding words, and b denotes the bias.
The method of correcting the image description sentence comprises:
obtaining object nouns from the local image features using the softmax function;
parsing the preliminary image description sentence to obtain description nouns;
computing the similarity between each object noun and description noun using WordNet, and replacing any description noun whose similarity is lower than 1 with the corresponding object noun.
The method of obtaining object nouns from the local image features using the softmax function comprises:
assume the input of the softmax function is a c-dimensional vector z; softmax is a normalized exponential function whose output is also a c-dimensional vector y with values between 0 and 1, defined as:

$$y_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}}$$

the denominator acts as a regularization term so that:

$$\sum_{c=1}^{C} y_c = 1$$

as the output layer of a neural network, the softmax values can be represented by c neurons; given an input z, the probability of each category c = 1, ..., C is expressed as:

$$P(t = c \mid z) = y_c$$

where $P(t = c \mid z)$ denotes the probability that the sample belongs to category c given input z.
The method of parsing the preliminary image description sentence to obtain description nouns comprises:
first segmenting the description sentence into words, then analyzing the part of speech of each word; using a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided into noun and non-noun.
The method of computing the similarity between object nouns and description nouns using WordNet comprises:
extracting candidate synonyms from the synonym sets of WordNet, vectorizing the words through WordNet to obtain word features, and computing the feature set SW:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects, and {W_C} denotes the word vectors of the nouns in the description;
the lexical similarity [formula not reproduced in the source] is computed from these features, where IDF(w_i) denotes the inverse of the number of documents in the WordNet-built corpus in which the word w_i occurs, $K_S$ denotes the weight of the synonym feature, and $K_C$ denotes the weight of the generic feature.
The main advantages of the present invention are:
1. The correlation between global and local features in the image is increased: global features are extracted with the VGG network, local features with the Faster R-CNN network, and the fusion feature obtained by the global-local feature fusion algorithm serves as the output of the encoder, thereby reducing the influence of irrelevant information and strengthening the expression of key information.
2. The attention mechanism and a bidirectional LSTM network are trained as the decoder, increasing the attention paid to important information in the features and improving the fault tolerance and generalization ability of the trained model.
3. For obvious noun errors in the description sentence, WordNet-based word-vector similarity is computed between the object information obtained during local feature extraction and the nouns in the description sentence to correct the sentence, increasing the accuracy of the description.
Detailed description of the invention
Fig. 1 is a flow chart of an image description method provided by the invention.
Fig. 2 is a schematic diagram of extracting global image features with the VGG convolutional neural network.
Fig. 3 is a schematic diagram of extracting local image features with the Faster R-CNN network.
Fig. 4 is a flow chart of the image description correction.
Fig. 5 is a schematic diagram of the finally generated image description sentence.
Specific embodiment
Preferred embodiments of the present invention are described below with reference to Figs. 1 to 5.
As shown in Fig. 1, the present invention provides an image description method comprising the steps of:
Step 1: adjusting the picture size, scaling input pictures of different sizes to a uniform size.
Step 2: extracting global image features from the picture with a VGG convolutional neural network.
Step 3: extracting local image features from the picture with a Faster R-CNN network.
Step 4: merging the global and local image features by the global-local feature fusion algorithm to obtain an image fusion feature.
Step 5: processing the image fusion feature with a bidirectional long short-term memory network with an attention mechanism to generate a preliminary image description sentence.
Step 6: performing WordNet-based word-vector similarity calculation between the object information obtained during local feature extraction and the nouns in the preliminary description sentence, correcting the preliminarily generated image description sentence, and generating the final image description sentence.
In step 2, the present invention extracts the global features of the picture with the VGG16 network. VGG16 is the VGG network with 16 layers. VGG16 has a powerful feature-learning ability: visual features extracted by the convolutional neural network model have been applied successfully to many visual recognition tasks with high recognition accuracy. VGG16 uses stacks of small 3x3 convolution kernels. For a given receptive field, i.e. the region of the original image to which each pixel of a layer's output maps, increasing network depth with successive nonlinear layers guarantees that the network learns more complex patterns. Although VGG has more parameters and deeper layers, it converges after relatively few iterations and trains well. The image is convolved over the original picture with a predetermined window size; the convolution formula is:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where $x_j^{l}$ denotes the j-th feature map in layer $l$, $M_j$ denotes the set of input windows, $x_i^{l-1}$ denotes the i-th unit of input layer $l-1$, $k_{ij}^{l}$ denotes the kernel connecting the i-th input to the j-th convolutional map in layer $l$, $b_j^{l}$ denotes the j-th bias in layer $l$, and f denotes an activation function.
For the needs of image feature extraction, the present invention modifies VGG16 slightly: since the category of the image does not need to be identified, the fully connected layers used for final class prediction in the VGG16 structure are removed, which reduces the number of trained layers and parameters and accelerates training. The VGG16 network in the present invention consists mainly of 5 convolutional layers. As shown in Fig. 2, the first convolutional layer uses two 3*3*64 kernels; the second uses two 3*3*128 kernels; the third uses two 3*3*256 kernels and one 1*1*256 kernel; the fourth uses two 3*3*512 kernels and one 1*1*512 kernel; the fifth uses two 3*3*512 kernels and one 1*1*512 kernel. After the last convolutional layer, the feature maps form a set of N*N-dimensional matrices, defined as Gf. This set of matrices is the required global feature; it has learned holistic attributes of the image such as color, texture, and shape.
In step 3, the present invention extracts local features with a Faster R-CNN network model, as shown in Fig. 3. In the Faster R-CNN network, multiple convolutional layers convert the original image into a set of feature maps. These feature maps feed the subsequent RPN (Region Proposal Network) layer and the ROI Pooling (region-of-interest pooling) layer. The RPN network is trained to generate candidate region boxes; the ROI Pooling layer combines these candidate boxes with the earlier feature information to obtain the object category and regress the final precise position of the detection box.
The loss function for training the whole RPN network [formula not reproduced in the source] is defined over the anchors, where i indexes the i-th anchor box in the feature map (each point predicts k preselected anchor boxes; laid over the M*N image, these boxes are equivalent to preselected ROIs in the original image; each box is centered on a point of the feature map, with its sizes and aspect ratios fixed in advance), $p_i$ is the predicted foreground probability of the anchor (the value computed by the network), $p_i^*$ is the ground truth of the anchor, $t_i$ represents the predicted box values, and $t_i^*$ represents the ground-truth box (GT box) corresponding to a foreground anchor. When the anchor is a positive sample, $p_i^* = 1$; when the anchor is a negative sample, $p_i^* = 0$. $t_i^*$ denotes the coordinates of the ground-truth box associated with a positive anchor (each positive anchor corresponds to only one ground-truth box: an anchor is positive if its IOU with some ground-truth box is either the maximum among all anchors or greater than 0.7).
Since Faster R-CNN outputs a set of object positions and category information, this data must be converted into a set of N*N-dimensional matrices, like the global features, before it can be fused with the global information. Therefore, after extracting the objects, the present invention performs convolutional feature extraction on them with the VGG network, in the same way as in step 2.
Since a picture often contains multiple objects, extracting all of them would let obviously unimportant object information appear as interference in the local features. The objects therefore need a preliminary screening that selects the objects people mainly attend to. Research shows that people focus more on objects occupying a larger proportion of the picture, so the present invention evaluates objects by their proportion of the whole picture:

$$P = \frac{S_{object}}{S_{picture}}$$

where P denotes the proportion of the object region in the whole picture, $S_{object}$ denotes the area of the object region, and $S_{picture}$ denotes the area of the whole picture; the threshold of P is set to 0.3 in the present invention. That is, after Faster R-CNN extracts all object regions in the image, the present invention retains only the regions whose share of the original picture exceeds 30%, and finally extracts the image information of the screened regions with VGG16.
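The screening rule above can be sketched directly. The box format (x1, y1, x2, y2) and the helper names are assumptions made for illustration; the 0.3 threshold comes from the patent.

```python
def region_share(box, img_w, img_h):
    """P = S_object / S_picture for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / float(img_w * img_h)

def screen_regions(boxes, img_w, img_h, p_threshold=0.3):
    """Keep only detected regions covering more than p_threshold of the picture."""
    return [b for b in boxes if region_share(b, img_w, img_h) > p_threshold]

boxes = [(0, 0, 80, 60), (10, 10, 20, 20)]   # hypothetical detections
kept = screen_regions(boxes, img_w=100, img_h=100)
print(kept)  # [(0, 0, 80, 60)] — 48% of the picture survives, the 1% box is dropped
```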
In step 4, the present invention fuses the global features extracted in step 2 with the local features extracted in step 3. The optimization expression of the fusion algorithm [formula not reproduced in the source] is an objective in which, after projection, heterogeneous data are pushed as far apart as possible and homogeneous data are drawn as close as possible, where Gf, Lf, Mf denote the global feature, local feature, and fusion feature respectively; the constant k is a positive balance factor reflecting the influence of the global and local features on the final result; and the constraint condition normalizes the projection matrix.
By fusing the global and local image features in step 4, the present invention obtains an image fusion feature vector. Compared with the plain global feature, the fusion feature vector contains more key information, including the image information of the emphasized objects and the relation information between objects, and can therefore improve the accuracy of the description sentence.
In step 5, the present invention constructs a bidirectional LSTM (long short-term memory) network with an attention mechanism. A bidirectional LSTM considers the sequential relations between words in both directions and thus captures more feature information, so its effect is better than a unidirectional LSTM, and it is widely used in natural language processing tasks. Considering the limitation of the bidirectional LSTM in computing hidden layers, the attention mechanism is used to increase the weights of strongly associated words and reduce the weights of weakly associated words.
The attention model simulates human attention: at a particular moment, attention concentrates on one specific place and allocates little to other parts. The attention mechanism improves the efficiency of processing large-scale input data and reduces the input dimensionality by selecting a subset of the input. It also focuses on useful information, letting the model concentrate on the most salient parts of the input during training and thereby improving the training result. The attention mechanism model was proposed as an aid to the encoder-decoder framework, to remedy some defects in the design of the encoder-decoder structure.
After the attention mechanism is added, the calculation for the image fusion feature obtained in step 4 is as follows:

$$c_i = \sum_j a_{ij} s_j$$

$$\delta_i = \mathrm{softmax}(f_{att}(h_i, s_j))$$

$$f_{att}(h_i, s_j) = \tanh(W_1 h_i + W_2 s_j)$$

where $c_i$ denotes the context (environment) vector, $h_i$ the current hidden state, $s_j$ a previous hidden state, and $a_{ij}$ the attention probability matrix; the context vectors are predicted together with the current hidden state $h_i$, and $c_i$ is obtained by averaging over previous positions; $\delta_i$ is the weight assigned to the current state, i.e. the attention weight, and the attention function $f_{att}(h_i, s_j)$ computes the unnormalized alignment score between $h_i$ and $s_j$ through a fully connected layer.
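The attention step above can be sketched in NumPy. Since the patent does not print the vector that reduces the tanh output to a scalar score, collapsing it with a sum stands in for that scoring vector; all dimensions and weights are hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_i, S, W1, W2):
    """Score each previous state s_j with f_att = tanh(W1 h_i + W2 s_j),
    normalize the scores with softmax, and average the states into c_i."""
    scores = np.array([np.sum(np.tanh(W1 @ h_i + W2 @ s_j)) for s_j in S])
    delta = softmax(scores)              # attention weights over previous states
    c_i = delta @ S                      # context (environment) vector
    return c_i, delta

rng = np.random.default_rng(1)
d = 8
h_i = rng.standard_normal(d)             # current hidden state
S = rng.standard_normal((5, d))          # previous hidden states s_1..s_5
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
c_i, delta = attention_context(h_i, S, W1, W2)
print(c_i.shape, round(float(delta.sum()), 6))  # (8,) 1.0
```

The weights delta form a probability distribution over the previous states, so the context vector is a convex combination of them — the "averaging over previous positions" the text describes.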
The bidirectional long short-term memory network converts a sequence of N words into N corresponding M-dimensional vectors, and the Bi-LSTM units then compute the contextual relations of each word. Indexing the words in the sentence by t = 1, ..., N, the bidirectional long short-term memory unit is expressed as:

$$x_t = W_\omega \theta_t$$

$$e_t = f(W_e x_t + b_e)$$

where $\theta_t$ is a one-hot column vector indicating the index of the word at position t, and the weight parameter $W_\omega$ is a word embedding matrix; the bidirectional LSTM has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; $s_t$, an h-dimensional vector obtained through the mapping function f, encodes the position in the sentence of the t-th word and its surrounding words, and b denotes the bias.
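The one-hot lookup x_t = W_ω θ_t and the two opposite-direction workflows can be sketched minimally. The simple tanh update below stands in for the full LSTM cell equations, which the patent does not print; vocabulary size and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
V, M, H = 10, 6, 4                       # vocab size, embedding dim, hidden dim
W_omega = rng.standard_normal((M, V))    # word embedding matrix W_omega
W_e = rng.standard_normal((H, M))
b_e = rng.standard_normal(H)

def embed(word_index):
    """x_t = W_omega @ theta_t, with theta_t a one-hot column vector."""
    theta = np.zeros(V)
    theta[word_index] = 1.0
    return W_omega @ theta

def run_direction(xs):
    """One directional pass e_t = f(W_e x_t + b_e); tanh is an assumed f."""
    return [np.tanh(W_e @ x + b_e) for x in xs]

sentence = [3, 1, 4, 1, 5]               # hypothetical word indices, t = 1..N
xs = [embed(t) for t in sentence]
forward = run_direction(xs)              # left-to-right workflow
backward = run_direction(xs[::-1])[::-1] # right-to-left workflow
states = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(len(states), states[0].shape)      # 5 (8,)
```

Concatenating the forward and backward states at each position is the usual way the two independent workflows are combined into one representation per word.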
In step 6, the present invention corrects the preliminary image description obtained in step 5 using the local image features extracted in step 3; the correction process is shown in Fig. 4.
In step 3, Faster R-CNN extracted the position information of the objects in the picture and also predicted the category of each detected object. These predicted categories are the object nouns extracted from the local image features. Since pictures contain objects of many different kinds, the present invention uses multinomial logistic regression, also known as the softmax function, which solves multi-class classification problems.
Assume the input of the softmax function is a c-dimensional vector z; softmax is a normalized exponential function whose output is also a c-dimensional vector y with values between 0 and 1, defined as:

$$y_c = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}}$$

the denominator acts as a regularization term, so that:

$$\sum_{c=1}^{C} y_c = 1$$

As the output layer of a neural network, the softmax values can be represented by c neurons. For a given input z, the probability of each category c = 1, ..., C is:

$$P(t = c \mid z) = y_c$$

where $P(t = c \mid z)$ denotes the probability that the sample belongs to category c given input z.
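The softmax definition above is a few lines of NumPy; the class scores are hypothetical, and the max-shift is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """y_c = exp(z_c) / sum_d exp(z_d); shift by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# hypothetical class scores for c = 1..4 object categories
z = np.array([2.0, 1.0, 0.1, -1.0])
y = softmax(z)
predicted = int(np.argmax(y))            # index of the predicted object category
print(round(float(y.sum()), 6), predicted)  # 1.0 0
```

Taking the argmax of y gives the object noun's category, which is what the correction step consumes.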
For the picture description sentence generated in step 5, the present invention first segments the sentence into words and then analyzes the part of speech of each word; using a part-of-speech parser and a noun corpus, a set of (word, part-of-speech) pairs is generated, and the parts of speech are divided here into noun and non-noun.
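The noun-extraction step can be sketched with a tiny hand-rolled lexicon standing in for the part-of-speech parser and noun corpus; a real system would use a trained tagger, and every entry in the lexicon here is purely illustrative.

```python
# A tiny stand-in for the part-of-speech parser plus noun corpus: the pipeline
# shape (segment -> tag -> filter) matches the text, the lexicon does not.
NOUN_CORPUS = {"dog", "frisbee", "grass", "man", "cat"}   # hypothetical entries

def tag_words(sentence):
    """Segment the sentence, then emit (word, pos) pairs, pos in {'noun', 'non-noun'}."""
    words = sentence.lower().rstrip(".").split()
    return [(w, "noun" if w in NOUN_CORPUS else "non-noun") for w in words]

def description_nouns(sentence):
    """Keep only the words tagged as nouns."""
    return [w for w, pos in tag_words(sentence) if pos == "noun"]

caption = "A dog is catching a frisbee on the grass."
print(description_nouns(caption))  # ['dog', 'frisbee', 'grass']
```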
Having obtained the nouns from object extraction and the nouns from the sentence, a relation is needed to connect the two groups; the present invention uses WordNet to solve this problem. WordNet is a special English dictionary containing rich semantic and part-of-speech information that distinguishes it from a dictionary in the ordinary sense. WordNet groups entries by meaning: a synonym set, or synset, represents a group of words with the same meaning. WordNet gives each synset a concise gloss and links synsets by part of speech and semantics. WordNet is a well-developed knowledge network from which the part-of-speech and semantic relations between words, together with the structural classification of parts of speech, can be found. The extracted nouns can therefore be converted through WordNet into a set of word vectors, and the similarity between the object nouns and the description nouns can be obtained by similarity calculation between the vectors. If the similarity is large, the description content is relatively accurate; if the similarity is low, the description is in error, and the noun in the description needs to be replaced with the object noun.
Using WordNet, candidate synonyms are extracted from the WordNet synonym sets, the words are vectorized through WordNet to obtain word features, and the feature set SW is computed:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image objects, and {W_C} denotes the word vectors of the nouns in the description.
With the lexical features defined above, the distance between words can serve as the basis for judging their similarity: the smaller the distance between two words, the greater their similarity. The lexical similarity in WordNet [formula not reproduced in the source] is computed from these features, where IDF(w_i) denotes the inverse of the number of documents in the WordNet-built corpus in which the word w_i occurs, $K_S$ denotes the weight of the synonym feature, and $K_C$ denotes the weight of the generic feature. If the value of Similarity(W_i, W_j) is lower than 1, the similarity of the two words is considered low.
Step 6 performs targeted correction of the described objects in the image description sentence using the local image information extracted in step 3, preventing object description errors.
Fig. 5 illustrates an image description result generated by the invention.
Although the contents of the present invention have been described in detail through the preferred embodiments above, it should be appreciated that the above description is not to be considered a limitation of the present invention. After those skilled in the art have read the above content, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention shall be defined by the appended claims.
Claims (9)
1. An image description method, characterized by comprising the steps of:
extracting global image features from the picture with a VGG convolutional neural network;
extracting local image features from the picture with a Faster R-CNN network;
merging the global image features and local image features through a global-local feature fusion algorithm to obtain the fused image feature;
processing the fused image feature with a bidirectional long short-term memory network equipped with an attention mechanism to generate a preliminary image description sentence;
computing WordNet word-vector similarity between the nouns in the preliminary image description sentence and the image target information obtained during local feature extraction, and correcting the preliminarily generated sentence accordingly to generate the final image description sentence.
2. The image description method of claim 1, characterized in that the image convolution formula of the VGG convolutional neural network is:

x_j^l = f( Σ_{i ∈ M_j} x_i^{l-1} * k_{ij}^l + b_j^l )

where x_j^l denotes the j-th feature map in layer l, M_j the set of input windows, x_i^{l-1} the i-th unit of the input layer l-1, k_{ij}^l the kernel connecting the i-th input unit to the j-th convolutional map in layer l, b_j^l the j-th bias in layer l, and f an excitation (activation) function;
the VGG convolutional neural network comprises 5 convolutional layers.
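The per-map convolution in claim 2 can be sketched as follows. This is a minimal pure-Python illustration, not the VGG implementation itself; the choice of ReLU for the excitation function f and the valid-padding shapes are assumptions, since the claim only names "an excitation function".

```python
def conv2d_single(inputs, kernels, bias, f=lambda v: max(v, 0.0)):
    """Compute one output map x_j^l = f(sum_{i in M_j} x_i^{l-1} * k_{ij}^l + b_j^l).

    inputs:  list of 2D input maps x_i^{l-1}
    kernels: matching list of 2D kernels k_{ij}^l
    bias:    scalar b_j^l
    f:       excitation function (ReLU assumed here)
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh = len(inputs[0]) - kh + 1          # valid padding
    ow = len(inputs[0][0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            s = bias
            for x, k in zip(inputs, kernels):   # sum over i in M_j
                for u in range(kh):
                    for v in range(kw):
                        s += x[r + u][c + v] * k[u][v]
            out[r][c] = f(s)                    # apply excitation f
    return out
```

A real VGG stacks many such maps per layer and runs them as batched tensor operations; this loop form only mirrors the formula term by term.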
3. The image description method of claim 1, characterized in that the method of extracting the local image features from the picture with the Faster R-CNN network comprises:
in the Faster R-CNN network, the original image is converted into a set of feature maps by multiple convolutional layers; the RPN network is trained on the feature maps to generate candidate region boxes; the ROI pooling layer obtains the category of each candidate target from the region boxes and the feature maps, and regression yields the final refined position of each detection box; after the target regions are extracted, the regions whose share of the picture is greater than P are retained, and convolutional feature extraction with the VGG network is applied to the retained regions, yielding an N*N-dimensional feature matrix;
where P = Sobject / Spicture denotes the proportion of the target region in the whole picture, Sobject denotes the area of the target region, and Spicture denotes the area of the whole picture.
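The area-ratio screening in claim 3 can be sketched directly from P = Sobject / Spicture. The (x1, y1, x2, y2) box format and the 0.05 threshold below are illustrative assumptions; the patent does not fix a value for the threshold P.

```python
def region_ratio(box, img_w, img_h):
    """P = S_object / S_picture for an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / float(img_w * img_h)

def screen_regions(boxes, img_w, img_h, p_min=0.05):
    """Keep only detected regions whose share of the picture exceeds p_min."""
    return [b for b in boxes if region_ratio(b, img_w, img_h) > p_min]
```

In the pipeline, the surviving boxes would then be cropped and passed through the VGG feature extractor.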
4. The image description method of claim 1, characterized in that in the expression of the global-local feature fusion algorithm, Gf, Lf and Mf denote the global feature, the local feature and the fused feature respectively; in the objective function, heterogeneous data are pushed as far apart as possible after projection while homogeneous data are kept as close as possible; the constant K is a balance factor taking a positive value; and the constraint condition normalizes the projection matrix.
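The fusion objective's formula is not fully legible in this text, so the following is only a stand-in sketch: it projects the global feature Gf and the local feature Lf with hypothetical weight matrices, normalizes each projection (echoing the claim's normalization constraint), and concatenates them into a fused feature Mf. The actual patented objective jointly optimizes the projections and is not reproduced here.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def l2_normalize(v):
    """Scale v to unit L2 norm (identity for the zero vector)."""
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v] if n else v

def fuse(Gf, Lf, Wg, Wl):
    """Mf = [normalize(Wg Gf) ; normalize(Wl Lf)] - a concatenation stand-in."""
    return l2_normalize(matvec(Wg, Gf)) + l2_normalize(matvec(Wl, Lf))
```

In the described pipeline, Mf would then be fed to the encoder in place of either feature alone.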
5. The image description method of claim 1, characterized in that the method by which the bidirectional long short-term memory network with attention mechanism processes the fused image feature comprises:

δ_i = softmax(f_att(h_i, s_j))
f_att(h_i, s_j) = tanh(W_1 h_i + W_2 s_j)

where C_i denotes the environment (context) vector, h_i the current hidden state, s_j a previous hidden state, a_ij the attention probability matrix, and δ_i the weight attached to the current state, i.e. the attention weight; f_att(h_i, s_j) is the unnormalized allocation score between h_i and s_j computed by the attention function in a fully connected manner;

the index t = 1, ..., N ranges over the words of the sentence, and the bidirectional long short-term memory unit is expressed as:

x_t = W_ω θ_t
e_t = f(W_e x_t + b_e)

where θ_t is an indicator column vector denoting the index of the word at position t, and the weight parameter W_ω is a word embedding matrix; the bidirectional long short-term memory unit has two independent workflows, a left-to-right memory unit and a right-to-left memory unit; s_t, an h-dimensional vector, is the position of the t-th word and its surrounding words in the sentence obtained through the mapping function f, and b denotes the bias.
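The additive attention of claim 5 can be sketched as follows. Scores f_att(h_i, s_j) = tanh(W1 h_i + W2 s_j) are softmax-normalized into attention weights, which then mix the past hidden states into a context vector C_i. Scalar states and scalar weights w1, w2 are simplifying assumptions to keep the sketch short, and the weighted-sum form of C_i is the standard attention construction rather than a formula given in this text.

```python
import math

def attention(h_i, states, w1=1.0, w2=1.0):
    """Additive attention over past states: returns (weights, context C_i)."""
    scores = [math.tanh(w1 * h_i + w2 * s) for s in states]  # f_att(h_i, s_j)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]                 # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]                          # attention probs a_ij
    context = sum(a * s for a, s in zip(weights, states))    # C_i (assumed weighted sum)
    return weights, context
```

With vector-valued states, w1 and w2 become the matrices W1 and W2 and an extra projection reduces the tanh output to a scalar score.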
6. The image description method of claim 1, characterized in that the method of correcting the image description sentence comprises:
obtaining target nouns from the local image features using a softmax function;
parsing description nouns from the preliminary image description sentence;
calculating the similarity between the target nouns and the description nouns using WordNet, and replacing a description noun with the target noun when their similarity is below 1.
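The correction step of claim 6 can be sketched as a noun-replacement pass over the draft caption. The `similarity` argument below is a stand-in for the WordNet-based measure of claim 9, and the token/flag input format is an illustrative assumption.

```python
def correct_caption(tokens, noun_flags, target_nouns, similarity, threshold=1.0):
    """Replace each description noun whose best similarity to the detected
    target nouns falls below the threshold with that best-matching target noun."""
    out = []
    for tok, is_noun in zip(tokens, noun_flags):
        if is_noun and target_nouns:
            best = max(target_nouns, key=lambda t: similarity(tok, t))
            if similarity(tok, best) < threshold:
                tok = best                       # targeted correction
        out.append(tok)
    return out
```

Non-noun tokens pass through untouched, so only the described objects are corrected, as the patent intends.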
7. The image description method of claim 6, characterized in that the method of obtaining target nouns from the local image features with the softmax function comprises:
the input of the softmax function, a normalized exponential function, is a c-dimensional vector z, and its output is a c-dimensional vector y whose entries lie between 0 and 1:

y_c = e^{z_c} / Σ_{d=1}^{C} e^{z_d}

the denominator in the formula acts as a normalizing term so that the entries of y sum to 1;
used as the output layer of a neural network, the softmax values are represented by C neurons, and given an input z, the probability of each class c = 1, ..., C is expressed as:

P(t = c | z) = y_c

where P(t = c | z) denotes the probability that, given the input z, the object belongs to class c.
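The normalized exponential of claim 7 in code (the max-shift is a standard numerical-stability trick, not part of the claim):

```python
import math

def softmax(z):
    """y_c = exp(z_c) / sum_d exp(z_d): entries in (0, 1) that sum to 1."""
    m = max(z)                        # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)                 # the normalizing denominator
    return [e / total for e in exps]
```

Applied to the detector's class logits, the largest entry of the output picks the target noun's class.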
8. The image description method of claim 6, characterized in that the method of parsing the description nouns from the preliminary image description sentence comprises:
first segmenting the picture description sentence into words, then analysing the part of speech of each word; using a part-of-speech parser and a part-of-speech corpus of nouns, a set of (word, part-of-speech) pairs is generated, with each part of speech classified as noun or non-noun.
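The parsing of claim 8 produces (word, part-of-speech) pairs with a noun/non-noun split. A real tagger (e.g. NLTK's `pos_tag`) needs downloaded model data, so this sketch substitutes a toy noun lexicon; only the output shape matches the claim.

```python
# Hypothetical stand-in for the noun part-of-speech corpus named in the claim.
NOUN_LEXICON = {"dog", "cat", "ball", "park"}

def tag_nouns(sentence):
    """Split the caption into words and tag each as noun / non-noun."""
    return [(w, "noun" if w in NOUN_LEXICON else "non-noun")
            for w in sentence.lower().split()]
```

The noun-tagged words are the "description nouns" handed to the similarity check of claim 9.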
9. The image description method of claim 6, characterized in that the method of calculating the similarity between a target noun and a description noun using WordNet comprises:
extracting candidate synonyms from the WordNet synonym sets, vectorizing the words through WordNet to obtain word features, and computing the feature set SW:

Feature(SW) = {{W_S}, {W_C}}

where {W_S} denotes the word-vector features of the nouns extracted from the image targets and {W_C} the word vectors of the nouns in the description; the lexical similarity is computed by a formula in which IDF(w_i) denotes the inverse document frequency of the word w_i obtained by training on WordNet, K_S denotes the weight of the synonym feature, and K_C denotes the weight of the common-category feature.
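The exact similarity formula of claim 9 is not legible in this text. As a stand-in, the sketch below combines cosine similarities over the synonym-feature vectors {W_S} and category-feature vectors {W_C} with the weights K_S and K_C; the cosine choice, the weight values, and the omission of the IDF term are all assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors (0 for a zero vector)."""
    num = sum(a * b for a, b in zip(u, v))
    du = sum(a * a for a in u) ** 0.5
    dv = sum(b * b for b in v) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def word_similarity(ws_i, wc_i, ws_j, wc_j, k_s=0.6, k_c=0.4):
    """K_S-weighted synonym-feature similarity plus K_C-weighted
    category-feature similarity (IDF weighting omitted in this sketch)."""
    return k_s * cosine(ws_i, ws_j) + k_c * cosine(wc_i, wc_j)
```

Under claim 6's rule, a description noun whose score against every target noun stays below 1 would be replaced by its best-scoring target noun.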
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910688842.3A CN110390363A (en) | 2019-07-29 | 2019-07-29 | A kind of Image Description Methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110390363A true CN110390363A (en) | 2019-10-29 |
Family
ID=68287863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910688842.3A Pending CN110390363A (en) | 2019-07-29 | 2019-07-29 | A kind of Image Description Methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390363A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106447658A (en) * | 2016-09-26 | 2017-02-22 | 西北工业大学 | Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network) |
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN109711464A (en) * | 2018-12-25 | 2019-05-03 | 中山大学 | Image Description Methods based on the building of stratification Attributed Relational Graps |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909736A (en) * | 2019-11-12 | 2020-03-24 | 北京工业大学 | Image description method based on long-short term memory model and target detection algorithm |
CN111079658A (en) * | 2019-12-19 | 2020-04-28 | 夸氪思维(南京)智能技术有限公司 | Video-based multi-target continuous behavior analysis method, system and device |
CN111079658B (en) * | 2019-12-19 | 2023-10-31 | 北京海国华创云科技有限公司 | Multi-target continuous behavior analysis method, system and device based on video |
CN111325323B (en) * | 2020-02-19 | 2023-07-14 | 山东大学 | Automatic power transmission and transformation scene description generation method integrating global information and local information |
CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
CN111553371A (en) * | 2020-04-17 | 2020-08-18 | 中国矿业大学 | Image semantic description method and system based on multi-feature extraction |
CN111626968A (en) * | 2020-04-29 | 2020-09-04 | 杭州火烧云科技有限公司 | Pixel enhancement design method based on global information and local information |
CN111310867A (en) * | 2020-05-11 | 2020-06-19 | 北京金山数字娱乐科技有限公司 | Text generation method and device based on picture |
CN113743096A (en) * | 2020-05-27 | 2021-12-03 | 南京大学 | Crowdsourcing test report similarity detection method based on natural language processing |
CN111860235A (en) * | 2020-07-06 | 2020-10-30 | 中国科学院空天信息创新研究院 | Method and system for generating high-low-level feature fused attention remote sensing image description |
CN111916050A (en) * | 2020-08-03 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112257759A (en) * | 2020-09-27 | 2021-01-22 | 华为技术有限公司 | Image processing method and device |
CN112232300A (en) * | 2020-11-11 | 2021-01-15 | 汇纳科技股份有限公司 | Global-occlusion adaptive pedestrian training/identification method, system, device, and medium |
CN112232300B (en) * | 2020-11-11 | 2024-01-19 | 汇纳科技股份有限公司 | Global occlusion self-adaptive pedestrian training/identifying method, system, equipment and medium |
CN112528989B (en) * | 2020-12-01 | 2022-10-18 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN112528989A (en) * | 2020-12-01 | 2021-03-19 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN114049501A (en) * | 2021-11-22 | 2022-02-15 | 江苏科技大学 | Image description generation method, system, medium and device fusing cluster search |
CN114333804A (en) * | 2021-12-27 | 2022-04-12 | 北京达佳互联信息技术有限公司 | Audio classification identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110390363A (en) | A kind of Image Description Methods | |
CN107330100B (en) | Image-text bidirectional retrieval method based on multi-view joint embedding space | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN109344288A (en) | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism | |
Sharma et al. | A survey of methods, datasets and evaluation metrics for visual question answering | |
CN114936623B (en) | Aspect-level emotion analysis method integrating multi-mode data | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
CN108509521A (en) | A kind of image search method automatically generating text index | |
CN111949824A (en) | Visual question answering method and system based on semantic alignment and storage medium | |
Li et al. | Multi-modal gated recurrent units for image description | |
Cheng et al. | Stack-VS: Stacked visual-semantic attention for image caption generation | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN117033609A (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN116975615A (en) | Task prediction method and device based on video multi-mode information | |
CN116977844A (en) | Lightweight underwater target real-time detection method | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
Guo et al. | Matching visual features to hierarchical semantic topics for image paragraph captioning | |
Nam et al. | A survey on multimodal bidirectional machine learning translation of image and natural language processing | |
US11494431B2 (en) | Generating accurate and natural captions for figures | |
Zheng et al. | Weakly-supervised image captioning based on rich contextual information | |
Pu et al. | Adaptive feature abstraction for translating video to language | |
CN117009570A (en) | Image-text retrieval method and device based on position information and confidence perception | |
CN116932736A (en) | Patent recommendation method based on combination of user requirements and inverted list |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191029
RJ01 | Rejection of invention patent application after publication |