CN107665356A - A kind of image labeling method - Google Patents
A kind of image labeling method
- Publication number
- CN107665356A
- Authority
- CN
- China
- Prior art keywords
- image
- labeling method
- word
- image labeling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The present invention relates to an image labeling method comprising the following steps: step 1) define the objective function of the image labeling model; step 2) input the image into a CNN model to obtain original image features; step 3) weight the original image features; step 4) input the information into an LSTM model; step 5) back-propagate the error produced by the prediction result. In the image labeling method provided by the invention, low-level image features are first extracted by a convolutional neural network; a focus (attention) mechanism then extracts the image features of the specific image regions related to the annotation words, and these are input into a long short-term memory network model to generate the corresponding predicted annotation words, finally realizing image labeling. The method offers excellent labeling performance and high labeling precision, and can well meet the needs of practical applications.
Description
Technical field
The invention belongs to the technical field of image processing, and in particular relates to an image labeling method.
Background art
In recent years, researchers have worked continuously on computers' semantic understanding of images. Automatic image annotation lets a computer automatically assign keywords to the entities in an image, and it is a key technology in the field of image retrieval. With the rapid development of multimedia and Internet information technology, hundreds of millions of new images appear on the Internet every day. Compared with text, images convey information more intuitively and accurately, so in today's era of information explosion images let users obtain the information they need more conveniently, quickly, and accurately. Image information has become one of the most important channels of information dissemination. How to help users quickly and accurately find the images they need in such massive image data has therefore become a research hotspot in the multimedia information field in recent years, and automatic image annotation, as one of the key technologies of image retrieval, has become an important topic for many researchers.
As an important technology in image retrieval, automatic image annotation has high research significance and commercial value. Since it was proposed around the year 2000, many researchers have devoted themselves to it and many automatic image annotation methods have been put forward. Although these methods improve the accuracy and efficiency of image retrieval to a certain extent, because of the image "semantic gap" the accuracy of current retrieval systems based on automatic image annotation is still unsatisfactory: the technology is still at a development stage, and insufficient annotation performance and precision are defects of the prior art. Image information has become an important channel of information dissemination on the Internet. At present the world's largest image sharing platform, Flickr, has close to one billion users and holds more than ten billion images. Retrieving the images a user needs quickly and accurately from such a huge image library is an urgent demand of the big-data era, yet most current automatic image annotation techniques generalize poorly on such huge image libraries, so researching new automatic image annotation techniques for big data is of great significance.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide an image labeling method with excellent labeling performance and high labeling precision.
To achieve the above object, the technical scheme provided by the invention is as follows:
An image labeling method comprising the following steps:
Step 1) define the objective function of the image labeling model;
Step 2) input the image into a CNN model to obtain original image features;
Step 3) weight the original image features;
Step 4) input the information into an LSTM model;
Step 5) back-propagate the error produced by the prediction result.
Further, the objective function in step 1) is θ* = argmax_θ Σ_{t=1}^{N} log p_t(y_t), the log-likelihood of the correct annotation words given image I, where y = {y_1, ..., y_N}, θ represents all parameters that need to be trained in the model, I represents an image, y represents the finally predicted annotation combination, i.e. the final annotation words, K represents the number of words in the vocabulary, and N represents the number of annotation words.
Further, the original image features in step 2) are the feature map of a certain convolutional layer before the fully connected layers of the CNN; the original image features consist of L D-dimensional features, and each D-dimensional feature maps to a different location region of the original image.
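For concreteness, the L x D feature layout described above can be sketched as follows. This is a minimal NumPy illustration with hypothetical sizes (D = 512 channels, a 14x14 spatial grid), not values taken from the patent:

```python
import numpy as np

# Hypothetical conv feature map from a CNN layer before the fully
# connected layers: (channels D, height H, width W).
D, H, W = 512, 14, 14
feature_map = np.random.rand(D, H, W)

# Flatten the H*W spatial grid into L = H*W location vectors of
# dimension D, so each row of `a` corresponds to one region of the
# original image.
L = H * W
a = feature_map.reshape(D, L).T  # shape (L, D)
```

Each of the L = 196 rows of `a` is the D-dimensional feature of one spatial location, which is exactly the structure the focusing weights operate on.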
Further, step 3) weights the original image features using a focusing weight vector α_t. The focusing weight vector α_t is an L-dimensional vector, and the value of each dimension represents the weight of the image feature at the corresponding location.
The focusing weight vector α_t = softmax(W_e e_t), where
e_t represents the intermediate state information of the focus mechanism at time t, a represents the original image features, and h_{t-1} represents the output of the LSTM model at time t-1.
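A minimal NumPy sketch of the focusing weights α_t = softmax(W_e e_t) and the resulting weighted feature; all shapes, the random W_e and e_t, and the feature matrix a here are illustrative assumptions, not parameters from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# L location regions, E-dim intermediate attention state (sizes assumed).
L, E = 196, 256
W_e = rng.standard_normal((L, E)) * 0.01   # focusing-weight decoding parameter
e_t = rng.standard_normal(E)               # intermediate state of the focus mechanism

# alpha_t = softmax(W_e e_t): one weight per location region, summing to 1.
scores = W_e @ e_t
alpha_t = np.exp(scores - scores.max())
alpha_t /= alpha_t.sum()

# Weighted image feature z_t: attention-weighted sum of the L region features.
a = rng.standard_normal((L, 512))   # original features, one D-dim row per region
z_t = alpha_t @ a                   # shape (512,)
```

The key invariant is that α_t is a probability distribution over the L regions, so z_t emphasizes the regions the model currently attends to.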
Further, in step 4) the LSTM input information is x_t = [W_y y_{t-1}, W_z z_t], where W_y is the word-encoding parameter, W_z is the image-feature encoding parameter, y_{t-1} is the correct annotation word of the image, and z_t is the image feature at the current time weighted with the focusing weight parameter.
Further, the correct annotation phrase of the image, Y = (y_0, y_1, y_2, ..., y_t, ..., y_n), is input into the LSTM model in order starting from time t = 1, where y_0 is a special word "start" marking the beginning of the annotation process and y_n is another special word "end" marking its end; y_{t-1} is input into the LSTM model after encoding with the word-vector encoding parameter W_y; z_t is input into the LSTM model after encoding with the image-feature encoding parameter W_z.
Further, the correct annotation words use one-hot encoding: each consists of an N-dimensional vector, where N represents the number of words in the word dictionary; except for the position of the corresponding annotation word, which is 1, all remaining positions are 0.
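The one-hot scheme can be illustrated as follows; the five-word vocabulary is a made-up example:

```python
import numpy as np

# A tiny illustrative word dictionary, including the special "start"
# and "end" words used by the annotation process.
vocab = ["start", "dog", "grass", "run", "end"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """N-dim vector: 1 at the word's dictionary position, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

y = one_hot("grass")
```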
Further, step 5) uses a loss function that sums the log-likelihood probabilities of all predicted annotation words being correct and takes the negative; the loss function is defined as L(I, y) = -Σ_{t=1}^{N} log p_t(y_t).
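Numerically, the loss is just the negated sum of the log probabilities the model assigns to the correct words; the probability values below are made-up stand-ins for model outputs p_t(y_t):

```python
import numpy as np

# Illustrative per-step probabilities p_t(y_t) for t = 1..N (N = 3 here).
p_correct = np.array([0.9, 0.7, 0.8])

# L(I, y) = -sum_{t=1}^{N} log p_t(y_t)
loss = -np.sum(np.log(p_correct))
```

Confident correct predictions (probabilities near 1) drive the loss toward 0; any low probability on a correct word inflates it.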
Further, step 5) also includes continuously updating the parameters in the model using stochastic gradient descent and the chain rule of derivation.
Further, the calculation formulas of the LSTM model are as follows:
i_t = σ(W_ix x_t + W_ih h_{t-1}),
o_t = σ(W_ox x_t + W_oh h_{t-1}),
f_t = σ(W_fx x_t + W_fh h_{t-1}),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx x_t + W_ch h_{t-1}),
h_t = o_t ⊙ c_t,
y_{t+1} = Softmax(W_y h_t),
where σ(·) and h(·) are activation functions and ⊙ is the elementwise (Hadamard) product; i_t is the input gate, controlling the input information at time t; f_t is the forget gate, controlling the selective forgetting of the memory information of the hidden layer at time t-1; o_t is the output gate, controlling the output information at time t; c_t is the memory information of the hidden layer at time t, determined jointly by the hidden-layer information of the previous time and the input information of the current time, and is the core unit of the LSTM; h_t is the output information of the hidden layer at time t; y_{t+1} is the prediction result obtained from h_t through the softmax classifier.
The image labeling method provided by the invention, in order to effectively alleviate the semantic gap between low-level image features and high-level semantics, proposes a deep neural network image labeling method based on a focus mechanism. The method first extracts low-level image features with a convolutional neural network (CNN), then uses the focus mechanism to extract the image features of specific image regions related to the annotation words and inputs them into a long short-term memory (LSTM) model to generate the corresponding predicted annotation words, finally realizing image labeling. Through the focus mechanism the method effectively combines the ability of the CNN to extract image features with the ability of the LSTM to extract image semantic features, exploiting both low-level image features and high-level image semantic features; it can better extract the image features related to image semantics, effectively improves image labeling precision, offers excellent labeling performance and high precision, and can well meet the needs of practical applications.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a structural diagram of the deep neural network image labeling model based on the focus mechanism;
Fig. 3 is a schematic diagram of the basic structure of a traditional neural network model;
Fig. 4 is a schematic diagram of the conventional structure of an RNN neural network model;
Fig. 5 is a schematic diagram of the internal structure of an LSTM neural unit.
Detailed description of the embodiments
In order to make the purpose, technical scheme, and advantages of the present invention clearer, the present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative work belong to the scope of protection of the present invention.
As shown in Fig. 1, an image labeling method comprises the following steps:
Step 1) establish the deep neural network image labeling model based on the focus mechanism and define the objective function of the model:
θ* = argmax_θ Σ_{t=1}^{N} log p_t(y_t),
where y = {y_1, ..., y_N}, θ represents all parameters that need to be trained in the model, I represents an image, y represents the finally predicted annotation combination, i.e. the final annotation words, K represents the number of words in the vocabulary, and N represents the number of annotation words. For the structure of the image labeling model, refer to Fig. 2.
Step 2) input image I into the CNN model to obtain the original image features a.
Considering that the focus mechanism weights the features of different image locations, the extracted original features should contain location information; the feature maps of the layers before the fully connected layers in a CNN model have a positional mapping relationship with the original image. The present invention selects the feature map of a certain convolutional layer before the fully connected layers of the CNN as the original image features; these consist of L D-dimensional features, each of which maps to a different location region of the original image.
Step 3) weight the original image features using the focusing weight vector.
The focus mechanism realizes different degrees of attention to the features of different location regions at different times; this attention to different locations is controlled by the focusing weight α_t. As shown in Fig. 2, starting from time t = 1 the model produces a focusing weight vector α_t at each time. α_t is an L-dimensional vector whose elements sum to 1, i.e. Σ_i α_ti = 1, and the value of each dimension represents the weight of the image feature at the corresponding location. Its calculation is shown in formulas (4.4) and (4.5):
α_t = softmax(W_e e_t) (4.5);
where e_t represents the intermediate state information of the focus mechanism at time t. When t = 0, e_0 is obtained from the image features a. When t > 0, e_t is determined jointly by the output h_{t-1} of the LSTM model at time t-1 and the intermediate state information e_{t-1} of the focus mechanism at time t-1. e_{t-1} can be understood as the memory module of the focus mechanism model: it remembers the attention information for image location regions at all times before t. This process can be intuitively understood as follows: the image region to attend to at the current time is determined by the image regions attended to at previous times (provided by e_{t-1}) and the semantic information remembered in the LSTM model at previous times (provided by h_{t-1}). α_t is obtained from e_t after decoding with the focusing-weight decoding parameter W_e and then applying the softmax classifier. When training starts, the focusing weight α_t obtained by the focus mechanism model cannot accurately focus the image features on the position in the image of the annotation word predicted at the current time; that is, there is a gap between the weighted image features obtained by applying α_t and the weighted image features that would focus exactly on the currently predicted annotation word. As training proceeds, the parameters W_a, W_h, and W_e in the focus mechanism model are continuously updated, this gap steadily decreases, and the final focus mechanism model can focus accurately.
The image feature input into the LSTM model after weighting is z_t; the z_t at time t is obtained by multiplying the original image features a by the focusing weight α_t at time t, and controls the attention to the features of different image locations at time t. The position on which the weighted image feature z_t input to the LSTM model at time t focuses is exactly the location of the annotation word output by the LSTM model at time t.
Step 4) input the information into the LSTM model: the correct annotation phrase of the image and the weighted image features are input into the LSTM model.
The LSTM input information is x_t = [W_y y_{t-1}, W_z z_t], where W_y is the word-encoding parameter and W_z is the image-feature encoding parameter. x_t consists of two parts. y_{t-1} is the correct annotation word of the image, in one-hot form: an N-dimensional vector, where N represents the number of words in the word dictionary, with a 1 at the position of the corresponding annotation word and 0 elsewhere. The correct annotation phrase of the image, Y = (y_0, y_1, y_2, ..., y_t, ..., y_n), is input into the LSTM model in order starting from time t = 1, where y_0 is a special word "start" marking the beginning of the annotation process and y_n is another special word "end" marking its end. y_{t-1} is input into the LSTM model after encoding with the word-vector encoding parameter W_y. The other part of x_t is the image feature z_t at the current time weighted with the focusing weight parameter; z_t is input into the LSTM model after encoding with the image-feature encoding parameter W_z.
The output information h_t of the LSTM hidden layer at each time is decoded with the output decoding parameter W_p to obtain the prediction result p_{t+1} = g(W_p · h_t + b_p), where g(·) represents the softmax classifier. p_{t+1} is the predicted probability, obtained with the LSTM model, of the annotation word following the annotation word input to the LSTM at the current time. However, there is a gap between the predicted annotation word obtained from p_{t+1} and the correct annotation word following the current LSTM input word; that is, the prediction result produces an error. This error needs to be back-propagated to ensure that, as the model trains, the gap between the LSTM model's prediction at each time and the correct prediction becomes smaller and smaller, finally yielding an image labeling model of higher precision.
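The decoding step p_{t+1} = g(W_p · h_t + b_p), with g the softmax classifier, can be sketched as follows; the vocabulary size K, hidden size, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

K, n = 1000, 512                    # vocabulary size, hidden size (assumed)
W_p = rng.standard_normal((K, n)) * 0.01   # output decoding parameter
b_p = np.zeros(K)
h_t = rng.standard_normal(n)        # stand-in for the LSTM hidden output

# Softmax over the K vocabulary words gives the next-word distribution.
logits = W_p @ h_t + b_p
p_next = np.exp(logits - logits.max())
p_next /= p_next.sum()
predicted_word_idx = int(np.argmax(p_next))   # the predicted annotation word
```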
Step 5) back-propagate the error produced by the prediction result: sum the log-likelihood probabilities of all predicted annotation words being correct and take the negative.
The training process of this model is the back-propagation of the error and the updating of the model parameters. Define the loss function L(I, y) = -Σ_{t=1}^{N} log p_t(y_t); the loss function is the result of summing the log-likelihood probabilities of all predicted annotation words being correct and taking the negative.
Parameters are updated using stochastic gradient descent (SGD) and the chain rule of derivation. Training continuously updates the parameters in the model so that the loss value L(I, y) is as small as possible. These parameters include the LSTM internal parameters, the focusing weight parameters (W_a, W_h, W_e), the word-encoding parameter W_y, the image-feature encoding parameter W_z, and the output decoding parameter W_p (the present invention directly uses a trained CNN model to extract image features, so the CNN model parameters are not updated); the above parameters are shared at every time during model training.
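The SGD update rule itself is simply θ ← θ - η · ∂L/∂θ; the toy quadratic loss below is only to show the rule converging and is not the model's actual loss function:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: move parameters against the gradient of the loss."""
    return theta - lr * grad

# Toy loss L = theta^2, so dL/dtheta = 2*theta (chain rule would supply
# this gradient for each shared parameter in the real model).
theta = np.array([4.0])
for _ in range(50):
    grad = 2 * theta
    theta = sgd_step(theta, grad)
# theta shrinks toward the minimizer 0 as training proceeds
```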
A CNN is a kind of feedforward neural network containing two unique hidden-layer structures, the convolutional layer and the pooling layer. CNNs have good feature-extraction ability and are currently widely used in fields such as image, video, and speech.
A CNN has a unique network structure, mainly reflected in two aspects. First, the neurons of one layer are not fully connected to the neurons of the previous layer; that is, the connections between neurons are locally perceiving. Second, the neuron connections share identical weights; that is, the connections are weight-sharing. This unique locally-perceiving, weight-sharing network structure approximates a biological neural network, and such a model can effectively reduce the number of parameters in the network and its complexity. A convolutional layer in a CNN is composed of several convolution kernels; a convolution kernel is a filter of size M*M used to extract a certain local feature from each local position of the receptive field of the previous layer. The pooling layer reduces the dimensionality of the previous layer's convolutional features; concretely, the convolutional features of the previous layer are divided into multiple N*N regions, and the average (or maximum) value of each region is extracted as the feature after dimensionality reduction. After a series of convolutional, pooling, and fully connected layers, a CNN usually ends with a softmax classifier to handle multi-class classification problems.
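A toy NumPy illustration of the two hidden-layer operations just described: a small convolution kernel slid over local receptive fields, followed by max pooling for dimensionality reduction. The 4x4 input and 2x2 kernel values are made up for the example:

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)   # toy single-channel input
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])                  # a 2x2 filter

# Valid convolution (cross-correlation, as in most CNN libraries):
# slide the kernel over each 2x2 receptive field.
conv = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        conv[i, j] = np.sum(img[i:i+2, j:j+2] * kernel)

# 2x2 max pooling over the 4x4 input: keep the maximum of each region.
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))   # shape (2, 2)
```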
A recurrent neural network (Recurrent Neural Network, hereinafter RNN) has a unique memory structure. A neural network model contains three layers: input layer, hidden layer, and output layer. In a traditional neural network model, from input layer to hidden layer to output layer, the nodes within each layer are unconnected while nodes between layers are connected; the specific structure is shown in Fig. 3. Such a traditional neural network model has no memory function and is helpless for problems that must be computed from information already produced. For example, to predict the next word in a sentence, one usually needs the words produced before it: in a sentence such as "I am a basketball player, and I like to play basketball", the "play basketball" in the latter clause must be inferred from the "basketball player" in the former. An RNN model can remember the information produced at previous times and apply it in the computation at the current time. This benefit comes from the structural change of the RNN compared to the traditional neural network model: the input of the RNN hidden layer contains not only the output of the input layer at the current time but also the output information of the hidden layer at the previous time; that is, the nodes inside the hidden layer are connected. The specific structure is shown in Fig. 4.
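The recurrence that gives the RNN its memory, with the hidden state depending on both the current input and the previous hidden state, can be sketched in a few lines of NumPy; sizes, the tanh activation, and weight values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3                                   # hidden size, input size
W_x = rng.standard_normal((n, m)) * 0.5       # input-to-hidden weights
W_h = rng.standard_normal((n, n)) * 0.5       # hidden-to-hidden (recurrent) weights

# h_t = tanh(W_x x_t + W_h h_{t-1}): the hidden state carries information
# from earlier inputs forward, which a feedforward net cannot do.
h = np.zeros(n)
for x_t in rng.standard_normal((5, m)):       # a sequence of 5 inputs
    h = np.tanh(W_x @ x_t + W_h @ h)
```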
LSTM (Long Short-Term Memory) is an improved model of the RNN; its internal cell structure is shown in Fig. 5. The calculation of the LSTM model is given by formulas (3.1)-(3.6), where σ(·) and h(·) are activation functions and ⊙ is the elementwise (Hadamard) product. i_t is the input gate, controlling the input information at time t. f_t is the forget gate, controlling the selective forgetting of the memory information of the hidden layer at time t-1. o_t is the output gate, controlling the output information at time t. c_t is the memory information of the hidden layer at time t, determined jointly by the hidden-layer information of the previous time and the input information of the current time, and is the core unit of the LSTM. h_t is the output information of the hidden layer at time t. y_{t+1} is the prediction result obtained from h_t through the softmax classifier.
i_t = σ(W_ix x_t + W_ih h_{t-1}) (3.1);
o_t = σ(W_ox x_t + W_oh h_{t-1}) (3.2);
f_t = σ(W_fx x_t + W_fh h_{t-1}) (3.3);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx x_t + W_ch h_{t-1}) (3.4);
h_t = o_t ⊙ c_t (3.5);
y_{t+1} = Softmax(W_y h_t) (3.6).
The image labeling method provided by the invention, in order to effectively alleviate the semantic gap between low-level image features and high-level semantics, proposes a deep neural network image labeling method based on a focus mechanism. The method first extracts low-level image features with a convolutional neural network (CNN), then uses the focus mechanism to extract the image features of specific image regions related to the annotation words and inputs them into a long short-term memory (LSTM) model to generate the corresponding predicted annotation words, finally realizing image labeling. This image labeling method fuses the semantic information of the image and reduces the difference between low-level image features and high-level image semantics, which is significant for the accurate understanding of image semantics and can effectively improve image labeling precision. Through the focus mechanism the method effectively combines the ability of the CNN to extract image features with the ability of the LSTM to extract image semantic features, exploiting both low-level image features and high-level image semantic features; it can better extract the image features related to image semantics, effectively improves image labeling precision, offers excellent labeling performance and high precision, and can well meet the needs of practical applications.
The embodiments described above only express implementations of the present invention; their description is specific and detailed, but should not therefore be construed as limiting the scope of the claims of the present invention. It should be noted that persons of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all belong to the scope of protection of the present invention. Therefore, the scope of protection of the present patent shall be determined by the appended claims.
Claims (10)
1. An image labeling method, characterized by comprising the following steps:
Step 1) define the objective function of the image labeling model;
Step 2) input the image into a CNN model to obtain original image features;
Step 3) weight the original image features;
Step 4) input the information into an LSTM model;
Step 5) back-propagate the error produced by the prediction result.
2. The image labeling method according to claim 1, characterized in that the objective function in step 1) is θ* = argmax_θ Σ_{t=1}^{N} log p_t(y_t).
3. The image labeling method according to any one of claims 1 to 2, characterized in that the original image features in step 2) are the feature map of a certain convolutional layer before the fully connected layers of the CNN; the original image features consist of L D-dimensional features, and each D-dimensional feature maps to a different location region of the original image.
4. The image labeling method according to any one of claims 1 to 3, characterized in that step 3) weights the original image features using a focusing weight vector α_t; the focusing weight vector α_t is an L-dimensional vector, and the value of each dimension represents the weight of the image feature at the corresponding location;
the focusing weight vector α_t = softmax(W_e e_t), where
e_t represents the intermediate state information of the focus mechanism at time t, a represents the original image features, and h_{t-1} represents the output of the LSTM model at time t-1.
5. The image labeling method according to any one of claims 1 to 4, characterized in that in step 4) the LSTM input information is x_t = [W_y y_{t-1}, W_z z_t], where W_y is the word-encoding parameter, W_z is the image-feature encoding parameter, y_{t-1} is the correct annotation word of the image, and z_t is the image feature at the current time weighted with the focusing weight parameter.
6. The image labeling method according to any one of claims 1 to 5, characterized in that the correct annotation phrase of the image, Y = (y_0, y_1, y_2, ..., y_t, ..., y_n), is input into the LSTM model in order starting from time t = 1, where y_0 is a special word "start" marking the beginning of the annotation process and y_n is another special word "end" marking its end; y_{t-1} is input into the LSTM model after encoding with the word-vector encoding parameter W_y; z_t is input into the LSTM model after encoding with the image-feature encoding parameter W_z.
7. The image labeling method according to any one of claims 1 to 5, characterized in that the correct annotation words use one-hot encoding: each consists of an N-dimensional vector, where N represents the number of words in the word dictionary; except for the position of the corresponding annotation word, which is 1, all remaining positions are 0.
8. The image labeling method according to any one of claims 1 to 7, characterized in that step 5) uses a loss function that sums the log-likelihood probabilities of all predicted annotation words being correct and takes the negative; the loss function is defined as
L(I, y) = -Σ_{t=1}^{N} log p_t(y_t).
9. The image labeling method according to any one of claims 1 to 8, characterized in that step 5) also includes continuously updating the parameters in the model using stochastic gradient descent and the chain rule of derivation.
10. The image labeling method according to any one of claims 1 to 9, characterized in that the calculation formulas of the LSTM model are as follows:
i_t = σ(W_ix x_t + W_ih h_{t-1}),
o_t = σ(W_ox x_t + W_oh h_{t-1}),
f_t = σ(W_fx x_t + W_fh h_{t-1}),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx x_t + W_ch h_{t-1}),
h_t = o_t ⊙ c_t,
y_{t+1} = Softmax(W_y h_t).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710969648.3A CN107665356A (en) | 2017-10-18 | 2017-10-18 | A kind of image labeling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710969648.3A CN107665356A (en) | 2017-10-18 | 2017-10-18 | A kind of image labeling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107665356A true CN107665356A (en) | 2018-02-06 |
Family
ID=61098761
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710969648.3A Pending CN107665356A (en) | 2017-10-18 | 2017-10-18 | A kind of image labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107665356A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108665506A (en) * | 2018-05-10 | 2018-10-16 | 腾讯科技(深圳)有限公司 | Image processing method, device, computer storage media and server |
CN109032356A (en) * | 2018-07-27 | 2018-12-18 | 深圳绿米联创科技有限公司 | Sign language control method, apparatus and system |
CN109146858A (en) * | 2018-08-03 | 2019-01-04 | 诚亿电子(嘉兴)有限公司 | The secondary method of calibration of automatic optical inspection device problem |
CN109343920A (en) * | 2018-09-10 | 2019-02-15 | 深圳市腾讯网络信息技术有限公司 | A kind of image processing method and its device, equipment and storage medium |
WO2020186484A1 (en) * | 2019-03-20 | 2020-09-24 | 深圳大学 | Automatic image description generation method and system, electronic device, and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469065A (en) * | 2015-12-07 | 2016-04-06 | 中国科学院自动化研究所 | Recurrent neural network-based discrete emotion recognition method |
CN105701516A (en) * | 2016-01-20 | 2016-06-22 | 福州大学 | Method for automatically marking image on the basis of attribute discrimination |
CN105938485A (en) * | 2016-04-14 | 2016-09-14 | 北京工业大学 | Image description method based on convolution cyclic hybrid model |
CN107066464A (en) * | 2016-01-13 | 2017-08-18 | 奥多比公司 | Semantic Natural Language Vector Space |
CN107076567A (en) * | 2015-05-21 | 2017-08-18 | 百度(美国)有限责任公司 | Multilingual image question and answer |
2017-10-18: Application CN201710969648.3A filed in China (CN); published as CN107665356A; status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107076567A (en) * | 2015-05-21 | 2017-08-18 | Baidu USA LLC | Multilingual image question answering |
CN105469065A (en) * | 2015-12-07 | 2016-04-06 | Institute of Automation, Chinese Academy of Sciences | Recurrent neural network-based discrete emotion recognition method |
CN107066464A (en) * | 2016-01-13 | 2017-08-18 | Adobe Inc. | Semantic Natural Language Vector Space |
CN105701516A (en) * | 2016-01-20 | 2016-06-22 | Fuzhou University | Automatic image annotation method based on attribute discrimination |
CN105938485A (en) * | 2016-04-14 | 2016-09-14 | Beijing University of Technology | Image description method based on a convolutional-recurrent hybrid model |
Non-Patent Citations (4)
Title |
---|
Kelvin Xu et al.: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", online publication: https://arxiv.org/abs/1502.03044 * |
Oriol Vinyals et al.: "Show and Tell: A Neural Image Caption Generator", 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) * |
Zhao Guo et al.: "Attention-based LSTM with Semantic Consistency for Videos Captioning", Proceedings of the 24th ACM International Conference on Multimedia * |
Liang Huan: "Research on Image Semantic Understanding Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108665506A (en) * | 2018-05-10 | 2018-10-16 | Tencent Technology (Shenzhen) Co., Ltd. | Image processing method, device, computer storage medium and server |
CN108665506B (en) * | 2018-05-10 | 2021-09-28 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, computer storage medium and server |
CN109032356A (en) * | 2018-07-27 | 2018-12-18 | Shenzhen Lumi United Technology Co., Ltd. | Sign language control method, apparatus and system |
CN109146858A (en) * | 2018-08-03 | 2019-01-04 | Chengyi Electronics (Jiaxing) Co., Ltd. | Secondary checking method for problem points of automatic optical checking equipment |
CN109146858B (en) * | 2018-08-03 | 2021-09-17 | 诚亿电子(嘉兴)有限公司 | Secondary checking method for problem points of automatic optical checking equipment |
CN109343920A (en) * | 2018-09-10 | 2019-02-15 | Shenzhen Tencent Network Information Technology Co., Ltd. | Image processing method and device, equipment and storage medium thereof |
CN109343920B (en) * | 2018-09-10 | 2021-09-07 | 深圳市腾讯网络信息技术有限公司 | Image processing method and device, equipment and storage medium thereof |
WO2020186484A1 (en) * | 2019-03-20 | 2020-09-24 | Shenzhen University | Automatic image description generation method and system, electronic device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN107665356A (en) | An image labeling method | |
CN110390397B (en) | Textual entailment recognition method and device | |
CN110134946B (en) | Machine reading comprehension method for complex data | |
CN108628935B (en) | Question-answering method based on end-to-end memory network | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN106855853A (en) | Entity relation extraction system based on deep neural network | |
CN108875074A (en) | Answer selection method, device and electronic equipment based on cross-attention neural network | |
CN110222163A (en) | An intelligent question-answering method and system fusing CNN and bidirectional LSTM | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN110096567A (en) | Multi-turn dialogue reply selection method and system based on QA knowledge base reasoning | |
CN109597876A (en) | A multi-turn dialogue answer selection model and method based on reinforcement learning | |
CN112699216A (en) | End-to-end language model pre-training method, system, device and storage medium | |
CN109214006A (en) | Natural language inference method based on image-enhanced hierarchical semantic representation | |
CN113743099B (en) | System, method, medium and terminal for extracting terms based on self-attention mechanism | |
CN114428850B (en) | Text retrieval matching method and system | |
CN112200664A (en) | Repayment prediction method based on ERNIE model and DCNN model | |
CN114969278A (en) | Knowledge enhancement graph neural network-based text question-answering model | |
CN112215017A (en) | Mongolian-Chinese machine translation method based on pseudo-parallel corpus construction | |
CN114510946B (en) | Deep neural network-based Chinese named entity recognition method and system | |
CN115964459B (en) | Multi-hop reasoning question-answering method and system based on a food safety cognitive graph | |
CN115688753A (en) | Knowledge injection method and interaction system of Chinese pre-training language model | |
CN112417890B (en) | Fine granularity entity classification method based on diversified semantic attention model | |
Li et al. | Multimodal fusion with co-attention mechanism | |
Xu et al. | CNN-based skip-gram method for improving classification accuracy of Chinese text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180206 ||