CN105184303A - Image marking method based on multi-mode deep learning - Google Patents


Info

Publication number
CN105184303A
CN105184303A (application CN201510198325.XA)
Authority
CN
China
Prior art keywords
image
mark
alpha
layer
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510198325.XA
Other languages
Chinese (zh)
Other versions
CN105184303B (en)
Inventor
朱松豪
孙成建
师哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201510198325.XA priority Critical patent/CN105184303B/en
Publication of CN105184303A publication Critical patent/CN105184303A/en
Application granted granted Critical
Publication of CN105184303B publication Critical patent/CN105184303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses an image annotation method based on multi-modal deep learning. The method comprises the following steps: first, a deep neural network is pre-trained using unlabeled images; second, each single modality is optimized using back-propagation; finally, the weights between different modalities are optimized using an online exponentiated gradient algorithm. The method employs convolutional neural network techniques to optimize the parameters of the deep neural network and improves annotation accuracy. Experiments on public data sets show that the method effectively improves image annotation performance.

Description

An image annotation method based on multi-modal deep learning
Technical field
The present invention relates to image annotation methods, and in particular to an image annotation method based on multi-modal deep learning, belonging to the technical field of image processing.
Background
In recent years, with the rapid growth in the number of images, efficient annotation of image content is urgently needed to enable effective retrieval and management of large-scale image collections.
From a pattern-recognition perspective, image annotation assigns a set of labels to an image according to its content, and how suitable features are chosen to characterize the image content largely determines annotation performance. Owing to the well-known semantic-gap problem, prior-art semantic image annotation rarely achieves satisfactory results. In recent years, Hinton and others proposed using deep neural networks to learn effective features from training data. Different types of deep neural networks have been successfully applied to various language and information-retrieval tasks. Through their deep architectures, these methods discover hidden data structure and representative features in the training data and thereby improve system performance.
Summary of the invention
The object of the invention there are provided a kind of image labeling method learnt based on the multi-modal degree of depth, and the method is applied to convolutional neural networks technology, optimizes deep-neural-network parameter, improves mark precision.The method is summed up on the basis of single mode study, realize multi-modal study, wherein both comprise the low-level image feature of research token image, as color, shape or texture etc., also similarity function between dimensioned plan picture and mark is comprised, as linear similarity, cosine similarity and radial distance etc.
The present invention solves the technical scheme that its technical matters takes: the invention provides a kind of image labeling method learnt based on the multi-modal degree of depth, the method comprises the following steps:
Step 1: utilize the image pattern collection without label, the node weights of pre-training deep neural network.
Step 2: adopt back-propagation algorithm, optimize the weight of each single mode.
Step 3: the power gradient algorithm adopting on-line study, optimizes the weight between modality combinations.
The deep neural network of step 1 is an eight-layer convolutional neural network, in which the first five layers are convolutional layers and the remaining three are fully connected layers. The output of the fully connected layers feeds a Softmax classifier, which produces 1000 label classes; both the pre-training and the fine-tuning stages use a multinomial logistic regression objective.
Among the convolutional layers, the first, second and fifth layers are followed by normalization layers; to preserve invariance, all normalization layers use max pooling. In addition, all convolutional and fully connected layers use rectified linear units (ReLU) as the nonlinear activation function.
In the convolutional network used, all input images are resized to 256 × 256. The first two convolutional filters are set to 7 × 7 and 5 × 5 respectively, with a stride of 2; filters of this size capture information across all frequency bands, and the small stride avoids producing "dead features" that would harm the next layer. The remaining three convolutional layers are connected in sequence, with 3 × 3 filters and a stride of 1. Finally, each fully connected layer has an output size of 4096. During pre-training, the dropout rate of the first two fully connected layers is set to 0.6.
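As a rough arithmetic check of the layer dimensions above, the spatial sizes produced by the stated filter sizes and strides can be sketched as follows. This is an illustration only: it ignores the pooling and normalization layers and assumes zero padding, neither of which the description specifies.

```python
def conv_out(size, kernel, stride, pad=0):
    # Standard convolution arithmetic: floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

s = 256                  # unified input size, 256 x 256
s = conv_out(s, 7, 2)    # first convolutional layer: 7 x 7 filter, stride 2
s = conv_out(s, 5, 2)    # second convolutional layer: 5 x 5 filter, stride 2
for _ in range(3):       # remaining three convolutional layers: 3 x 3, stride 1
    s = conv_out(s, 3, 1)
print(s)                 # spatial size entering the 4096-wide fully connected layers
```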
The back-propagation optimization of each single modality in step 2 comprises:
1. Single-modality pre-training:
The convolutional neural network is pre-trained on an unlabeled training set to obtain intermediate representations of image objects and, at the same time, to initialize the network. The detailed process is as follows. First, contrastive divergence is used to train the node weights W_1 between the input layer and the first convolutional layer. Then, the conditional probability of the first convolutional layer's nodes is used as the input of the second convolutional layer:
p(Γ | x_j) = S(W_1, x_j)    (1)
where x_j is the j-th feature vector, Γ is the annotation information, and S(·) is one of the similarity functions below:
S(W_1, x_j) = W_1^T x_j / (||W_1|| ||x_j||)    (cosine function)
S(W_1, x_j) = W_1^T x_j    (linear function)
S(W_1, x_j) = exp(-||W_1 - x_j||^2 / (2σ))    (RBF function)    (2)
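The three similarity functions of equation (2) can be sketched in plain Python as follows. This is an illustration: the function and variable names are ours, and σ = 2 is taken from the experimental section later in the description.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_sim(w, x):
    # Cosine function: W1^T xj / (||W1|| ||xj||)
    return dot(w, x) / (math.sqrt(dot(w, w)) * math.sqrt(dot(x, x)))

def linear_sim(w, x):
    # Linear function: W1^T xj
    return dot(w, x)

def rbf_sim(w, x, sigma=2.0):
    # RBF function: exp(-||W1 - xj||^2 / (2*sigma)); sigma = 2 in the experiments
    d2 = sum((a - b) ** 2 for a, b in zip(w, x))
    return math.exp(-d2 / (2.0 * sigma))

w, x = [3.0, 4.0], [3.0, 4.0]
print(cosine_sim(w, x), linear_sim(w, x), rbf_sim(w, x))  # identical vectors
```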
Next, the first and second convolutional layers are combined to train the node weights W_2; the same method is used to train the node weights of the remaining three convolutional layers and the three fully connected layers.
2. Single-modality fine-tuning stage:
In the single-modality fine-tuning stage, back-propagation of the annotation error is used to optimize the node weights. From a pattern-recognition perspective, multi-label learning can be regarded as multi-task learning; therefore, the overall annotation error of the convolutional neural network can be regarded as the sum of the per-label errors. The node optimization process is described below for the l-th label error.
First, for an image x, represented as x_j under the j-th feature modality, the probability that it carries the l-th label Γ_l can be expressed by the posterior probability:
p_jl = exp(p(Γ_l | x_j)) / Σ_{k=1}^{L} p(Γ_k | x_j)    (3)
where L denotes the number of labels.
Then, the KL divergence between the predicted probabilities and the reference probabilities is minimized. Assuming each image has multiple labels, represented by a vector y ∈ R^{1×c}, y_l = 1 indicates that the label set of image x contains the l-th label, and y_l = 0 indicates that it does not. Let q_il denote the ground-truth probability that image x_i carries label l; the error of correctly assigning the l-th label to the images is then:
J_l = -Σ_{i=1}^{M} Σ_{l=1}^{L} [ q_il log(p_il) + (1 - q_il) log(1 - p_il) ]    (4)
The distribution error over all labels is:
J = Σ_{l=1}^{L} J_l    (5)
Finally, back-propagation is used to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers.
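The annotation error of equations (4) and (5) can be sketched as follows. This is a hedged illustration that computes the total error J directly; q and p are small hand-made arrays, and the eps guard against log(0) is our addition.

```python
import math

def annotation_error(q, p, eps=1e-12):
    # Equations (4)-(5): total cross-entropy between ground-truth label
    # indicators q[i][l] and predicted probabilities p[i][l], summed over
    # all images i and labels l. eps guards against log(0).
    J = 0.0
    for qi, pi in zip(q, p):
        for qil, pil in zip(qi, pi):
            J -= qil * math.log(pil + eps) + (1.0 - qil) * math.log(1.0 - pil + eps)
    return J

q = [[1, 0], [0, 1]]          # two images, two labels (ground truth)
p = [[0.9, 0.1], [0.2, 0.8]]  # predicted probabilities
print(annotation_error(q, p))
```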
The exponentiated gradient optimization of the weights between different modalities in step 3 comprises:
For the multi-modal deep network, another key task is to learn the optimal combination weights of the N modalities, α = (α_1, α_2, ..., α_n, ..., α_N), where each α_n is initialized to 1/N. The invention adopts an online exponentiated gradient algorithm to optimize the multi-modal weight combination:
α_{t+1} = argmin_α KL(α || α_t) + μ h_t(α)    (6)
where KL(·) denotes the KL divergence and h(α) the hinge loss:
D_KL(u || v) = Σ_i u_i ln(u_i / v_i),    h_t(α) = max(0, ψ - α^T S_t)    (7)
where S_t is:
S_t = (S_1(x, Γ+) - S_1(x, Γ-), ..., S_N(x, Γ+) - S_N(x, Γ-))^T    (8)
where label Γ+ is assumed to reflect the image content better than label Γ-.
A first-order Taylor expansion of h(α) at α_t simplifies the optimization problem, so equation (6) can be written in first-order Taylor form:
α_{t+1} = argmin_α KL(α || α_t) + μ [ h_t(α_t) + ∇h_t(α_t)(α - α_t) ]    (9)
If Γ+ and Γ- are not ranked in the correct order, the node weights α are updated automatically.
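One step of the KL-regularized update of equations (6) to (9) can be sketched as follows. Minimizing the linearized objective (9) under the KL penalty yields the familiar multiplicative (exponentiated gradient) update; the renormalization onto the simplex and all numeric values are our assumptions, not taken from the description.

```python
import math

def eg_update(alpha, S, mu=0.1, psi=1.0):
    # One exponentiated-gradient step for equations (6)-(9). S[n] is the
    # per-modality score difference S_n(x, G+) - S_n(x, G-). When the hinge
    # loss max(0, psi - alpha^T S) is active (G+ not ranked far enough above
    # G-), each weight is scaled by exp(mu * S_n) and the weights are
    # renormalized; otherwise alpha is left unchanged.
    margin = sum(a * s for a, s in zip(alpha, S))
    if psi - margin <= 0:   # ranking already correct: no update
        return list(alpha)
    new = [a * math.exp(mu * s) for a, s in zip(alpha, S)]
    z = sum(new)
    return [a / z for a in new]

N = 3
alpha = [1.0 / N] * N                        # alpha_n initialized to 1/N
alpha = eg_update(alpha, [0.5, -0.2, 0.1])   # hypothetical score differences
print(alpha)
```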
Beneficial effects:
1. The invention optimizes the parameters of the deep neural network and improves annotation accuracy.
2. The invention better realizes effective image annotation based on a deep neural network learning model.
3. The invention effectively improves image annotation performance.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the present invention.
Fig. 2 is the deep neural network model of the present invention.
Fig. 3 shows example images from the natural scene image library.
Fig. 4 shows images from the NUS-WIDE image library.
Fig. 5 shows example images from the IAPRTC-12 image database.
Fig. 6 shows the resulting modality combination weights on the three public image libraries.
Detailed description of embodiments
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides an image annotation method based on multi-modal deep learning. The method comprises: first, training a deep neural network with unlabeled images; second, optimizing each single modality with back-propagation; finally, optimizing the weights between different modalities with an online exponentiated gradient algorithm.
The deep neural network in the present invention is a convolutional neural network whose model structure is shown in Fig. 2. A series of experiments evaluates the performance of the proposed multi-modal deep learning image annotation algorithm.
Step 1: the data sets used to evaluate algorithm performance.
The experiments use three public image data sets: the natural scene image library shown in Fig. 3, the NUS-WIDE image library shown in Fig. 4, and the IAPRTC-12 image library shown in Fig. 5. The three image libraries are described below:
The natural scene image library contains 2000 images, all annotated with the following 5 labels: desert, mountain, sea, sunset and trees. More than 20% of the images carry more than one label, and the average number of labels per image is 1.3. Fig. 3 shows two example images from the natural scene image library: Fig. 3(a) is labeled sunset and sea, and Fig. 3(b) is labeled mountain and trees.
The NUS-WIDE image library contains 30,000 images annotated with 31 labels, including boat, car, flag, horse, sky, sun, tower, airplane and zebra. Fig. 4 shows two images from the NUS-WIDE image library: the labels of Fig. 4(a) include sky and airplane, and the labels of Fig. 4(b) include sea and sunset.
The IAPRTC-12 image database contains 20,000 images and 291 labels, with an average of 5.7 labels per image. Fig. 5 shows two example images from the IAPRTC-12 image database: the labels of Fig. 5(a) include brown, face, hair, man and woman, and the labels of Fig. 5(b) include boat, lake, sky and trees.
Step 2: the visual features characterizing the images and the learned optimal parameters.
Feature selection has a large impact on system performance. The invention chooses the following global and local feature descriptors to characterize the images:
Global features: (1) a 128-dimensional HSV color histogram and 225-dimensional LAB color moments; (2) a 37-dimensional edge orientation histogram; (3) a 36-dimensional pyramid wavelet texture; (4) a 59-dimensional local binary pattern descriptor; (5) a 960-dimensional GIST descriptor.
Local features: local texture features are extracted with two different sampling methods and three different local descriptors, as follows. First, dense sampling and Harris corner detection are performed. Then, SIFT, C-SIFT and RGB-SIFT features are extracted, and a 1000-entry codebook is built by k-means clustering. Next, a second-order spatial pyramid is used to build a 5000-dimensional vector for each image. Finally, TF-IDF weighting is used to generate the final visual bag of words. Throughout the experiments, all feature vectors are normalized to the range [0, 1].
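The last two stages of this pipeline, TF-IDF weighting of the visual-word histograms and normalization to [0, 1], can be sketched as follows. This is an illustration; the exact TF and IDF variants used are not specified in the description, so both are assumptions.

```python
import math

def tfidf(counts):
    # TF-IDF weighting of per-image visual-word count vectors (bag of
    # visual words). TF is the within-image frequency; IDF uses a smoothed
    # document frequency. Both variants are assumptions.
    n_docs, n_words = len(counts), len(counts[0])
    df = [sum(1 for doc in counts if doc[w] > 0) for w in range(n_words)]
    out = []
    for doc in counts:
        total = sum(doc) or 1
        out.append([(c / total) * math.log(n_docs / (1.0 + df[w]))
                    for w, c in enumerate(doc)])
    return out

def minmax_01(v):
    # Normalize a feature vector to the range [0, 1], as in the experiments.
    lo, hi = min(v), max(v)
    return [0.0 for _ in v] if hi == lo else [(x - lo) / (hi - lo) for x in v]

print(minmax_01([2, 4, 6]))
```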
For each query-label pair, the three similarity measures given in formula (2) above are used, and the parameter μ is selected by cross-validation. After cross-validation, μ = 0.18 for the cosine similarity measure; μ = 1 for the linear similarity measure; and σ = 2, μ = 0.18 for the RBF similarity measure.
Step 3: testing the performance of the proposed algorithm through comparative experiments.
Algorithm comparison
The comparative experiments are carried out among the following three image annotation methods:
Lazy-learning baseline: first, for each test image, find the K most similar images in the training image library; then, aggregate the label statistics of those K most similar images; finally, assign labels to the test image according to the maximum a posteriori probability.
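This lazy-learning (K-nearest-neighbor) baseline can be sketched as follows. The sketch uses hypothetical data; Euclidean distance and simple label counting stand in for whatever similarity measure and aggregation the compared method actually uses.

```python
import math
from collections import Counter

def knn_annotate(test_feat, train_feats, train_labels, k=3, n_out=2):
    # Find the K most similar training images (Euclidean distance here, an
    # assumption), pool their labels, and assign the most frequent ones.
    ranked = sorted(
        (math.dist(test_feat, f), labels)
        for f, labels in zip(train_feats, train_labels)
    )
    pool = Counter()
    for _, labels in ranked[:k]:
        pool.update(labels)
    return [label for label, _ in pool.most_common(n_out)]

train_feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]          # toy features
train_labels = [["sea"], ["sea", "sunset"], ["mountain"]]   # toy labels
print(knn_annotate([0.02, 0.0], train_feats, train_labels, k=2))
```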
Deep representation and coding baseline: a hierarchical model is used to learn pixel-level image representations for image annotation.
The proposed method: image annotation via the deep neural network described above.
Modality weights
In the proposed method, the combination weights α of the different modalities have a large impact on system performance. Fig. 6 shows the resulting modality combination weights on the three public image libraries: Fig. 6(a) on the natural scene image library; Fig. 6(b) on the NUS-WIDE image library; Fig. 6(c) on the IAPRTC-12 image library.
As the results in Fig. 6 show, the proportions of the different modalities do not differ significantly. This means that every modality contributes, more or less, to classifying different images, mainly because the three image libraries contain natural scene images of many different classes; it also further demonstrates the importance of obtaining the optimal combination of the different modalities.
Performance comparison
Table 1 gives the experimental comparison results of the several image annotation techniques described above.
Table 1: Experimental comparison results.
As the results in Table 1 show, the NDCG@w performance of the proposed method is better than that of the other two existing methods, which verifies the effectiveness of image annotation based on a deep neural network learning model.

Claims (5)

1. An image annotation method based on multi-modal deep learning, characterized in that the method comprises the following steps:
Step 1: pre-train the node weights of a deep neural network using an unlabeled image sample set;
Step 2: optimize the weights of each single modality using the back-propagation algorithm;
Step 3: optimize the weights of the modality combination using an online exponentiated gradient algorithm.
2. The image annotation method based on multi-modal deep learning according to claim 1, characterized in that the deep neural network of step 1 is an eight-layer convolutional neural network, in which the first five layers are convolutional layers and the remaining three are fully connected layers; the output of the fully connected layers feeds a Softmax classifier that produces 1000 label classes; both the pre-training and the fine-tuning stages use a multinomial logistic regression objective;
the first, second and fifth convolutional layers are followed by normalization layers; to preserve invariance, all normalization layers use max pooling; all convolutional and fully connected layers use rectified linear units (ReLU) as the nonlinear activation function;
in the convolutional network used, all input images are resized to 256 × 256; the first two convolutional filters are set to 7 × 7 and 5 × 5 respectively, with a stride of 2, where filters of this size capture information across all frequency bands and the small stride avoids producing "dead features" that would harm the next layer; the remaining three convolutional layers are connected in sequence, with 3 × 3 filters and a stride of 1; finally, each fully connected layer has an output size of 4096, and during pre-training the dropout rate of the first two fully connected layers is set to 0.6.
3. The image annotation method based on multi-modal deep learning according to claim 1, characterized in that the back-propagation algorithm of step 2 comprises:
1. Single-modality pre-training:
The convolutional neural network is pre-trained on an unlabeled training set to obtain intermediate representations of image objects and, at the same time, to initialize the network, comprising: first, using contrastive divergence to train the node weights W_1 between the input layer and the first convolutional layer; then, using the conditional probability of the first convolutional layer's nodes as the input of the second convolutional layer:
p(Γ | x_j) = S(W_1, x_j)    (1)
where x_j is the j-th feature vector, Γ is the annotation information, and S(·) is one of the similarity functions below:
S(W_1, x_j) = W_1^T x_j / (||W_1|| ||x_j||)    (cosine function)
S(W_1, x_j) = W_1^T x_j    (linear function)
S(W_1, x_j) = exp(-||W_1 - x_j||^2 / (2σ))    (RBF function)    (2)
Next, the first and second convolutional layers are combined to train the node weights W_2; the same method is used to train the node weights of the remaining three convolutional layers and the three fully connected layers;
2. Single-modality fine-tuning stage:
In the single-modality fine-tuning stage, back-propagation of the annotation error is used to optimize the node weights; from a pattern-recognition perspective, multi-label learning can be regarded as multi-task learning; the overall annotation error of the convolutional neural network can be regarded as the sum of the per-label errors, and the node optimization process is described with the l-th label error, comprising:
First, for an image x, represented as x_j under the j-th feature modality, the probability that it carries the l-th label Γ_l can be expressed by the posterior probability:
p_jl = exp(p(Γ_l | x_j)) / Σ_{k=1}^{L} p(Γ_k | x_j)    (3)
where L denotes the number of labels;
Then, the KL divergence between the predicted probabilities and the reference probabilities is minimized; assuming each image has multiple labels, represented by a vector y ∈ R^{1×c}, y_l = 1 indicates that the label set of image x contains the l-th label and y_l = 0 indicates that it does not; let q_il denote the ground-truth probability that image x_i carries label l; the error of correctly assigning the l-th label to the images is then:
J_l = -Σ_{i=1}^{M} Σ_{l=1}^{L} [ q_il log(p_il) + (1 - q_il) log(1 - p_il) ]    (4)
The distribution error over all labels is:
J = Σ_{l=1}^{L} J_l    (5)
Finally, back-propagation is used to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers.
4. The image annotation method based on multi-modal deep learning according to claim 1, characterized in that the exponentiated gradient optimization of the weights between different modalities in step 3 comprises:
For the multi-modal deep network, another key task is to learn the optimal combination weights of the N modalities, α = (α_1, α_2, ..., α_n, ..., α_N), where each α_n is initialized to 1/N; the online exponentiated gradient algorithm is adopted to optimize the multi-modal weight combination, comprising:
α_{t+1} = argmin_α KL(α || α_t) + μ h_t(α)    (6)
where KL(·) denotes the KL divergence and h(α) the hinge loss:
D_KL(u || v) = Σ_i u_i ln(u_i / v_i),    h_t(α) = max(0, ψ - α^T S_t)    (7)
where S_t is:
S_t = (S_1(x, Γ+) - S_1(x, Γ-), ..., S_N(x, Γ+) - S_N(x, Γ-))^T    (8)
where label Γ+ is assumed to reflect the image content better than label Γ-;
A first-order Taylor expansion of h(α) at α_t simplifies the optimization problem, so equation (6) can be written in first-order Taylor form:
α_{t+1} = argmin_α KL(α || α_t) + μ [ h_t(α_t) + ∇h_t(α_t)(α - α_t) ]    (9)
If Γ+ and Γ- are not ranked in the correct order, the node weights α are updated automatically.
5. The image annotation method based on multi-modal deep learning according to claim 1, characterized in that the method is applied to convolutional neural networks.
CN201510198325.XA 2015-04-23 2015-04-23 An image annotation method based on multi-modal deep learning Active CN105184303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510198325.XA CN105184303B (en) 2015-04-23 2015-04-23 An image annotation method based on multi-modal deep learning


Publications (2)

Publication Number Publication Date
CN105184303A true CN105184303A (en) 2015-12-23
CN105184303B CN105184303B (en) 2019-08-09

Family

ID=54906369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510198325.XA Active CN105184303B (en) 2015-04-23 2015-04-23 An image annotation method based on multi-modal deep learning

Country Status (1)

Country Link
CN (1) CN105184303B (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN105678340A (en) * 2016-01-20 2016-06-15 福州大学 Automatic image marking method based on enhanced stack type automatic encoder
CN105760859A (en) * 2016-03-22 2016-07-13 中国科学院自动化研究所 Method and device for identifying reticulate pattern face image based on multi-task convolutional neural network
CN105894012A (en) * 2016-03-29 2016-08-24 天津大学 Object identification method based on cascade micro neural network
CN105930877A (en) * 2016-05-31 2016-09-07 上海海洋大学 Multimodal depth learning-based remote sensing image classification method
CN106056602A (en) * 2016-05-27 2016-10-26 中国人民解放军信息工程大学 CNN (convolutional neural network)-based fMRI (functional magnetic resonance imaging) visual function data object extraction method
CN106202338A (en) * 2016-06-30 2016-12-07 合肥工业大学 Image search method based on the many relations of multiple features
CN106682592A (en) * 2016-12-08 2017-05-17 北京泛化智能科技有限公司 Automatic image recognition system and method based on neural network method
CN106845427A (en) * 2017-01-25 2017-06-13 北京深图智服技术有限公司 A kind of method for detecting human face and device based on deep learning
CN107122800A (en) * 2017-04-27 2017-09-01 南京大学 A kind of Robust digital figure mask method based on the screening that predicts the outcome
CN107273784A (en) * 2016-04-01 2017-10-20 富士施乐株式会社 Image steganalysis apparatus and method
CN108307205A (en) * 2017-12-06 2018-07-20 中国电子科技集团公司电子科学研究院 Merge the recognition methods of video expressive force, terminal and the storage medium of audio visual feature
CN108388768A (en) * 2018-02-08 2018-08-10 南京恺尔生物科技有限公司 Utilize the biological nature prediction technique for the neural network model that biological knowledge is built
CN108960015A (en) * 2017-05-24 2018-12-07 优信拍(北京)信息科技有限公司 A kind of vehicle system automatic identifying method and device based on deep learning
CN109196526A (en) * 2016-06-01 2019-01-11 三菱电机株式会社 For generating the method and system of multi-modal digital picture
CN109543835A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109544517A (en) * 2018-11-06 2019-03-29 中山大学附属第医院 Method and system are analysed in multi-modal ultrasound group credit based on deep learning
CN109543833A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109583580A (en) * 2018-11-30 2019-04-05 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109583583A (en) * 2017-09-29 2019-04-05 腾讯科技(深圳)有限公司 Neural network training method, device, computer equipment and readable medium
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
CN109886226A (en) * 2019-02-27 2019-06-14 北京达佳互联信息技术有限公司 Determine method, apparatus, electronic equipment and the storage medium of the characteristic of image
CN110019652A (en) * 2019-03-14 2019-07-16 九江学院 A kind of cross-module state Hash search method based on deep learning
CN111127456A (en) * 2019-12-28 2020-05-08 北京无线电计量测试研究所 Image annotation quality evaluation method
CN111383744A (en) * 2020-06-01 2020-07-07 北京协同创新研究院 Medical microscopic image annotation information processing method and system and image analysis equipment
WO2021046970A1 (en) * 2019-09-11 2021-03-18 山东浪潮人工智能研究院有限公司 Arithmetic coding-based neural network model compression encryption method and system
CN112633394A (en) * 2020-12-29 2021-04-09 厦门市美亚柏科信息股份有限公司 Intelligent user label determination method, terminal equipment and storage medium
CN114170481A (en) * 2022-02-10 2022-03-11 北京字节跳动网络技术有限公司 Method, apparatus, storage medium, and program product for image processing
CN115356363A (en) * 2022-08-01 2022-11-18 河南理工大学 Wide ion beam polishing-scanning electron microscope-based pore structure characterization method
CN116563400A (en) * 2023-07-12 2023-08-08 南通原力云信息技术有限公司 Small program image information compression processing method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120254086A1 (en) * 2011-03-31 2012-10-04 Microsoft Corporation Deep convex network with joint use of nonlinear random projection, restricted boltzmann machine and batch-based parallelizable optimization
CN102902966A (en) * 2012-10-12 2013-01-30 大连理工大学 Super-resolution face recognition method based on deep belief networks
CN103345656A (en) * 2013-07-17 2013-10-09 中国科学院自动化研究所 Method and device for data identification based on multitask deep neural network


Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN105678340A (en) * 2016-01-20 2016-06-15 福州大学 Automatic image marking method based on enhanced stack type automatic encoder
CN105678340B (en) * 2016-01-20 2018-12-25 福州大学 A kind of automatic image marking method based on enhanced stack autocoder
CN105760859B (en) * 2016-03-22 2018-12-21 中国科学院自动化研究所 Reticulate pattern facial image recognition method and device based on multitask convolutional neural networks
CN105760859A (en) * 2016-03-22 2016-07-13 中国科学院自动化研究所 Method and device for identifying reticulate pattern face image based on multi-task convolutional neural network
CN105894012A (en) * 2016-03-29 2016-08-24 天津大学 Object identification method based on cascade micro neural network
CN107273784B (en) * 2016-04-01 2022-04-15 富士胶片商业创新有限公司 Image pattern recognition apparatus and method
CN107273784A (en) * 2016-04-01 2017-10-20 富士施乐株式会社 Image steganalysis apparatus and method
CN106056602B (en) * 2016-05-27 2019-06-28 中国人民解放军信息工程大学 FMRI visual performance datum target extracting method based on CNN
CN106056602A (en) * 2016-05-27 2016-10-26 中国人民解放军信息工程大学 CNN (convolutional neural network)-based fMRI (functional magnetic resonance imaging) visual function data object extraction method
CN105930877B (en) * 2016-05-31 2020-07-10 上海海洋大学 Remote sensing image classification method based on multi-mode deep learning
CN105930877A (en) * 2016-05-31 2016-09-07 上海海洋大学 Multimodal depth learning-based remote sensing image classification method
CN109196526B (en) * 2016-06-01 2021-09-28 三菱电机株式会社 Method and system for generating multi-modal digital images
CN109196526A (en) * 2016-06-01 2019-01-11 三菱电机株式会社 For generating the method and system of multi-modal digital picture
CN106202338A (en) * 2016-06-30 2016-12-07 合肥工业大学 Image retrieval method based on multiple features and multiple relations
CN106202338B (en) * 2016-06-30 2019-04-05 合肥工业大学 Image retrieval method based on multiple features and multiple relations
CN106682592B (en) * 2016-12-08 2023-10-27 北京泛化智能科技有限公司 Automatic image recognition system and method based on neural networks
CN106682592A (en) * 2016-12-08 2017-05-17 北京泛化智能科技有限公司 Automatic image recognition system and method based on neural networks
CN106845427A (en) * 2017-01-25 2017-06-13 北京深图智服技术有限公司 Face detection method and device based on deep learning
CN106845427B (en) * 2017-01-25 2019-12-06 北京深图智服技术有限公司 Face detection method and device based on deep learning
CN107122800A (en) * 2017-04-27 2017-09-01 南京大学 Robust digital image labeling method based on prediction result screening
CN107122800B (en) * 2017-04-27 2020-09-18 南京大学 Robust digital image labeling method based on prediction result screening
CN108960015A (en) * 2017-05-24 2018-12-07 优信拍(北京)信息科技有限公司 Automatic vehicle series identification method and device based on deep learning
CN109583583B (en) * 2017-09-29 2023-04-07 腾讯科技(深圳)有限公司 Neural network training method and device, computer equipment and readable medium
CN109583583A (en) * 2017-09-29 2019-04-05 腾讯科技(深圳)有限公司 Neural network training method, device, computer equipment and readable medium
CN108307205A (en) * 2017-12-06 2018-07-20 中国电子科技集团公司电子科学研究院 Video expressiveness recognition method, terminal, and storage medium fusing audio-visual features
CN108388768A (en) * 2018-02-08 2018-08-10 南京恺尔生物科技有限公司 Biological property prediction method using a neural network model built with biological knowledge
CN109544517A (en) * 2018-11-06 2019-03-29 中山大学附属第医院 Multi-modal ultrasound omics analysis method and system based on deep learning
CN109543833A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109583580A (en) * 2018-11-30 2019-04-05 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109543835A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109543835B (en) * 2018-11-30 2021-06-25 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109583580B (en) * 2018-11-30 2021-08-03 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image description method based on the construction of hierarchical attributed relational graphs
CN109711464B (en) * 2018-12-25 2022-09-27 中山大学 Image description method constructed based on hierarchical feature relationship diagram
CN109886226A (en) * 2019-02-27 2019-06-14 北京达佳互联信息技术有限公司 Method, apparatus, electronic device, and storage medium for determining image feature data
CN110019652B (en) * 2019-03-14 2022-06-03 九江学院 Cross-modal Hash retrieval method based on deep learning
CN110019652A (en) * 2019-03-14 2019-07-16 九江学院 Cross-modal hash retrieval method based on deep learning
WO2021046970A1 (en) * 2019-09-11 2021-03-18 山东浪潮人工智能研究院有限公司 Arithmetic coding-based neural network model compression encryption method and system
CN111127456A (en) * 2019-12-28 2020-05-08 北京无线电计量测试研究所 Image annotation quality evaluation method
CN111383744A (en) * 2020-06-01 2020-07-07 北京协同创新研究院 Medical microscopic image annotation information processing method and system and image analysis equipment
CN112633394A (en) * 2020-12-29 2021-04-09 厦门市美亚柏科信息股份有限公司 Intelligent user label determination method, terminal equipment and storage medium
CN114170481A (en) * 2022-02-10 2022-03-11 北京字节跳动网络技术有限公司 Method, apparatus, storage medium, and program product for image processing
CN115356363B (en) * 2022-08-01 2023-06-20 河南理工大学 Pore structure characterization method based on wide ion beam polishing-scanning electron microscope
CN115356363A (en) * 2022-08-01 2022-11-18 河南理工大学 Wide ion beam polishing-scanning electron microscope-based pore structure characterization method
CN116563400A (en) * 2023-07-12 2023-08-08 南通原力云信息技术有限公司 Small program image information compression processing method
CN116563400B (en) * 2023-07-12 2023-09-05 南通原力云信息技术有限公司 Small program image information compression processing method

Also Published As

Publication number Publication date
CN105184303B (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN105184303A (en) Image marking method based on multi-mode deep learning
Ševo et al. Convolutional neural network based automatic object detection on aerial images
He et al. Remote sensing scene classification using multilayer stacked covariance pooling
Imbriaco et al. Aggregated deep local features for remote sensing image retrieval
Liu et al. Scene classification via triplet networks
Yu et al. A two-stream deep fusion framework for high-resolution aerial scene classification
Zhang et al. Scene classification via a gradient boosting random convolutional network framework
Avila et al. Pooling in image representation: The visual codeword point of view
He et al. Learning and incorporating top-down cues in image segmentation
EP3029606A2 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
Cheng et al. Learning coarse-to-fine sparselets for efficient object detection and scene classification
Qayyum et al. Scene classification for aerial images based on CNN using sparse coding technique
Ali et al. A hybrid geometric spatial image representation for scene classification
CN110321967B (en) Image classification improvement method based on convolutional neural network
Zhu et al. Plant identification based on very deep convolutional neural networks
CN104462494B (en) Remote sensing image retrieval method and system based on unsupervised feature learning
Tobías et al. Convolutional Neural Networks for object recognition on mobile devices: A case study
Holder et al. From on-road to off: Transfer learning within a deep convolutional neural network for segmentation and classification of off-road scenes
Al-Haija et al. Multi-class weather classification using ResNet-18 CNN for autonomous IoT and CPS applications
CN105184298A (en) Image classification method based on fast locality-constrained low-rank coding
Ren et al. Ship recognition based on Hu invariant moments and convolutional neural network for video surveillance
Qayyum et al. Designing deep CNN models based on sparse coding for aerial imagery: a deep-features reduction approach
Ghadi et al. Robust object categorization and Scene classification over remote sensing images via features fusion and fully convolutional network
Alzu'Bi et al. Compact root bilinear cnns for content-based image retrieval
Li et al. Image decomposition with multilabel context: Algorithms and applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151223

Assignee: Zhangjiagang Institute of Zhangjiagang

Assignor: Nanjing Post & Telecommunication Univ.

Contract record no.: X2019980001251

Denomination of invention: Image marking method based on multi-mode deep learning

Granted publication date: 20190809

License type: Common License

Record date: 20191224