CN104572940A - Automatic image annotation method based on deep learning and canonical correlation analysis - Google Patents
- Publication number
- CN104572940A CN104572940A CN201410843484.6A CN201410843484A CN104572940A CN 104572940 A CN104572940 A CN 104572940A CN 201410843484 A CN201410843484 A CN 201410843484A CN 104572940 A CN104572940 A CN 104572940A
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- depth
- dbm
- boltzmann machine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic image annotation method based on deep learning and canonical correlation analysis. The method includes: using deep Boltzmann machines to extract the high-level feature vectors of images and annotation words, fitting the annotation-word samples with a multiple-Bernoulli distribution and the image features with a Gaussian distribution; performing canonical correlation analysis on the high-level features of the images and the annotation words; computing the Mahalanobis distance between the image to be annotated and the training-set images in the canonical variable space, and weighting by distance to obtain a high-level annotation-word feature; and generating the image's annotation words through mean-field estimation. The deep Boltzmann machines comprise an I-DBM and a T-DBM, used to extract the high-level feature vectors of the images and of the annotation words respectively. Each of the I-DBM and the T-DBM comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer. The method effectively bridges the "semantic gap" in image semantic annotation and improves annotation accuracy.
Description
Technical field
The present invention relates to automatic image annotation and retrieval techniques, and in particular to an automatic image annotation method based on deep learning and canonical correlation analysis.
Background technology
As image data grows geometrically, effectively managing and retrieving these data has become a research hotspot in informatization. Although content-based image retrieval (CBIR) has made significant progress, with many prototypes, techniques and retrieval products available, its central problem, the "semantic gap", has not been fundamentally overcome, so retrieval effectiveness and usability remain unsatisfactory. The best way to overcome this is to attach text semantics related to the image content, i.e., image annotation. Since manual annotation is highly subjective and inefficient, automatic image annotation has gradually become a research hotspot in the image annotation field.
The first mature deep learning model was the deep belief network proposed by Hinton et al. in 2006, which achieves abstract representations of data through a multi-layer feature extraction mechanism. As powerful probabilistic generative models, deep learning models have since developed into various forms such as the deep Boltzmann machine and the deep autoencoder, and have been successfully applied to fields such as speech recognition, network situation awareness and high-dimensional time-series modelling. In image processing, Google Brain achieved great success using deep neural networks for image recognition, partially simulating human brain function; in large-scale object recognition, a five-layer convolutional network based on a deep learning model obtained the highest accuracy in the 2012 ImageNet evaluation; in image annotation and classification, Srivastava et al. likewise achieved good results by building a multimodal deep Boltzmann machine. Ranked first among the ten breakthrough technologies of 2013, deep learning has demonstrated strong vitality and great potential in machine learning.
At present, deep learning models have achieved good results in generating annotation words for images. The multimodal deep Boltzmann machine handles the multimodal learning problem of images and text well and has been applied to image retrieval and annotation. Experimental results show that it outperforms other deep learning models, but a gap remains compared with classical automatic image annotation algorithms, because its word model and its top-level feature fusion mechanism are not well suited to the annotation task. Addressing these two problems, and drawing on the ideas of classical annotation algorithms, the invention proposes an automatic image annotation method based on deep Boltzmann machines and canonical correlation analysis. By adopting deep Boltzmann machine models that better handle image features and generate more abstract semantic concepts, combined with canonical correlation analysis, the designed annotation model can effectively improve the management and retrieval efficiency of large-scale image collections and speed up image-information processing, giving it good application prospects as well as practical and economic value.
Summary of the invention
To address the deficiencies of the prior art, the invention provides an automatic image annotation method based on deep learning and canonical correlation analysis that can overcome the "semantic gap" problem of image semantic annotation and achieve relatively accurate semantic tagging.
An automatic image annotation method based on deep learning and canonical correlation analysis, comprising:
(1) building a model training data set;
(2) extracting the low-level feature vectors of the image to be annotated and assembling them into the image's visual feature vector;
(3) inputting the visual feature vector into the trained deep Boltzmann machine model I-DBM to obtain the image's high-level feature vector;
(4) projecting the image high-level feature into the established canonical variable space, finding the adjacent images of the model training data set, and generating an annotation-word high-level feature vector;
(5) inputting the annotation-word high-level feature vector into the trained deep Boltzmann machine model T-DBM to obtain the corresponding annotation words.
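Steps (2) through (5) can be sketched as follows. The callables and arrays (`i_dbm_encode`, `proj_img`, `t_dbm_decode`, the training-set matrices) are hypothetical stand-ins for the trained I-DBM, the learned CCA projection and the trained T-DBM; none of these names come from the patent, and a plain Euclidean distance stands in for the Mahalanobis distance used later in the description.

```python
import numpy as np

def annotate_image(visual_vec, i_dbm_encode, proj_img,
                   train_canon, train_tag_hidden, t_dbm_decode, n=5):
    """Sketch of steps (2)-(5); all model components are stand-ins."""
    h_img = i_dbm_encode(visual_vec)              # step (3): high-level image feature
    z = h_img @ proj_img                          # step (4): project into canonical space
    d = np.linalg.norm(train_canon - z, axis=1)   # Euclidean stand-in for Mahalanobis
    idx = np.argsort(d)[:n]                       # n nearest training images
    w = 1.0 / (d[idx] + 1e-8)                     # closer images get larger weight
    w = w / w.sum()
    h_tag = w @ train_tag_hidden[idx]             # distance-weighted tag-side hidden state
    return t_dbm_decode(h_tag)                    # step (5): decode annotation words
```

With real models, `i_dbm_encode` would return the I-DBM second-hidden-layer state and `t_dbm_decode` would run the T-DBM top-down to produce the annotation-word vector.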
The model training data set of step (1) is obtained by the following steps:
(S11) creating an annotation dictionary containing a number of text annotation words;
(S12) selecting annotated images of the corresponding classes according to the annotation dictionary as the model training data set.
The trained deep Boltzmann machine I-DBM of step (3) is obtained by the following steps:
(S31) extracting the low-level feature vector of every image in the training data set to form the image's visual feature vector, and determining each image's annotation-word feature vector from the annotation dictionary and the image's annotation words;
(S32) constructing the deep Boltzmann machine model I-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S33) training the model with the visual feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model.
The established canonical variable space of step (4) is obtained by the following steps:
(S41) extracting the I-DBM high-level feature vectors of all images in the training data set;
(S42) extracting the T-DBM high-level feature vectors of the annotation words of all images in the training set;
(S43) performing canonical correlation analysis on the I-DBM and T-DBM high-level feature vectors to obtain the projection matrices.
The trained deep Boltzmann machine T-DBM of steps (5) and (S42) is obtained by the following steps:
(S51) determining each image's annotation-word feature vector from the annotation dictionary and the image's annotation words;
(S52) constructing the deep Boltzmann machine model T-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S53) training the model with the annotation-word feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model.
The automatic image annotation method based on deep learning and canonical correlation analysis of the invention first extracts the low-level features of the image to be annotated and assembles them into the image's visual feature vector. This vector is fed directly into the visible layer of the deep Boltzmann machine model I-DBM, and the state of the I-DBM's second hidden-unit layer is taken as the high-level feature vector. The high-level feature is projected into the canonical variable space, where the top N training images with the smallest Mahalanobis distance are found; a new second-hidden-layer state for the deep Boltzmann machine T-DBM is generated by distance weighting, and finally the T-DBM generates the new annotation-word vector as the image's annotation words.
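The Mahalanobis-distance weighting described above can be sketched as follows. Estimating the covariance from the training-set canonical vectors themselves is an assumption, since the patent does not specify the covariance estimator.

```python
import numpy as np

def mahalanobis_neighbours(query, refs, n=5):
    """Return the indices and normalised distance weights of the n
    training images nearest to `query` in canonical-variable space."""
    cov = np.cov(refs, rowvar=False)
    vi = np.linalg.pinv(cov)                 # pseudo-inverse guards against a singular covariance
    diff = refs - query
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, vi, diff))  # per-row Mahalanobis distance
    idx = np.argsort(d)[:n]
    w = 1.0 / (d[idx] + 1e-8)                # closer images get larger weight
    return idx, w / w.sum()
```

The weighted T-DBM hidden state is then `weights @ train_tag_hidden[idx]`, as in the pipeline sketch earlier.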
In a deep Boltzmann machine, high-level semantics are obtained by abstracting low-level features; since low-level features are difficult to map directly to high-level semantics, a "semantic gap" arises. Because too many hidden-unit layers make training excessively slow in practice, the deep Boltzmann machine models used in the invention contain two hidden-unit layers (the first hidden-unit layer and the second hidden-unit layer). The two hidden-unit layers give the deep Boltzmann machine sufficient intermediate abstraction capability to cross the "semantic gap" in image semantic annotation and improve annotation accuracy.
The text feature vector in step (S51) is a 0-1 vector (every element of the vector is either 0 or 1), and the annotation-word feature vector of each image is determined by the following steps:
(S51-1) initialising an all-zero vector in which each dimension corresponds to one annotation word;
(S51-2) setting the element of the corresponding dimension to 1 for each of the image's annotation words, which yields the image's annotation-word vector.
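Steps (S51-1) and (S51-2) amount to the following; the example dictionary and words are illustrative only, not taken from the patent's data set.

```python
import numpy as np

def annotation_vector(image_words, dictionary):
    """Build the 0-1 annotation-word vector of one image: one dimension
    per dictionary word, set to 1 iff that word annotates the image."""
    vec = np.zeros(len(dictionary), dtype=int)        # (S51-1) all-zero initialisation
    index = {word: i for i, word in enumerate(dictionary)}
    for word in image_words:
        vec[index[word]] = 1                          # (S51-2) set matching dimensions
    return vec
```

For example, `annotation_vector(["sky", "sea"], ["sky", "tree", "sea"])` yields `[1, 0, 1]`.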
In each deep Boltzmann machine model built in these steps, no two nodes within a layer are connected, and every pair of nodes in adjacent layers is bidirectionally connected.
In the automatic image annotation method based on deep learning and canonical correlation analysis, the deep Boltzmann machine models of steps (S33) and (S53) are trained as follows:
(S53-1) taking the visual feature vector or the annotation-word feature vector as the visible layer;
(S53-2) treating the visible layer and the first hidden-unit layer as a restricted Boltzmann machine, taking the feature vector as the input of the visible layer, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the visible layer and the first hidden-unit layer together with the final state of the first hidden-unit layer;
(S53-3) treating the first and second hidden-unit layers as a restricted Boltzmann machine, taking the final state of the first hidden-unit layer as its input, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the first and second hidden-unit layers together with the final state of the second hidden-unit layer.
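The layer-wise pretraining above relies on the contrastive divergence algorithm; a minimal one-step (CD-1) sketch for a binary restricted Boltzmann machine is shown below. The learning rate, epoch count and bias handling are illustrative assumptions, not values from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, lr=0.05, epochs=50, seed=0):
    """One-step contrastive divergence for a binary RBM. Returns the
    learned parameters and the final hidden-layer state of the training
    data, which feeds the next RBM in the stack (as in (S53-3))."""
    rng = np.random.default_rng(seed)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b_h)                       # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
        p_v1 = sigmoid(h0 @ W.T + b_v)                       # one-step reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_h)                       # negative phase
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_v += lr * (data - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h, sigmoid(data @ W + b_h)              # final hidden state
```

Stacking two such RBMs (visible to first hidden layer, then first to second hidden layer) reproduces the two-stage training order of (S53-2)/(S53-3).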
The canonical correlation analysis of step (S43) proceeds as follows:
(S43-1) standardising the I-DBM and T-DBM high-level feature vectors and computing the covariance matrices;
(S43-2) computing the eigenvalues and eigenvectors of the covariance matrix, and checking while sorting whether any eigenvalues are equal;
(S43-3) sorting the eigenvalues in descending order and sorting the eigenvectors in the same order;
(S43-4) taking the eigenvectors as the row vectors of a matrix to obtain the canonical correlation analysis result.
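A numerical sketch of (S43-1) through (S43-4), solving the standard CCA eigenproblem built from the (cross-)covariance matrices; the ridge term `eps` is an added numerical safeguard, not something stated in the patent.

```python
import numpy as np

def cca(X, Y, k=2, eps=1e-6):
    """Return k-column projection matrices (Wx, Wy) such that X @ Wx and
    Y @ Wy are maximally correlated pairs of canonical variables."""
    X = (X - X.mean(0)) / (X.std(0) + eps)      # (S43-1) standardise
    Y = (Y - Y.mean(0)) / (Y.std(0) + eps)
    n = len(X)
    Sxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(M)               # (S43-2) eigen-decomposition
    order = np.argsort(-vals.real)              # (S43-3) sort eigenvalues descending
    Wx = vecs[:, order[:k]].real                # (S43-4) assemble projection matrix
    Wy = np.linalg.solve(Syy, Sxy.T) @ Wx       # paired projection for the Y side
    return Wx, Wy
```

In the method, `X` would hold the I-DBM high-level features and `Y` the T-DBM high-level features of the training images.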
The number of nodes in the I-DBM visible layer equals the dimension of the visual feature vector.
During both recognition and training, the visual feature vector is the input of the I-DBM visible layer, so each visible-layer node must correspond to one dimension of the visual feature vector; hence the number of visible-layer nodes equals the dimension of the visual feature vector.
The number of nodes in the T-DBM visible layer equals the number of words in the annotation dictionary.
During both recognition and training, the image's annotation-word vector is the input of the T-DBM visible layer, so each visible-layer node must correspond to one word in the annotation dictionary; hence the number of visible-layer nodes equals the number of dictionary words.
The numbers of nodes in the first and second hidden-unit layers of the I-DBM are set empirically, typically 400 to 500, and can be tuned experimentally in practice.
The low-level image feature vector comprises a color layout descriptor, a color structure descriptor, a scalable color descriptor, an edge histogram descriptor, a GIST feature vector and a SIFT-based visual bag-of-words vector.
In the automatic image annotation method based on deep learning and canonical correlation analysis, the SIFT-based visual bag-of-words vector is obtained by the following steps:
(a) computing the SIFT feature vectors of all images in the model training data set;
(b) clustering all SIFT feature vectors to obtain 500 cluster centres;
(c) taking each cluster centre as a visual word, and counting the occurrences of each visual word among every image's SIFT feature vectors to form the SIFT-based visual bag-of-words vector.
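Steps (a) through (c) can be sketched with a plain k-means over pooled descriptors. Real SIFT descriptors are 128-dimensional and the patent uses 500 centres; the small stand-in sizes below are illustrative only, and random vectors stand in for actual SIFT output.

```python
import numpy as np

def build_vocabulary(all_descs, k, iters=20, seed=0):
    """Step (b): cluster the pooled SIFT descriptors with k-means;
    each of the k cluster centres becomes one visual word."""
    rng = np.random.default_rng(seed)
    centres = all_descs[rng.choice(len(all_descs), k, replace=False)].copy()
    for _ in range(iters):
        # assign every descriptor to its nearest centre, then recompute centres
        labels = np.argmin(((all_descs[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = all_descs[labels == j].mean(0)
    return centres

def bow_vector(image_descs, centres):
    """Step (c): count, for one image, how often each visual word is the
    nearest centre among that image's descriptors."""
    labels = np.argmin(((image_descs[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    return np.bincount(labels, minlength=len(centres))
```

The resulting bag-of-words vector has one dimension per cluster centre and its elements sum to the number of descriptors extracted from the image.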
Embodiment
The present invention is described in further detail below with reference to a concrete example.
An automatic image annotation method based on deep learning and canonical correlation analysis, comprising:
(1) extracting the low-level feature vectors of the image to be annotated and assembling them into the image's visual feature vector;
In this embodiment, the low-level feature vectors comprise a color layout descriptor, a color structure descriptor, a scalable color descriptor, an edge histogram descriptor, a GIST feature vector and a SIFT-based visual bag-of-words vector.
The SIFT-based visual bag-of-words vector is extracted by the following steps:
(a) computing the SIFT feature vectors of all images in the model training data set;
(b) clustering all SIFT feature vectors to obtain 500 cluster centres;
(c) taking each cluster centre as a visual word, and counting the occurrences of each visual word among every image's SIFT feature vectors to form the image's SIFT-based visual bag-of-words vector; the dimension of this vector equals 500 (the number of cluster centres), and each element is the number of times the corresponding visual word occurs among all SIFT feature vectors of the image.
(2) inputting the visual feature vector of the image to be annotated into the trained deep Boltzmann machine model I-DBM to obtain the image's high-level feature vector;
The trained deep Boltzmann machine model used in step (2) of this example is obtained by the following steps:
(S21) extracting the low-level feature vector of every image in the training data set to form the image's visual feature vector;
(S22) constructing the deep Boltzmann machine model I-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S23) training the model with the visual feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model.
(3) projecting the image high-level feature into the established canonical variable space, finding the adjacent images of the model training data set, and generating an annotation-word high-level feature vector;
The canonical variable space used in step (3) of this example is obtained by the following steps:
(S31) extracting the I-DBM high-level feature vectors of all images in the training data set;
(S32) extracting the T-DBM high-level feature vectors of the annotation words of all images in the training set;
(S33) performing canonical correlation analysis on the I-DBM and T-DBM high-level feature vectors to obtain the projection matrices.
(4) inputting the annotation-word high-level feature vector into the trained deep Boltzmann machine model T-DBM to obtain the corresponding annotation words.
The trained deep Boltzmann machine model T-DBM used in step (4) of this example is obtained by the following steps:
(S41) determining each image's annotation-word feature vector from the annotation dictionary and the image's annotation words;
(S42) constructing the deep Boltzmann machine model T-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S43) training the model with the annotation-word feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model.
In this example, the canonical correlation analysis of step (S33) proceeds as follows:
(S33-1) standardising the I-DBM and T-DBM high-level feature vectors and computing the covariance matrices;
(S33-2) computing the eigenvalues and eigenvectors of the covariance matrix, and checking while sorting whether any eigenvalues are equal;
(S33-3) sorting the eigenvalues in descending order and sorting the eigenvectors in the same order;
(S33-4) taking the eigenvectors as the row vectors of a matrix to obtain the canonical correlation analysis result.
The number of I-DBM visible-layer nodes equals the dimension of the visual feature vector, i.e. 990.
The number of T-DBM visible-layer nodes equals the number of words in the annotation dictionary, i.e. 260.
The first and second hidden-unit layers of the I-DBM each contain 400 nodes.
The first and second hidden-unit layers of the T-DBM each contain 200 nodes.
The trained deep Boltzmann machine models (I-DBM and T-DBM) are obtained by the following concrete training process:
(S2-1) taking the visual feature vector or the annotation-word feature vector as the visible layer;
(S2-2) treating the visible layer and the first hidden-unit layer as a restricted Boltzmann machine, taking the feature vector as the input of the visible layer, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the visible layer and the first hidden-unit layer together with the final state of the first hidden-unit layer;
(S2-3) treating the first and second hidden-unit layers as a restricted Boltzmann machine, taking the final state of the first hidden-unit layer as its input, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the first and second hidden-unit layers together with the final state of the second hidden-unit layer.
The above is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited thereto; any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the invention shall be covered by the scope of protection of the invention.
Claims (7)
1. An automatic image annotation method based on deep learning and canonical correlation analysis, characterised by comprising:
(1) building a model training data set;
(2) extracting the low-level feature vectors of the image to be annotated and assembling them into the image's visual feature vector;
(3) inputting the visual feature vector into the trained deep Boltzmann machine model I-DBM to obtain the image's high-level feature vector;
(4) projecting the image high-level feature into the established canonical variable space, finding the adjacent images of the model training data set, and generating an annotation-word high-level feature vector;
(5) inputting the annotation-word high-level feature vector into the trained deep Boltzmann machine model T-DBM to obtain the corresponding annotation words;
wherein the model training data set of step (1) is obtained by the following steps:
(S11) creating an annotation dictionary containing a number of text annotation words;
(S12) selecting annotated images of the corresponding classes according to the annotation dictionary as the model training data set;
the trained deep Boltzmann machine I-DBM of step (3) is obtained by the following steps:
(S31) extracting the low-level feature vector of every image in the training data set to form the image's visual feature vector, and determining each image's annotation-word feature vector from the annotation dictionary and the image's annotation words;
(S32) constructing the deep Boltzmann machine model I-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S33) training the model with the visual feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model;
the established canonical variable space of step (4) is obtained by the following steps:
(S41) extracting the I-DBM high-level feature vectors of all images in the training data set;
(S42) extracting the T-DBM high-level feature vectors of the annotation words of all images in the training set;
(S43) performing canonical correlation analysis on the I-DBM and T-DBM high-level feature vectors to obtain the projection matrices;
and the trained deep Boltzmann machine T-DBM of steps (5) and (S42) is obtained by the following steps:
(S51) determining each image's annotation-word feature vector from the annotation dictionary and the image's annotation words;
(S52) constructing the deep Boltzmann machine model T-DBM, which comprises, from bottom to top, a visible layer, a first hidden-unit layer and a second hidden-unit layer, with no connections between any two nodes within a layer and bidirectional connections between any two nodes in adjacent layers;
(S53) training the model with the annotation-word feature vectors of all images in the model training data set to obtain the trained deep Boltzmann machine model.
2. The automatic image annotation method based on deep learning and canonical correlation analysis of claim 1, characterised in that the deep Boltzmann machine models of steps (S33) and (S53) are trained as follows:
(S2-1) taking the visual feature vector or the annotation-word feature vector as the visible layer;
(S2-2) treating the visible layer and the first hidden-unit layer as a restricted Boltzmann machine, taking the feature vector as the input of the visible layer, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the visible layer and the first hidden-unit layer together with the final state of the first hidden-unit layer;
(S2-3) treating the first and second hidden-unit layers as a restricted Boltzmann machine, taking the final state of the first hidden-unit layer as its input, and training this restricted Boltzmann machine with the contrastive divergence algorithm to obtain the connection weights between the first and second hidden-unit layers together with the final state of the second hidden-unit layer.
3. The automatic image annotation method based on deep learning and canonical correlation analysis of claim 2, characterised in that the number of nodes of the I-DBM visible layer equals the dimension of the visual feature vector.
4. The automatic image annotation method based on deep learning and canonical correlation analysis of claim 3, characterised in that the number of nodes of the T-DBM visible layer equals the dimension of the text feature vector.
5. The automatic image annotation method based on deep learning and canonical correlation analysis of claim 4, characterised in that the canonical correlation analysis process of step (S43) is as follows:
(S5-1) standardising the I-DBM and T-DBM high-level feature vectors and computing the covariance matrices;
(S5-2) computing the eigenvalues and eigenvectors of the covariance matrix, and checking while sorting whether any eigenvalues are equal;
(S5-3) sorting the eigenvalues in descending order and sorting the eigenvectors in the same order;
(S5-4) taking the eigenvectors as the row vectors of a matrix to obtain the canonical correlation analysis result.
6. The automatic image annotation method based on deep learning and canonical correlation analysis of any one of claims 1 to 5, characterised in that the low-level image feature vector comprises a color layout descriptor, a color structure descriptor, a scalable color descriptor, an edge histogram descriptor, a GIST feature vector and a SIFT-based visual bag-of-words vector.
7. The automatic image annotation method based on deep learning and canonical correlation analysis of claim 6, characterised in that the SIFT-based visual bag-of-words vector is obtained by the following steps:
(a) computing the SIFT feature vectors of all images in the model training data set;
(b) clustering all SIFT feature vectors to obtain 500 cluster centres;
(c) taking each cluster centre as a visual word, and counting the occurrences of each visual word among every image's SIFT feature vectors to form the SIFT-based visual bag-of-words vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410843484.6A CN104572940B (en) | 2014-12-30 | 2014-12-30 | Automatic image annotation method based on deep learning and canonical correlation analysis
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410843484.6A CN104572940B (en) | 2014-12-30 | 2014-12-30 | Automatic image annotation method based on deep learning and canonical correlation analysis
Publications (2)
Publication Number | Publication Date |
---|---|
CN104572940A true CN104572940A (en) | 2015-04-29 |
CN104572940B CN104572940B (en) | 2017-11-21 |
Family
ID=53089002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410843484.6A Active CN104572940B (en) | 2014-12-30 | 2014-12-30 | Automatic image annotation method based on deep learning and canonical correlation analysis
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572940B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120155774A1 (en) * | 2008-05-30 | 2012-06-21 | Microsoft Corporation | Statistical Approach to Large-scale Image Annotation |
CN103823845A (en) * | 2014-01-28 | 2014-05-28 | 浙江大学 | Method for automatically annotating remote sensing images on basis of deep learning |
CN104021224A (en) * | 2014-06-25 | 2014-09-03 | 中国科学院自动化研究所 | Image labeling method based on layer-by-layer label fusing deep network |
Non-Patent Citations (2)
Title |
---|
NITISH SRIVASTAVA ET AL.: "Multimodal Learning with Deep Boltzmann Machines", Journal of Machine Learning Research 15 (2014) * |
LI Jing: "Research on Image Annotation Based on Multiple Features", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389326A (en) * | 2015-09-16 | 2016-03-09 | 中国科学院计算技术研究所 | Image annotation method based on weak matching probability canonical correlation model |
CN105389326B (en) * | 2015-09-16 | 2018-08-31 | 中国科学院计算技术研究所 | Image labeling method based on weak matching probability typical relevancy models |
CN105702250A (en) * | 2016-01-06 | 2016-06-22 | 福建天晴数码有限公司 | Voice recognition method and device |
US9792534B2 (en) | 2016-01-13 | 2017-10-17 | Adobe Systems Incorporated | Semantic natural language vector space |
GB2547068B (en) * | 2016-01-13 | 2019-06-19 | Adobe Inc | Semantic natural language vector space |
GB2547068A (en) * | 2016-01-13 | 2017-08-09 | Adobe Systems Inc | Semantic natural language vector space |
CN107066464A (en) * | 2016-01-13 | 2017-08-18 | 奥多比公司 | Semantic Natural Language Vector Space |
CN107066464B (en) * | 2016-01-13 | 2022-12-27 | 奥多比公司 | Semantic natural language vector space |
US9811765B2 (en) | 2016-01-13 | 2017-11-07 | Adobe Systems Incorporated | Image captioning with weak supervision |
CN105741832A (en) * | 2016-01-27 | 2016-07-06 | 广东外语外贸大学 | Spoken language evaluation method based on deep learning and spoken language evaluation system |
CN105808752A (en) * | 2016-03-10 | 2016-07-27 | 大连理工大学 | CCA and 2PKNN based automatic image annotation method |
CN105808752B (en) * | 2016-03-10 | 2018-04-10 | 大连理工大学 | A kind of automatic image marking method based on CCA and 2PKNN |
CN107292322A (en) * | 2016-03-31 | 2017-10-24 | 华为技术有限公司 | A kind of image classification method, deep learning model and computer system |
CN106250915A (en) * | 2016-07-22 | 2016-12-21 | 福州大学 | A kind of automatic image marking method merging depth characteristic and semantic neighborhood |
CN106250915B (en) * | 2016-07-22 | 2019-08-09 | 福州大学 | A kind of automatic image marking method of fusion depth characteristic and semantic neighborhood |
CN108628926A (en) * | 2017-03-20 | 2018-10-09 | 奥多比公司 | Topic association and marking for fine and close image |
CN107169051A (en) * | 2017-04-26 | 2017-09-15 | 山东师范大学 | Based on semantic related method for searching three-dimension model and system between body |
CN107169051B (en) * | 2017-04-26 | 2019-09-24 | 山东师范大学 | Based on relevant method for searching three-dimension model semantic between ontology and system |
CN107194437B (en) * | 2017-06-22 | 2020-04-07 | 重庆大学 | Image classification method based on Gist feature extraction and concept machine recurrent neural network |
CN107194437A (en) * | 2017-06-22 | 2017-09-22 | 重庆大学 | Image classification method based on Gist feature extractions Yu conceptual machine recurrent neural network |
CN107357927A (en) * | 2017-07-26 | 2017-11-17 | 深圳爱拼信息科技有限公司 | A kind of Document Modeling method |
CN107357927B (en) * | 2017-07-26 | 2020-06-12 | 深圳爱拼信息科技有限公司 | Document modeling method |
CN109833061A (en) * | 2017-11-24 | 2019-06-04 | 无锡祥生医疗科技股份有限公司 | The method of optimization ultrasonic image-forming system parameter based on deep learning |
US11564661B2 (en) | 2017-11-24 | 2023-01-31 | Chison Medical Technologies Co., Ltd. | Method for optimizing ultrasonic imaging system parameter based on deep learning |
CN109493249A (en) * | 2018-11-05 | 2019-03-19 | 北京邮电大学 | A kind of analysis method of electricity consumption data on Multiple Time Scales |
CN109493249B (en) * | 2018-11-05 | 2021-11-12 | 北京邮电大学 | Analysis method of electricity consumption data on multiple time scales |
CN110298386A (en) * | 2019-06-10 | 2019-10-01 | 成都积微物联集团股份有限公司 | A kind of label automation definition method of image content-based |
CN110298386B (en) * | 2019-06-10 | 2023-07-28 | 成都积微物联集团股份有限公司 | Label automatic definition method based on image content |
WO2020248391A1 (en) * | 2019-06-14 | 2020-12-17 | 平安科技(深圳)有限公司 | Case brief classification method and apparatus, computer device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN104572940B (en) | 2017-11-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104572940A (en) | Automatic image annotation method based on deep learning and canonical correlation analysis | |
Bansal et al. | Zero-shot object detection | |
Liu et al. | Sign language recognition with long short-term memory | |
Zhou et al. | Application of deep learning in object detection | |
Alayrac et al. | Unsupervised learning from narrated instruction videos | |
Li et al. | Visual question answering with question representation update (qru) | |
CN106446526B (en) | Electronic health record entity relation extraction method and device | |
CN104217225B (en) | A kind of sensation target detection and mask method | |
CN109934261A (en) | A kind of Knowledge driving parameter transformation model and its few sample learning method | |
CN109376242A (en) | Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks | |
CN105389326B (en) | Image labeling method based on weak matching probability typical relevancy models | |
CN109299657B (en) | Group behavior identification method and device based on semantic attention retention mechanism | |
CN106650694A (en) | Human face recognition method taking convolutional neural network as feature extractor | |
CN106778796A (en) | Human motion recognition method and system based on hybrid cooperative model training | |
Tung et al. | Reward learning from narrated demonstrations | |
CN109271539A (en) | A kind of image automatic annotation method and device based on deep learning | |
CN108765383A (en) | Video presentation method based on depth migration study | |
CN105718532A (en) | Cross-media sequencing method based on multi-depth network structure | |
CN106202030A (en) | A kind of rapid serial mask method based on isomery labeled data and device | |
CN113688894B (en) | Fine granularity image classification method integrating multiple granularity features | |
Liang et al. | An expressive deep model for human action parsing from a single image | |
CN110263165A (en) | A kind of user comment sentiment analysis method based on semi-supervised learning | |
CN109784288B (en) | Pedestrian re-identification method based on discrimination perception fusion | |
Chen et al. | Efficient maximum appearance search for large-scale object detection | |
Kindiroglu et al. | Temporal accumulative features for sign language recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2020-02-14
Address after: Research and Academic Department, No. 188 Erma Road, Zhifu District, Yantai City, Shandong Province, 264001
Patentee after: Naval Aviation University of PLA
Address before: Research Department, No. 188 Erma Road, Zhifu District, Yantai City, 264001
Patentee before: Naval Aeronautical Engineering Institute, PLA