CN108280187B - Hierarchical image retrieval method based on depth features of convolutional neural network - Google Patents

Hierarchical image retrieval method based on depth features of convolutional neural network

Info

Publication number
CN108280187B
CN108280187B (application CN201810066649.1A)
Authority
CN
China
Prior art keywords
similarity
feature
image
retrieval
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810066649.1A
Other languages
Chinese (zh)
Other versions
CN108280187A (en)
Inventor
Yu Li
Han Fangjian
Luo Yiwen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Lansi Intelligent Technology Co.,Ltd.
Original Assignee
Hunan Shunmiao Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Shunmiao Communication Technology Co ltd filed Critical Hunan Shunmiao Communication Technology Co ltd
Priority to CN201810066649.1A
Publication of CN108280187A
Application granted
Publication of CN108280187B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5838 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a hierarchical image retrieval method based on the depth features of a convolutional neural network. First, a convolutional neural network is trained for feature extraction and the network parameters are determined. Then, image features are extracted with the trained network, yielding several convolutional-layer binary features and one fully-connected-layer binary feature. Next, the convolutional-layer binary features are used in a preliminary screening stage: they are further compressed, and multi-feature similarity fusion screens out a candidate image set, narrowing the retrieval range. Finally, precise retrieval is performed on the candidate set with the fully-connected-layer binary feature to obtain the final result. Experimental results on public image retrieval datasets show that, compared with existing image retrieval methods, the method represents images more comprehensively, compresses features more simply and efficiently, and retrieves more accurately; moreover, the hierarchical retrieval scheme spreads out the system's computation, lends itself to parallel acceleration, and has practical value.

Description

Hierarchical image retrieval method based on depth features of convolutional neural network
Technical Field
The invention belongs to the field of image processing technology and information retrieval, and relates to a hierarchical image retrieval method implemented by extracting depth features through a convolutional neural network in deep learning.
Background
With the explosive growth of image data, fast and effective retrieval has become essential for managing massive image collections, and Content-Based Image Retrieval (CBIR) technology has emerged from this practical demand. CBIR matches images by extracting their content information: given a query image from the user, it quickly retrieves images whose content is related and similar from a large-scale database and returns them to the user ranked by similarity.
A conventional CBIR system implements retrieval with manually designed visual features such as color, texture, shape, or the Vector of Locally Aggregated Descriptors (VLAD). These hand-crafted features have clear limitations. On one hand, different features suit different types of images, so generalization is poor on large-scale datasets. On the other hand, they are shallow visual features that capture low-level surface information: they cannot reflect the semantic content of an image, suffer from the "semantic gap" between features and user understanding, and struggle to express image content accurately.
With the rise of deep learning in recent years, the Convolutional Neural Network (CNN) has proved to have great advantages in visual representation: its deep features come closer to the human level of understanding when explaining image content, breaking the limitations of feature expression in traditional methods. Existing CNN-based CBIR systems use the CNN as a feature extractor and represent images with the features of the final fully-connected layer, achieving good retrieval results. However, a CNN is a network with many hidden layers, and the convolutional layers in the middle of the network also have great potential for representing image information; the representation power of some deep convolutional layers even exceeds that of the fully-connected layers. Existing CNN image retrieval methods extract only fully-connected-layer features, neglect the contribution of the convolutional-layer features, and thus fail to fully exploit the CNN's feature information, wasting the image information held in the convolutional layers.
Although CNN features have strong expressive power, they are high-dimensional: for a large-scale database, representing every image directly with the extracted feature vectors requires enormous storage and matching computation, making retrieval requirements hard to meet. Feature compression is therefore essential: redundancy should be removed and the feature dimension compressed as far as possible without losing effective information. Common feature compression methods include Principal Component Analysis (PCA) dimensionality reduction and hash coding schemes such as Locality-Sensitive Hashing (LSH) and Semantic Hashing (SH). These methods are generally designed for one-dimensional feature vectors; when applied to two-dimensional structured features, the features are first flattened into one dimension, which discards part of their structural information, and the compression itself requires additional steps that increase system computation and algorithmic complexity.
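To make the structural-information point concrete, here is a small NumPy sketch (an illustration only, not from the patent): a PCA-style pipeline must first flatten each m × m feature map into a one-dimensional vector, whereas a per-map reduction keeps each two-dimensional map as a unit. The shapes (N = 512, m = 7) follow the embodiment described later; the random data is a stand-in for real activations.

```python
import numpy as np

N, m = 512, 7
fmap_stack = np.maximum(np.random.randn(N, m, m), 0)  # simulated ReLU output: N non-negative m x m maps

# PCA-style pipelines first flatten each 2-D map into a 1-D vector,
# discarding the spatial arrangement before any projection is learned.
flattened = fmap_stack.reshape(N, m * m)              # (512, 49): rows are no longer 2-D maps

# Treating each feature map as a unit (as the invention does) keeps the
# 2-D map intact until it is reduced: binarize, then sum each map.
binary = (fmap_stack > 0).astype(np.uint8)            # standard binary feature maps
compressed = binary.sum(axis=(1, 2))                  # F = [F_1, ..., F_N], one count per map
print(flattened.shape, compressed.shape)              # (512, 49) (512,)
```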
In summary, effective image characterization is the key to the retrieval performance of a CBIR system. Existing CBIR methods that extract features with a CNN improve image representation, but still leave the CNN features underused, apply compression algorithms detached from the feature structure, and incur high complexity.
Disclosure of Invention
The invention provides an image retrieval method based on CNN depth features, aiming to solve the following problems: traditional CBIR systems do not characterize images deeply and their extracted features suffer from the semantic gap, while existing CNN-based CBIR systems underuse the network's features, apply compression algorithms ill-suited to two-dimensional features, and incur high algorithmic complexity.
The technical scheme of the invention is as follows. The image retrieval method adopts a two-stage mechanism: a first-stage preliminary screening retrieval and a second-stage precise retrieval. The method extracts the deep features of an image with a CNN and applies features from different layers to different retrieval stages: binary-like feature maps are taken from the Rectified Linear Unit (ReLU) layers of the convolution modules and the fully-connected module, multi-layer binary feature map vectors are obtained through binary conversion, and the multi-layer features are exploited to the fullest. The binary feature map vectors of the convolution modules serve the first-stage preliminary screening, and the binary feature map vector of the fully-connected module serves the second-stage precise retrieval. Combining deep features with a hierarchical retrieval mechanism preserves retrieval speed while improving accuracy, enabling fast and accurate retrieval in a large-scale image library.
The method comprises the following steps:
The first step: setting the parameters of the feature extraction network:
A CNN is adopted as the feature extraction network; the network comprises several convolution modules and a fully-connected module. The network parameters are set as follows: first, the network is pre-trained for classification on a large database to determine suitable initial parameters; then, transfer learning is performed, training and fine-tuning the parameters on the target image library until they are optimal on the target dataset, which completes the determination of the feature extraction network parameters.
The second step: extracting the binary depth features of the image:
The image is input into the trained CNN, and binary-like deep features are extracted after the ReLU layers of the final convolution module and of the fully-connected module respectively. Exploiting the quasi-binary activation characteristic of the ReLU, which zeroes all negative inputs and passes non-negative inputs unchanged, the binary-like feature map vectors are obtained directly from the network.
Let $V^k = [V_1^k, V_2^k, \ldots, V_N^k]$ denote the binary-like feature map vector extracted from the k-th ReLU layer, which has N feature maps; the vector thus has N elements, and each element $V_i^k$ $(i = 1, \ldots, N)$ is an m × m feature map whose entries are all non-negative after the ReLU activation function. (Note: the size m of the feature map depends on the feature map size at that layer of the CNN.) By setting all non-zero elements in each feature map $V_i^k$ $(i = 1, \ldots, N)$ to 1, N standard binary feature maps $\hat{V}_i^k$ of size m × m are obtained, composing the standard binary feature map vector $\hat{V}^k = [\hat{V}_1^k, \hat{V}_2^k, \ldots, \hat{V}_N^k]$.
This step extracts n convolutional binary feature map vectors from the convolution modules and one fully-connected binary feature map vector.
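As a minimal sketch of this step (illustrative, not the patent's code), the binarization of one ReLU-layer output can be written in NumPy; the shape (N, m, m) is the layout assumed by the description above, and the random input simulates a real ReLU activation:

```python
import numpy as np

def binarize_feature_maps(relu_output: np.ndarray) -> np.ndarray:
    """Turn a ReLU-layer output V^k of shape (N, m, m) into the standard
    binary feature map vector: every non-zero (positive) element of each
    m x m feature map is set to 1."""
    assert (relu_output >= 0).all(), "ReLU outputs are non-negative by definition"
    return (relu_output > 0).astype(np.uint8)

# Example with a simulated ReLU output: N = 512 maps of size 7 x 7.
V_k = np.maximum(np.random.randn(512, 7, 7), 0.0)
V_hat_k = binarize_feature_maps(V_k)   # shape (512, 7, 7), entries in {0, 1}
```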
The third step: a preliminary screening retrieval stage:
The n convolutional binary feature map vectors $\hat{V}^k$ $(k = 1, 2, 3, \ldots, n)$ are used in the first-level preliminary screening stage. To enable fast retrieval, a summation operation is applied to each feature map in each feature map vector, $F_i^k = \mathrm{sum}(\hat{V}_i^k)$, i.e. the sum of all entries of the binary map $\hat{V}_i^k$, obtaining the compressed feature vectors $F^k = [F_1^k, F_2^k, \ldots, F_N^k]$ $(k = 1, \ldots, n)$, which are the n convolutional feature vectors. These n convolutional feature vectors are used separately to measure the similarity between the images in the target image library and the query image, giving the corresponding similarity sequences $S^k = \{S_1^k, S_2^k, \ldots, S_T^k\}$ $(k = 1, \ldots, n)$, where T is the number of images in the target image library. The similarities obtained from the n features are then fused into a final global similarity Sim by the similarity fusion method below; the Sim values are ranked from high to low, and the images whose global similarity Sim exceeds a threshold Th form the candidate image library $P = \{I_1, I_2, \ldots, I_M\}$.
The measuring method of the similarity comprises the following steps:
Let the feature vector of the query image be $F^{(q)} = [F_1^{(q)}, F_2^{(q)}, \ldots, F_N^{(q)}]$ and the feature vector of an image in the target image library be $F^{(t)} = [F_1^{(t)}, F_2^{(t)}, \ldots, F_N^{(t)}]$, where m is the size of the original feature map. The initial similarity S of the two feature vectors is set to 0. For each image in the image feature library, the absolute difference sub of the corresponding elements of $F^{(q)}$ and $F^{(t)}$ is computed, its range is judged, and the similarity is modified in turn by the following rules: if $sub \le m/2$, then S = S + 3; if $m/2 < sub < m$, then S = S + 2; if $m \le sub < 2m$, then S = S + 1; if $2m \le sub < 3m$, then S = S − 1; if $3m \le sub < 4m$, then S = S − 2; if $sub \ge 4m$, then S = S − 3. This yields the similarity sequence of all images in the target image library based on the k-th feature, $S^k = \{S_1^k, S_2^k, \ldots, S_T^k\}$, where T represents the total number of images in the target image library.
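A sketch of this compression and piecewise scoring in NumPy follows (an illustration under the definitions above, not the patent's reference implementation; the random vectors stand in for real library features):

```python
import numpy as np

def compress(V_hat: np.ndarray) -> np.ndarray:
    """F_i = sum(V_hat_i): reduce each binary m x m feature map to one count."""
    return V_hat.sum(axis=(1, 2))

def coarse_similarity(F_q: np.ndarray, F_t: np.ndarray, m: int) -> int:
    """Piecewise similarity S between compressed vectors F^(q) and F^(t)."""
    S = 0
    for sub in np.abs(F_q.astype(int) - F_t.astype(int)):
        if sub <= m / 2:
            S += 3
        elif sub < m:        # m/2 <  sub <  m
            S += 2
        elif sub < 2 * m:    # m   <= sub < 2m
            S += 1
        elif sub < 3 * m:    # 2m  <= sub < 3m
            S -= 1
        elif sub < 4 * m:    # 3m  <= sub < 4m
            S -= 2
        else:                # sub >= 4m
            S -= 3
    return S

# Example: score one library image against the query (m = 7, N = 512).
m = 7
F_q = compress((np.random.randn(512, m, m) > 0).astype(np.uint8))
F_t = compress((np.random.randn(512, m, m) > 0).astype(np.uint8))
print(coarse_similarity(F_q, F_t, m))
```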
The similarity fusion method comprises the following steps:
First, the n similarity sequences are min–max normalized, $\hat{S}_t^k = \dfrac{S_t^k - \min_t S_t^k}{\max_t S_t^k - \min_t S_t^k}$, giving the normalized similarity sequences $\hat{S}^k = \{\hat{S}_1^k, \hat{S}_2^k, \ldots, \hat{S}_T^k\}$. Because the area under the sorted similarity sequence curve is inversely correlated with the retrieval performance of the feature, the similarity fusion weight of the k-th feature is computed as $w_k = \dfrac{1/A_k}{\sum_{j=1}^{n} 1/A_j}$, where $A_k = \sum_{t=1}^{T} \hat{S}_t^k$ is the area under the sorted normalized similarity curve of the k-th feature. Finally, the fused global similarity between a target-set image t and the query image q is $Sim(q, t) = \sum_{k=1}^{n} w_k \hat{S}_t^k$.
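The fusion step can be sketched compactly in NumPy (illustrative; the inverse-area weighting follows the reconstruction above, and the edge case of a constant similarity sequence is not handled):

```python
import numpy as np

def fuse_similarities(S: np.ndarray) -> np.ndarray:
    """Fuse n per-feature similarity sequences into the global similarity.
    S has shape (n, T): row k is the sequence S^k over the T library images."""
    # Min-max normalize each sequence.
    S_hat = (S - S.min(axis=1, keepdims=True)) / \
            (S.max(axis=1, keepdims=True) - S.min(axis=1, keepdims=True))
    # Area under each sorted normalized curve (sorting does not change the sum).
    A = S_hat.sum(axis=1)
    w = (1.0 / A) / (1.0 / A).sum()          # w_k proportional to 1/A_k, summing to 1
    return (w[:, None] * S_hat).sum(axis=0)  # Sim_t = sum_k w_k * S_hat^k_t

# Example: fuse n = 3 similarity sequences over T = 1000 library images.
Sim = fuse_similarities(np.random.randint(-300, 1500, size=(3, 1000)).astype(float))
candidates = np.where(Sim > 0.5)[0]          # threshold Th as in the embodiment
```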
The fourth step: the precise retrieval stage:
To further improve retrieval accuracy, this stage performs retrieval on the candidate image library $P = \{I_1, I_2, \ldots, I_M\}$ obtained by the preceding stage, measuring similarity with the fully-connected binary feature vector through the Hamming distance: $sim(q, t) = N - H(q, t)$, where N is the total length of the fully-connected feature vector and $H(q, t)$ is the Hamming distance between the target image t and the query image q. The candidate image library is re-ordered according to this similarity to obtain the final retrieval result.
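A NumPy sketch of the precise retrieval stage (illustrative; the candidate identifiers and binary vectors below are hypothetical placeholders):

```python
import numpy as np

def hamming_similarity(b_q: np.ndarray, b_t: np.ndarray) -> int:
    """sim(q, t) = N - H(q, t) for two binary vectors of common length N."""
    return b_q.size - int(np.count_nonzero(b_q != b_t))

def precise_retrieval(b_q: np.ndarray, candidates: dict) -> list:
    """Re-rank the candidate set {image_id: binary fc vector} by similarity."""
    return sorted(candidates,
                  key=lambda img_id: hamming_similarity(b_q, candidates[img_id]),
                  reverse=True)

# Example with a 4096-bit fully-connected feature, as in the embodiment.
rng = np.random.default_rng(0)
b_q = rng.integers(0, 2, 4096, dtype=np.uint8)
cands = {f"img_{i}": rng.integers(0, 2, 4096, dtype=np.uint8) for i in range(5)}
print(precise_retrieval(b_q, cands))
```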
The invention has the beneficial effects that:
the invention realizes a hierarchical retrieval mechanism by utilizing deep characteristics of a multilayer neural network. Compared with the existing retrieval method, the following beneficial effects can be obtained:
(1) Compared with manually extracted shallow features, the deep features extracted by a deep neural network are closer to human semantic understanding, bridging the "semantic gap"; they represent the content information of the image better and greatly improve retrieval accuracy.
(2) The invention completes the compression and binarization of features with the nonlinear activation function inside the neural network itself: the quasi-binary activation characteristic of the activation function performs the binary conversion of complex features, avoiding the high-complexity compression coding of existing feature processing methods. When compressing the two-dimensional convolutional features, the operation treats each two-dimensional feature map as a unit, so the compression never detaches from the two-dimensional structure.
(3) The invention adopts a hierarchical retrieval mechanism that introduces deep convolutional-layer features into the first-stage retrieval: the multi-layer deep convolutional features screen a candidate target set out of the large-scale target dataset, reducing the number of targets for the second-stage retrieval. Compared with existing CNN-based retrieval methods, this hierarchical scheme exploits the CNN's feature information to the fullest and improves retrieval accuracy while spreading out the system's computation, which also favors parallel acceleration with NVIDIA's Compute Unified Device Architecture (CUDA).
Drawings
FIG. 1 is an overall flow diagram of the image retrieval system of the present invention;
FIG. 2 is a schematic diagram of a VGG deep convolutional neural network structure according to the present invention;
FIG. 3 is a diagram of a similarity metric algorithm for the first stage of the search according to the present invention;
FIG. 4 is a graph of normalized similarity curves based on different features;
Detailed Description
FIG. 1 is an overall flow diagram of the image retrieval system of the present invention. The image retrieval process is divided into four steps:
step one, setting parameters of a feature extraction network:
and a VGG network architecture with a deeper layer number in the CNN is adopted as a feature extraction network.
Fig. 2 is a schematic diagram of a network structure of a VGG network.
The VGG network classifies input images with a multi-hidden-layer structure: three-channel images of size 224 × 224 enter through the input layer, image features are extracted by five convolution modules and a fully-connected module, and the output layer finally uses these features to output the probability of each class. Each of the first four convolution modules consists of a convolution layer followed by a ReLU layer, while the fifth convolution module contains three convolution-plus-ReLU structures, performing three convolution operations and extracting deep features. Because deep features are more advantageous for representing image content, the invention uses the features of the convolution layers conv5_1, conv5_2 and conv5_3 of the fifth convolution module and of the fully-connected layer fc7.
The setting process of the VGG network parameters is as follows:
1) The network is pre-trained on the large-scale ImageNet dataset (over 1.2 million images in 1000 classes) to determine suitable initial parameters. The network weight parameters W are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01, the network biases are initialized to 0, and the initial learning rate is set to 0.001 and reduced by a factor of 10 every 100 iterations; training iterates until the network loss function converges;
2) The pre-trained feature extraction network is then fine-tuned on the target image library, adjusting the network parameters to the target database to determine the feature extraction network parameters. The pre-trained parameters from the previous step serve as this step's initial parameters; the learning rate is reduced to 1e-5 and the whole network is fine-tuned, iterating until the network loss function converges, which completes the preparation of the feature extraction network.
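This schedule can be sketched with PyTorch/torchvision (an illustrative setup under the hyperparameters above, not the patent's code; the dummy data loader, class count and momentum value are placeholders for the target image library):

```python
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, TensorDataset

# Start from ImageNet-pretrained weights, standing in for the pre-training
# stage (there, lr = 0.001 with a 10x decay every 100 iterations, e.g. via
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)).
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")

num_classes = 100                                  # placeholder for the target library
model.classifier[6] = nn.Linear(4096, num_classes)

# Fine-tune the whole network at the reduced learning rate 1e-5.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

# Dummy stand-in for the target image library loader.
target_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224),
                  torch.randint(0, num_classes, (8,))),
    batch_size=4)

model.train()
for images, labels in target_loader:               # iterate until the loss converges
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```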
Secondly, extracting binary depth features of the target image library and the query image:
The target image library and the query image are input into the VGG network trained in the previous step, and the binary-like deep feature map vectors $V^k = [V_1^k, V_2^k, \ldots, V_N^k]$ are extracted from the ReLU layers after the convolution layers conv5_1, conv5_2 and conv5_3 of the fifth convolution module and from the ReLU layer after fc7 in the final fully-connected module. By setting all non-zero elements of every feature map $V_i^k$ $(i = 1, \ldots, N)$ in each vector to 1, the standard binary feature map vectors $\hat{V}^k = [\hat{V}_1^k, \hat{V}_2^k, \ldots, \hat{V}_N^k]$ are obtained.
For each image, 3 convolutional binary feature map vectors and 1 fully-connected binary feature map vector are obtained.
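A sketch of this extraction using forward hooks on torchvision's VGG16 (illustrative; the layer indices assume the standard torchvision VGG16 layout, and with a 224 × 224 input the conv5 ReLU maps are 14 × 14 there, so the 7 × 7 size quoted in this embodiment would correspond to a different tap point or input size):

```python
import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()

# In torchvision's VGG16, the ReLUs after conv5_1 / conv5_2 / conv5_3 are
# features[25] / [27] / [29]; the ReLU after fc7 is classifier[4].
taps = {"conv5_1": model.features[25], "conv5_2": model.features[27],
        "conv5_3": model.features[29], "fc7": model.classifier[4]}

binary_feats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Binarize directly at the ReLU: non-zero -> 1 (standard binary map).
        binary_feats[name] = (output > 0).to(torch.uint8)
    return hook

for name, layer in taps.items():
    layer.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))      # stand-in for a real image batch

for name, feat in binary_feats.items():
    print(name, tuple(feat.shape))          # e.g. conv5_1 -> (1, 512, 14, 14), fc7 -> (1, 4096)
```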
Step three, a preliminary screening retrieval stage:
(1) The 3 convolutional binary feature map vectors $\hat{V}^k$ $(k = 1, 2, 3)$ are used in the first-level preliminary screening stage. A summation operation is applied to each feature map in each vector, $F_i^k = \mathrm{sum}(\hat{V}_i^k)$, obtaining the compressed feature vectors $F^k = [F_1^k, F_2^k, \ldots, F_N^k]$.
(2) The three convolutional feature vectors are used separately to measure the similarity between the images in the target image library and the query image, giving the three similarity sequences $S^k = \{S_1^k, S_2^k, \ldots, S_T^k\}$ $(k = 1, 2, 3)$ between the target image library and the query image.
FIG. 3 is a similarity metric algorithm chart for the preliminary screening search stage of the present invention.
The inputs are the feature vector $F^{(q)} = [F_1^{(q)}, \ldots, F_N^{(q)}]$ of the query image q and the feature vector $F^{(t)} = [F_1^{(t)}, \ldots, F_N^{(t)}]$ of image t in the target image library; for the convolution layers conv5_1, conv5_2 and conv5_3, the feature map size is m = 7 and N = 512. The initial similarity S is 0; for each image in the target image library, the absolute difference sub of the corresponding elements of $F^{(q)}$ and $F^{(t)}$ is computed and the similarity value is modified according to the range in which sub falls. The output is the similarity sequences $S^1, S^2, S^3$ of all images in the target image library based on the three convolution features.
(3) The similarities obtained from the three features are fused into the final global similarity Sim.
FIG. 4 is a normalized similarity graph based on different features.
Similarity sequences are computed for five features of the VGG network: the conv3, conv4, conv5_1, conv5_2 and conv5_3 convolutional layers. All similarity sequences are min–max normalized and then sorted from high to low, giving the curves of FIG. 4, where the abscissa is the rank index after re-sorting and the ordinate is the normalized similarity value. The retrieval performance of each feature is measured by its average precision (AP), shown per curve in the legend at the upper right; from top to bottom, the curves in the legend correspond to the features of the conv4, conv5_1, conv3, conv5_3 and conv5_2 convolutional layers, and a higher AP value reflects better retrieval performance. The curves show that features with better retrieval effect, generally the deeper layers, have normalized similarity curves lying closer to the coordinate axes with smaller areas under the curve; the similarity weights of the different features are therefore set using the inverse correlation between the area under each feature's normalized curve and that feature's individual retrieval performance.
(4) The images with global similarity Sim greater than the threshold Th = 0.5 form the candidate image library $P = \{I_1, I_2, \ldots, I_M\}$.
The fourth step: the precise retrieval stage:
the binary feature map vector of full-connected fc7 is used for the second stage of fine search. Based on candidate image library P ═ { I ═ I1,I2,...,IMThe stage searches for a hamming distance H (q, t) between a target image t and a query image q to obtain a similarity sim (q, t) ═ N-H (q, t), where N ═ 4096 is the length of the fc7 feature vector. And sorting the images in the candidate set from big to small according to the similarity to obtain a final retrieval result.
The image retrieval method of the invention is compared with other retrieval methods. Tables 1 and 2 give the comparison results for feature compression methods on the public datasets INRIA Holidays and Oxford Buildings respectively; retrieval performance is measured by average precision (AP). On both public test sets, the proposed feature compression method performs well across features and retrieves more accurately than the six other common feature compression methods.
We also compared the image retrieval system of the invention with a conventional retrieval system based on VLAD features, a retrieval system based on fully-connected-layer features, and a retrieval system based on fully-connected-layer features after sum down-sampling and PCA dimensionality reduction, which performs well among CNN-based methods; the results are shown in Table 3. The invention improves retrieval accuracy without increasing system complexity, and its retrieval accuracy is clearly higher than that of the comparison systems in the table.
TABLE 1 Comparison of feature compression methods on the INRIA Holidays dataset
TABLE 2 Comparison of feature compression methods on the Oxford Buildings dataset
TABLE 3 Retrieval accuracy comparison of different image retrieval systems

Claims (1)

1. A hierarchical image retrieval method based on depth features of a convolutional neural network is characterized by comprising the following steps:
The first step: setting the parameters of the feature extraction network:
a convolutional neural network is adopted as a feature extraction network, and a transfer learning method is adopted to set network parameters:
performing classification pre-training on the network on a large database to determine appropriate initial parameters of the network;
secondly, training and fine-tuning network parameters on a target image library to enable the network parameters to be optimal on a target data set, and determining the characteristic extraction network parameters;
The second step: extracting the binary depth features of the image:
(I) extracting the binary-like feature vector from the network: the image is input into the trained convolutional neural network, and binary-like deep features are extracted after the Rectified Linear Unit (ReLU) layers of the final convolution module and of the fully-connected module respectively; let the binary-like feature map vector extracted from the k-th ReLU layer be $V^k = [V_1^k, V_2^k, \ldots, V_N^k]$, where the layer has N feature maps and each feature map $V_i^k$ $(i = 1, \ldots, N)$ has size m × m;
(II) binarizing the feature vector: by setting all non-zero elements in each feature map $V_i^k$ $(i = 1, \ldots, N)$ to 1, each $V_i^k$ is converted into a standard binary feature map $\hat{V}_i^k$, giving the standard binary feature map vector $\hat{V}^k = [\hat{V}_1^k, \hat{V}_2^k, \ldots, \hat{V}_N^k]$;
Extracting n convolution binary characteristic diagram vectors and a full-connection binary characteristic diagram vector in a convolution module;
the third step: a preliminary screening retrieval stage:
the n convolutional binary feature map vectors $\hat{V}^k$ $(k = 1, 2, \ldots, n)$ are used in the first-level preliminary screening;
(I) feature vector compression: a summation operation is performed on each feature map in each convolutional feature vector, $F_i^k = \mathrm{sum}(\hat{V}_i^k)$, obtaining the compressed feature vector $F^k = [F_1^k, F_2^k, \ldots, F_N^k]$;
(II) similarity measurement: respectively using the n convolution characteristic vectors to measure the similarity between the image in the target image library and the query image, wherein the measuring method of the similarity comprises the following steps:
let the feature vector of the query image be $F^{(q)} = [F_1^{(q)}, F_2^{(q)}, \ldots, F_N^{(q)}]$ and the feature vector of an image in the target image library be $F^{(t)} = [F_1^{(t)}, F_2^{(t)}, \ldots, F_N^{(t)}]$, where m is the size of the original feature map; the initial similarity S of the two feature vectors is set to 0;
(1) for each target image in the library, compute the absolute difference sub of the corresponding elements of $F^{(q)}$ and $F^{(t)}$;
(2) for each element difference sub of the feature vectors, modify the similarity in turn according to the following rules:
if $sub \le m/2$, then S = S + 3; if $m/2 < sub < m$, then S = S + 2;
if $m \le sub < 2m$, then S = S + 1; if $2m \le sub < 3m$, then S = S − 1;
if $3m \le sub < 4m$, then S = S − 2; if $sub \ge 4m$, then S = S − 3;
obtaining the similarity sequence of all images in the target image library based on the k-th feature, $S^k = \{S_1^k, S_2^k, \ldots, S_T^k\}$, where T denotes the total number of images in the target image library;
(III) multi-feature similarity fusion: the similarities obtained from the n features are fused into a final global similarity Sim; the similarity fusion method is as follows:
(1) min–max normalize the n similarity sequences, $\hat{S}_t^k = \dfrac{S_t^k - \min_t S_t^k}{\max_t S_t^k - \min_t S_t^k}$, obtaining the normalized similarity sequences $\hat{S}^k = \{\hat{S}_1^k, \hat{S}_2^k, \ldots, \hat{S}_T^k\}$;
(2) because the area under the sorted similarity sequence curve is inversely correlated with the retrieval performance of the feature, the similarity fusion weight of the k-th feature is computed as $w_k = \dfrac{1/A_k}{\sum_{j=1}^{n} 1/A_j}$, where $A_k = \sum_{t=1}^{T} \hat{S}_t^k$ is the area under the sorted normalized similarity curve of the k-th feature;
(3) the fused global similarity between the target-set image t and the query image q is $Sim(q, t) = \sum_{k=1}^{n} w_k \hat{S}_t^k$;
Sim is sorted from high to low, and the images whose global similarity Sim is greater than the threshold Th form the candidate image set $P = \{I_1, I_2, \ldots, I_M\}$;
The fourth step: the precise retrieval stage:
based on the candidate image set $P = \{I_1, I_2, \ldots, I_M\}$, the similarity is measured with the fully-connected binary feature vector through the Hamming distance: $sim(q, t) = N - H(q, t)$, where N is the total length of the fully-connected feature vector and $H(q, t)$ is the Hamming distance between the target image t and the query image q; the candidate image set is sorted by this similarity to obtain the final retrieval result.
CN201810066649.1A 2018-01-24 2018-01-24 Hierarchical image retrieval method based on depth features of convolutional neural network Active CN108280187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810066649.1A CN108280187B (en) 2018-01-24 2018-01-24 Hierarchical image retrieval method based on depth features of convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810066649.1A CN108280187B (en) 2018-01-24 2018-01-24 Hierarchical image retrieval method based on depth features of convolutional neural network

Publications (2)

Publication Number Publication Date
CN108280187A CN108280187A (en) 2018-07-13
CN108280187B true CN108280187B (en) 2021-06-01

Family

ID=62804798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810066649.1A Active CN108280187B (en) 2018-01-24 2018-01-24 Hierarchical image retrieval method based on depth features of convolutional neural network

Country Status (1)

Country Link
CN (1) CN108280187B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657082B (en) * 2018-08-28 2022-11-29 武汉大学 Remote sensing image multi-label retrieval method and system based on full convolution neural network
CN109712140B (en) * 2019-01-02 2021-01-26 国电内蒙古东胜热电有限公司 Method and device for training fully-connected classification network for leakage detection
CN110110748B (en) * 2019-03-29 2021-08-17 广州思德医疗科技有限公司 Original picture identification method and device
CN110069644B (en) * 2019-04-24 2023-06-06 南京邮电大学 Compressed domain large-scale image retrieval method based on deep learning
CN112308102B (en) * 2019-08-01 2022-05-17 北京易真学思教育科技有限公司 Image similarity calculation method, calculation device, and storage medium
CN111177446B (en) * 2019-12-12 2023-04-25 苏州科技大学 Method for searching footprint image
CN111325712B (en) * 2020-01-20 2024-01-23 北京百度网讯科技有限公司 Method and device for detecting image validity
CN112989093A (en) * 2021-01-22 2021-06-18 深圳市商汤科技有限公司 Retrieval method and device and electronic equipment
CN113349792B (en) * 2021-05-31 2022-10-11 平安科技(深圳)有限公司 Method, apparatus, device and medium for classifying multi-lead electrocardiosignal
CN113886629B (en) * 2021-12-09 2022-02-25 深圳行动派成长科技有限公司 Course picture retrieval model establishing method
CN115129921B (en) * 2022-06-30 2023-05-26 重庆紫光华山智安科技有限公司 Picture retrieval method, apparatus, electronic device, and computer-readable storage medium

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7242802B2 (en) * 2002-07-01 2007-07-10 Xerox Corporation Segmentation method and system for Multiple Raster Content (MRC) representation of documents
CN104679863A (en) * 2015-02-28 2015-06-03 武汉烽火众智数字技术有限责任公司 Method and system for searching images by images based on deep learning
CN104834748A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Image retrieval method utilizing deep semantic to rank hash codes
CN105631296A (en) * 2015-12-30 2016-06-01 北京工业大学 Design method of safety face verification system based on CNN (convolutional neural network) feature extractor
CN106778526A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of extensive efficient face identification method based on Hamming distance
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN106997380A (en) * 2017-03-21 2017-08-01 北京工业大学 Imaging spectrum safe retrieving method based on DCGAN depth networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image Retrieval Technology Based on Deep Learning Representations; Sun Shaoyan; China Doctoral Dissertations Full-text Database; 2017-09-15; pp. 1-109 *

Also Published As

Publication number Publication date
CN108280187A (en) 2018-07-13

Similar Documents

Publication Publication Date Title
CN108280187B (en) Hierarchical image retrieval method based on depth features of convolutional neural network
Wang et al. Enhancing sketch-based image retrieval by cnn semantic re-ranking
CN111198959B (en) Two-stage image retrieval method based on convolutional neural network
CN107515895B (en) Visual target retrieval method and system based on target detection
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN106126581B (en) Cartographical sketching image search method based on deep learning
CN104036012B (en) Dictionary learning, vision bag of words feature extracting method and searching system
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN104392250A (en) Image classification method based on MapReduce
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
KR20130142191A (en) Robust feature matching for visual search
CN111177435B (en) CBIR method based on improved PQ algorithm
CN103186538A (en) Image classification method, image classification device, image retrieval method and image retrieval device
CN108984642A (en) A kind of PRINTED FABRIC image search method based on Hash coding
CN108763295A (en) A kind of video approximate copy searching algorithm based on deep learning
CN112163114B (en) Image retrieval method based on feature fusion
CN105760875B (en) The similar implementation method of differentiation binary picture feature based on random forests algorithm
CN113032613A (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
Majhi et al. An image retrieval scheme based on block level hybrid dct-svd fused features
Guo Research on sports video retrieval algorithm based on semantic feature extraction
CN111597367B (en) Three-dimensional model retrieval method based on view and hash algorithm
CN105117735A (en) Image detection method in big data environment
CN103049570B (en) Based on the image/video search ordering method of relevant Preserving map and a sorter

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211230

Address after: 410008 room 616, building h, tianjianyi square mile, No. 88, Section 1, Furong Middle Road, Kaifu District, Changsha City, Hunan Province

Patentee after: Yu Li

Address before: 410000 room 1721, building 6, Greenland Central Plaza, Yuelu District, Changsha City, Hunan Province

Patentee before: HUNAN SHUNMIAO COMMUNICATION TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20220608

Address after: 410205 room 608-52, headquarters building, Changsha CEC Software Park, No. 39, Jianshan Road, high tech Development Zone, Changsha, Hunan

Patentee after: Changsha Lansi Intelligent Technology Co.,Ltd.

Address before: 410008 room 616, building h, tianjianyi square mile, No. 88, Section 1, Furong Middle Road, Kaifu District, Changsha City, Hunan Province

Patentee before: Yu Li

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Hierarchical Image Retrieval Method Based on Convolutional Neural Network Depth Features

Effective date of registration: 20231208

Granted publication date: 20210601

Pledgee: Bank of Changsha Co., Ltd., Science and Technology Branch

Pledgor: Changsha Lansi Intelligent Technology Co.,Ltd.

Registration number: Y2023980070454

PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20210601

Pledgee: Bank of Changsha Co., Ltd., Science and Technology Branch

Pledgor: Changsha Lansi Intelligent Technology Co.,Ltd.

Registration number: Y2023980070454