Image retrieval method and device based on deep learning
Technical Field
The invention relates to the application of deep learning in the field of image processing, and particularly to a method and a device that use a deep neural network to produce binary hash codes of images and use those codes for retrieval.
Background
Image retrieval technology aims at the image content a user is interested in and presents related images to the user ranked from most to least similar according to a specific similarity measure. Its core problem is how to condense the information of an image into a feature descriptor that fully expresses the image's content.
Traditional image retrieval technology extracts image features from basic properties such as texture, color, and shape, and computes similarity with a corresponding image similarity measure. However, these basic image features cannot capture the semantic content of an image, while in practical applications the image content a user cares about is often semantic, such as the scenes and objects in the image. In addition, with the development of the internet era, image retrieval tasks have become large-scale and diversified, and a retrieval algorithm is required to find the images a user is interested in from massive image data in a shorter time, which makes image retrieval more difficult.
In view of the above problems, in recent years attempts have been made to extract image features with deep neural networks, achieving notable results by exploiting the strong image representation capability of deep neural networks, especially deep convolutional neural networks, together with the computational efficiency of binary hash codes. However, existing methods have limitations, especially in the configuration of the pooling layers. Pooling layers play an important role in the structural design of a deep neural network. As the number of layers grows, pooling on the one hand performs down-sampling, reducing interference in the image and extracting its important information; on the other hand, down-sampling reduces the number of pixels, which reduces the parameters of the network to a certain extent and alleviates the problems of excessive parameters, memory consumption, and difficult convergence. Existing network designs usually implement the pooling layer with a single pooling method, especially maximum pooling. Aiming at this defect, the invention proposes a deep neural network structure that combines multiple pooling methods to improve the image retrieval effect.
Disclosure of Invention
To address the limitations of existing deep-learning-based image retrieval methods, the invention provides a novel image retrieval method and device based on deep learning.
Aiming at the defect that conventional deep-learning-based image retrieval methods usually adopt a single pooling method in the network design process, the invention applies both maximum pooling and average pooling to the output of each layer during network structure design, thereby retaining much more of the image's semantic information and achieving a better image retrieval effect. The hash codes with semantic information obtained from the deep neural network are further combined with several traditional hash coding methods to obtain the final image retrieval result. The image retrieval method is based on deep learning: a deep neural network is constructed and trained to produce binary hash codes of images, which are then used in the retrieval process.
Specifically, the invention adopts the following technical scheme:
an image retrieval method based on deep learning, comprising the following steps:
1) constructing a deep neural network for extracting binary hash codes of images;
2) inputting the images of the training set into the deep neural network in batches for model training of the deep neural network, and storing the trained deep neural network model;
3) loading a deep neural network model, inputting all images of a training set into the deep neural network to obtain binary hash codes with semantic information, simultaneously obtaining the binary hash codes with visual information (texture, color and shape) by adopting a traditional binary hash coding method, and establishing a local feature library by utilizing the binary hash codes with the semantic information and the binary hash codes with the visual information;
4) inputting an image to be retrieved into a deep neural network model, acquiring a corresponding binary hash code with semantic information, acquiring the binary hash code with bottom-layer visual information of the image to be retrieved by adopting a traditional binary hash coding method, comparing the acquired binary hash code with a local feature library, and acquiring a retrieval result through similarity.
Further, the data set partitioning method for the deep neural network in step 1) is as follows: if a large number of available training images do not exist in the actual application scene, an image data set close to the actual application scene is additionally taken as the training set; otherwise, the existing data set is divided into a training set and a test set. Each image has its own labels, set according to actual requirements; the label of an image need not be unique.
Further, each pooling stage of the deep neural network in step 1) has both an average pooling layer and a maximum pooling layer; the number of feature maps output by each pooling stage increases with network depth, and feature maps output by the same pooling stage have the same size. All output feature maps of the pooling layers are concatenated at a specific network layer and passed through several convolutional layers again, realizing feature fusion of all the feature maps. In the output part of the deep neural network, a global average pooling layer converts the output into a vector, which is finally reduced to the specified binary hash code length through several fully connected layers.
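The dual-pooling design above can be sketched as follows — a minimal NumPy illustration using non-overlapping 2×2 windows; the function names and shapes are illustrative, not the patent's exact network:

```python
import numpy as np

def pool2d(x, k, mode):
    # x: (C, H, W) feature maps; k: pooling window and stride (non-overlapping)
    C, H, W = x.shape
    x = x[:, :H - H % k, :W - W % k]              # crop to a multiple of k
    x = x.reshape(C, H // k, k, W // k, k)        # expose k x k windows
    return x.max(axis=(2, 4)) if mode == "max" else x.mean(axis=(2, 4))

def dual_pool(x, k=2):
    # Apply max pooling and average pooling to the SAME input and concatenate
    # the results along the channel axis, so the information screened by both
    # methods is retained (the channel count doubles, spatial size shrinks).
    return np.concatenate([pool2d(x, k, "max"), pool2d(x, k, "avg")], axis=0)

x = np.random.rand(8, 32, 32)   # 8 feature maps of size 32 x 32
y = dual_pool(x, k=2)           # -> shape (16, 16, 16)
```

Per window, the max-pooled value is always at least the average-pooled value, so the two halves of the output carry genuinely different screenings of the same input.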
Further, in step 2) the deep neural network is trained with a contrastive loss function, calculated as follows:
a) Given an arbitrary batch of images I = {I_0, I_1, ..., I_{N-1}} with labels Y = {y_0, y_1, ..., y_{N-1}}, where N is the batch size, take any image pair I_i, I_j with i, j ∈ [0, N-1]. The loss value of the contrastive loss function for the pair is L_pair(I_i, I_j):

L_pair(I_i, I_j) = δ(I_i, I_j) · d(I_i, I_j) + (1 − δ(I_i, I_j)) · max(margin − d(I_i, I_j), 0)
where δ(I_i, I_j) indicates whether the image pair is similar, d(I_i, I_j) is the distance between the pair, and margin is the margin coefficient. δ(I_i, I_j) and d(I_i, I_j) are defined as follows:

δ(I_i, I_j) = 1 if the images I_i and I_j share at least one label, and 0 otherwise;

d(I_i, I_j) = Σ_{k=1}^{m} (y'_{i,k} − y'_{j,k})²
where y_i and y_j are the labels of the i-th and j-th images respectively, y'_{i,k} and y'_{j,k} are the output values of the neural network for the i-th and j-th images at the k-th node, and m is the number of output nodes of the neural network, numerically equal to the length of the extracted binary hash code;
b) The total loss value L over all image pairs within a batch is:

L = Σ_{0 ≤ i < j ≤ N−1} L_pair(I_i, I_j)

Substituting the definition of L_pair(I_i, I_j), L can be rewritten as:

L = Σ_{0 ≤ i < j ≤ N−1} [ δ(I_i, I_j) · d(I_i, I_j) + (1 − δ(I_i, I_j)) · max(margin − d(I_i, I_j), 0) ]
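The batch loss above can be sketched in NumPy — a minimal illustration assuming the standard contrastive loss form, with δ taken as label overlap (multi-hot labels) and d as squared Euclidean distance; the function name is illustrative:

```python
import numpy as np

def contrastive_batch_loss(outputs, labels, margin=4.0):
    # outputs: (N, m) real-valued network outputs
    # labels:  (N, L) multi-hot label vectors (labels need not be unique)
    N = outputs.shape[0]
    total = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            d = np.sum((outputs[i] - outputs[j]) ** 2)   # squared Euclidean distance
            similar = np.any(labels[i] * labels[j])       # share at least one label?
            if similar:
                total += d                                # pull similar pairs together
            else:
                total += max(margin - d, 0.0)             # push dissimilar pairs apart
    return total
```

For instance, with outputs [[0,0],[0,1],[3,0]] and labels [[1,0],[1,0],[0,1]], only the first pair is similar (loss d = 1), while both dissimilar pairs already exceed the margin of 4, so the batch loss is 1.0.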
further, step 3) combines the binary hash codes with semantic information obtained by the neural network and the binary hash codes with visual information obtained by adopting a plurality of traditional binary hash coding methods as the characteristics of the image.
Further, step 4) adopts the following similarity calculation method to measure the similarity between the images:
a) The image to be retrieved, I_test, is input into the deep neural network to obtain a binary hash code H_0 with semantic information:
H_0 = sign(Y')
where Y' is the output of the neural network and sign(·) is the sign function;
b) Assuming that n traditional image hash code generation methods are adopted, yielding binary hash codes H_1, H_2, ..., H_n respectively, the total binary hash code of I_test is H_test = {H_0, H_1, ..., H_n}. For each image I_train from the feature library, its binary hash code is H'_train = {H'_0, H'_1, ..., H'_n};
c) The similarity sim(I_test, I_train) is calculated using one of the following two schemes:
The first scheme:
The second scheme (a weighted combination of the per-code similarities):

sim(I_test, I_train) = Σ_{i=0}^{n} λ_i · s(H_i, H'_i)
where λ_i ∈ (0, 1] and λ_0 + λ_1 + ... + λ_n = 1, and s(H_i, H'_i) is:

s(H_i, H'_i) = (1/z_i) · Σ_{k=1}^{z_i} 1(h_k = h'_k)
where h_k and h'_k are the values of the k-th bit of H_i and H'_i respectively, and z_i is the number of bits of the i-th hash code.
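A minimal NumPy sketch of the encoding and similarity computation above, under the assumption that s(H_i, H'_i) is the fraction of agreeing bits and that the second scheme is a weighted sum; all names are illustrative:

```python
import numpy as np

def binarize(y_out):
    # H_0 = sign(Y'): map each real-valued output node to a bit;
    # outputs of exactly 0 are assigned +1 here (an implementation choice).
    return np.where(np.asarray(y_out) >= 0, 1, -1)

def s(h_i, h_i_prime):
    # Per-code similarity: fraction of the z_i bit positions at which the
    # two codes agree (equivalently 1 minus the normalized Hamming distance).
    return float(np.mean(np.asarray(h_i) == np.asarray(h_i_prime)))

def sim(codes_test, codes_train, lambdas):
    # Weighted combination of per-code similarities; the weights lambda_i
    # lie in (0, 1] and sum to 1.
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * s(a, b) for l, a, b in zip(lambdas, codes_test, codes_train))

h0 = binarize([0.7, -0.2, 0.0, -3.1])   # semantic code: [1, -1, 1, -1]
```

With two 2-bit codes weighted 0.5 each, one identical pair (s = 1) and one fully disagreeing pair (s = 0) give an overall similarity of 0.5.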
An image retrieval device based on deep learning, comprising:
the deep neural network building module is responsible for building a deep neural network and is used for extracting the binary Hash codes of the image;
the deep neural network training module is responsible for inputting the training set images into the deep neural network in batches for model training of the deep neural network and storing the trained deep neural network model;
the local feature library establishing module is responsible for loading a deep neural network model, inputting all images of a training set into the deep neural network, acquiring binary hash codes with semantic information, acquiring the binary hash codes with visual information by adopting a traditional binary hash coding method, and establishing a local feature library by utilizing the binary hash codes with the semantic information and the binary hash codes with the visual information;
and the retrieval module, which is responsible for inputting the image to be retrieved into the deep neural network, acquiring the corresponding binary hash code with semantic information, acquiring the binary hash code with visual information of the image to be retrieved by a traditional binary hash coding method, comparing the acquired binary hash codes with the local feature library, and obtaining the retrieval result by calculating similarity.
The semantic information refers to the image features obtained from the neural network. These features are generally considered high-level semantic features; that is, they are not merely a superficial visual expression of the image, but an expression of concepts such as objects and scenes.
The pooling scheme of the invention retains much more of the image's semantic information because multiple pooling methods are applied at the same network layer: the methods differ in how they screen information, and the information captured by each of them is preserved at every pooling stage.
The rapid image retrieval method based on deep learning emphasizes improving the structure of the network: by adopting multiple pooling methods it retains the important information of images as much as possible and improves retrieval accuracy, while the inherent computational advantage of binary hash codes enables fast and accurate retrieval in massive image data.
Drawings
FIG. 1 is a flow chart of the main steps of the method of the present invention.
Fig. 2 is a network structure diagram of the deep neural network of the present invention.
Detailed Description
The present invention will be described in detail below with reference to examples and the accompanying drawings.
FIG. 1 is a flow chart of the main steps of the method of the present invention.
Step 1: if a large number of available training images do not exist in the actual application scene, an image data set close to the application scene is additionally taken as the training set; otherwise, the existing data set is divided into a training set and a test set. Each image has its own labels, set according to actual requirements; the label of an image need not be unique.
Step 2: construct the deep neural network shown in fig. 2. Each pooling stage of the network has both an average pooling layer and a maximum pooling layer; the number of feature maps output by each pooling stage increases with network depth, and the feature maps output by the same pooling stage have the same size. All output feature maps of the pooling layers are concatenated (the "splicing" operation in fig. 2) at a specific network layer and passed through several convolutional layers again, realizing feature fusion of all the feature maps. In the output part of the network, a global average pooling layer converts the output into a vector, which is finally reduced to the specified binary hash code length through several fully connected layers.
Step 3: randomly divide the training set images into several small batches, input them into the neural network, and train the deep neural network with the contrastive loss function. Given an arbitrary batch of images I = {I_0, I_1, ..., I_{N-1}} with labels Y = {y_0, y_1, ..., y_{N-1}}, where N is the batch size, take any image pair I_i, I_j with i, j ∈ [0, N-1]. The loss value of the contrastive loss function for the pair is L_pair(I_i, I_j):

L_pair(I_i, I_j) = δ(I_i, I_j) · d(I_i, I_j) + (1 − δ(I_i, I_j)) · max(margin − d(I_i, I_j), 0)
where δ(I_i, I_j) indicates whether the image pair is similar, d(I_i, I_j) is the distance between the pair, and margin is the margin coefficient. δ(I_i, I_j) and d(I_i, I_j) are defined as follows:

δ(I_i, I_j) = 1 if the images I_i and I_j share at least one label, and 0 otherwise;

d(I_i, I_j) = Σ_{k=1}^{m} (y'_{i,k} − y'_{j,k})²
where y_i and y_j are the labels of the i-th and j-th images respectively, y'_{i,k} and y'_{j,k} are the output values of the neural network for the i-th and j-th images at the k-th node, and m is the number of output nodes of the neural network, numerically equal to the length of the extracted binary hash code.
Therefore, the total loss value L over all image pairs within a batch is:

L = Σ_{0 ≤ i < j ≤ N−1} L_pair(I_i, I_j)

Substituting the definition of L_pair(I_i, I_j), L can be rewritten as:

L = Σ_{0 ≤ i < j ≤ N−1} [ δ(I_i, I_j) · d(I_i, I_j) + (1 − δ(I_i, I_j)) · max(margin − d(I_i, I_j), 0) ]
and 4, step 4: loading a trained neural network model, inputting all training set images into a neural network to obtain corresponding binary Hash codes with semantic information, and simultaneously carrying out Hash coding on the images of the training set by adopting various traditional Hash coding methods such as common Hash, locality sensitive Hash and the like to obtain the binary Hash codes with visual information. And combining the two binary hash codes to establish a local image feature library.
Step 5: input the test set images into the neural network and repeat step 4 to obtain the two kinds of binary hash codes; compare them with the binary hash codes of all training images in the local feature library and return the retrieval results to the user ranked from most to least similar. The similarity is calculated as follows:
given test image ItestInputting the binary hash code into a neural network to obtain a binary hash code H with semantic information0:
H_0 = sign(Y')
where Y' is the output of the neural network and sign(·) is the sign function.
Assuming that n traditional image hash code generation methods are adopted, the obtained binary hash codes with visual information are H_1, H_2, ..., H_n respectively, and the total binary hash code of I_test is H_test = {H_0, H_1, ..., H_n}. For each image I_train from the feature library, its binary hash code is H'_train = {H'_0, H'_1, ..., H'_n}. The similarity sim(I_test, I_train) is calculated by one of the following two schemes:
The first scheme:
The second scheme (a weighted combination of the per-code similarities):

sim(I_test, I_train) = Σ_{i=0}^{n} λ_i · s(H_i, H'_i)
where λ_i ∈ (0, 1] and λ_0 + λ_1 + ... + λ_n = 1, and s(H_i, H'_i) is:

s(H_i, H'_i) = (1/z_i) · Σ_{k=1}^{z_i} 1(h_k = h'_k)
where h_k and h'_k are the values of the k-th bit of H_i and H'_i respectively, and z_i is the number of bits of the i-th hash code.
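The comparison against the feature library described above can be illustrated with a brute-force NumPy sketch, where similarity is taken as the fraction of agreeing bits; all names and data are illustrative:

```python
import numpy as np

def rank_library(h_test, library, top_k=3):
    # Brute-force retrieval: score every library code against the query
    # by the fraction of agreeing bits, then rank most-similar first.
    sims = np.array([np.mean(h_test == h) for h in library])
    order = np.argsort(-sims)            # negate for descending similarity
    return order[:top_k], sims[order[:top_k]]

query = np.array([1, -1, 1, 1, -1, -1, 1, -1])
library = np.array([
    [1, -1, 1, 1, -1, -1, 1, -1],   # identical to the query
    [1, 1, 1, 1, -1, -1, 1, -1],    # differs in 1 of 8 bits
    [-1, 1, -1, -1, 1, 1, -1, 1],   # complement of the query
])
idx, scores = rank_library(query, library)
# idx ranks indices 0, 1, 2; similarities 1.0, 0.875, 0.0
```

In practice the bit-agreement score can be computed with fast XOR/popcount operations on packed codes, which is the computational advantage of binary hash codes that the method relies on.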
In order to verify the effectiveness of the method, comparison experiments are carried out on the public data sets CIFAR-10 and NUS-WIDE against existing methods.
The CIFAR-10 dataset contains 60,000 images in 10 categories, officially split into 50,000 training images and 10,000 test images. The invention uses the given 50,000 training images to train the deep neural network and randomly partitions the 10,000 test images into a query set of 1,000 images and a queried image set of 9,000 images.
The NUS-WIDE dataset contains 195,834 images from Flickr in a total of 81 classes. The invention uses the most common partitioning method on this dataset: the 21 most common categories are selected, comprising 195,834 images, with each category containing at least 5,000 images. Then 10,000 images are randomly selected as the test set and the remaining images are used as the training set. Finally, 1,000 images are randomly selected from the test set as the query set, and the remaining 9,000 images are used as the queried image set.
In order to simplify the experiments, only the proposed binary hash codes with semantic information are used for comparison, and the average accuracy of the various methods is examined at different hash code lengths. The results are shown in Table 1. On both the CIFAR-10 and NUS-WIDE datasets, the proposed method achieves a better image retrieval effect than the existing state-of-the-art methods.
TABLE 1 average accuracy of image retrieval at different Hash code lengths
In Table 1 above, the CNNH+ method references "Supervised Hashing for Image Retrieval via Image Representation Learning"; the DNNH method references "Simultaneous Feature Learning and Hash Coding with Deep Neural Networks"; the DLBHC method references "Deep Learning of Binary Hash Codes for Fast Image Retrieval"; the DSH method references "Deep Supervised Hashing for Fast Image Retrieval"; the SUBIC method references "SUBIC: A supervised, structured binary code for image search".
The pooling method employed in the present invention at each level is not limited to average pooling and maximum pooling, but may be a variety of combinations of existing pooling methods.
The above-mentioned embodiments and drawings only illustrate the technical principles of the invention and are not to be construed as limiting it. Those skilled in the art may make equivalent changes and modifications to the technical solution of the invention; the protection scope of the invention is defined by the claims.