CN114168782B - Deep hash image retrieval method based on triplet network - Google Patents
Deep hash image retrieval method based on triplet network
- Publication number
- CN114168782B (application CN202111448102.6A)
- Authority
- CN
- China
- Prior art keywords: image, images, hash, triplet, query
- Prior art date: 2021-11-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/583—Retrieval characterised by using metadata automatically derived from the content
- G06F16/51—Indexing; Data structures therefor; Storage structures
- G06F16/53—Querying
- G06F16/55—Clustering; Classification
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a deep hash image retrieval method based on a triplet network. The method adopts a triplet network as the CNN for learning image features and optimizes a triplet loss function according to the feature-learning results, so that the distance between sample image pairs of the same class is smaller than the distance between pairs of different classes, thereby optimizing the network model. The query image and the test-set images are input into the triplet network to obtain, for each image, a binary hash code and the fc7-layer deep feature vector. A coarse search is first performed with the binary hash codes: a candidate set of similar images is selected by comparing Hamming distances against a set threshold. A fine search is then performed with the fc7-layer deep features: the Euclidean distances between the fc7 feature vectors of the query image and the candidate images are computed, and the final n retrieval results are returned in ascending order of Euclidean distance. The invention solves the problem of low image retrieval precision in the prior art.
Description
Technical Field
The invention belongs to the technical field of computer image retrieval methods, and relates to a deep hash image retrieval method based on a triplet network.
Background
In recent years, hashing has been widely used in large-scale image retrieval: the high-dimensional features of an image are converted into binary hash codes in a low-dimensional space, and fast retrieval is performed by computing Hamming distances. This approach offers low storage cost and high computational efficiency. Traditional hashing methods first extract high-dimensional features, learn a hash function from those features, and then map the high-dimensional image features into low-dimensional hash codes. With the development of CNNs, deep hashing algorithms have emerged that combine CNN-extracted deep features with hash learning, so that hash codes representing deeper image information can be obtained directly. However, most existing deep hashing methods make insufficient use of supervision information and cannot obtain sufficiently discriminative hash codes, so the trained model does not reach the expected performance, which directly affects retrieval accuracy.
Providing an image retrieval method that improves the retrieval accuracy of hashing techniques therefore has substantial value.
Disclosure of Invention
The invention aims to provide a deep hash image retrieval method based on a triplet network, which solves the problem of low image retrieval precision in the prior art.
The technical scheme adopted by the invention is that the deep hash image retrieval method based on the triplet network is implemented according to the following steps:
step 1, adopting three parallel AlexNet models as the CNN network model, adding a hidden layer, namely a hash layer, after the last fully connected layer fc7 of each AlexNet model, with weight parameters shared among the three AlexNet models;
step 2, selecting the CIFAR-10 image dataset and dividing it into a training set and a test set, the training-set images serving as input to the CNN network model constructed in step 1; each input is a triplet comprising an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor;
step 3, inputting the anchor, positive, and negative images of each triplet into their respective AlexNet models, each image yielding an fc7-layer deep feature and, from the hash layer, a preliminary hash code representing the image features;
step 4, introducing a triplet loss function into the CNN network model and minimizing it over the preliminary hash codes generated in step 3, so that the distance between the anchor and positive images in feature space is smaller than the distance between the anchor and negative images, thereby optimizing the CNN network model;
and step 5, inputting the test-set images and the query image, treated as anchor images, into the anchor branch of the CNN network model optimized in step 4 to obtain the corresponding hash codes and fc7-layer deep features; computing the Hamming distance between the binary hash codes of the query image and each test-set image, regarding images whose Hamming distance is below a set threshold as similar, and taking the m most similar test-set images as the candidate set; then computing the Euclidean distance between the deep feature vectors of the query image and the candidate images, and returning the final retrieval result, namely the n images most similar to the query, in ascending order of Euclidean distance.
The present invention is also characterized in that,
the CIFAR-10 image data set is 60000, the training set is classified into 10 types, 5000 types are classified into each type, and the test set is classified into 10 types, and 1000 types are classified into each type.
In step 4, requiring the distance between the anchor image and the positive sample image in feature space to be smaller than the distance between the anchor image and the negative sample image means:
$$S(x_i, x_i^+) < S(x_i, x_i^-) \quad (2)$$
which is equivalent to
$$S^2(x_i, x_i^+) < S^2(x_i, x_i^-) \quad (3)$$
where $x_i$ is the anchor image, $x_i^+$ is a positive sample image, i.e. a sample image of the same class as $x_i$, $x_i^-$ is a negative sample image, i.e. a sample image of a different class from $x_i$, and $S(x, x')$ denotes the similarity of two images $x$ and $x'$; thus $S(x_i, x_i^+)$ is the similarity of the anchor and positive images and $S(x_i, x_i^-)$ the similarity of the anchor and negative images. Accordingly, the triplet loss function is defined as:
$$loss(x_i, x_i^+, x_i^-) = \max\{0,\ g + S^2(x_i, x_i^+) - S^2(x_i, x_i^-)\} \quad (4)$$
where $loss(x_i, x_i^+, x_i^-)$ is the triplet loss and $g$ is a threshold parameter measuring the separation between same-class and different-class image pairs.
$g = -1$.
The optimization of the CNN network model, namely the optimization of the triplet loss function, proceeds as follows:
Denote the training set $X = \{X_i\}$ with $X_i = (x_i, x_i^+, x_i^-)$. The optimization of the triplet loss is then translated into the minimization of the following objective function $L$:
$$L(W) = \sum_{i=1}^{n} loss(x_i, x_i^+, x_i^-) + \lambda \|W\|_2^2 \quad (5)$$
where $W$ denotes the neural network parameters, $\lambda$ is a regularization coefficient introduced to improve the generalization ability of the network model, and $n$ is the number of triplets in the training set;
a function $d(W, X_i)$ is introduced to represent the difference between the anchor-positive and anchor-negative pair distances within a triplet, i.e.
$$d(W, X_i) = S^2(x_i, x_i^+) - S^2(x_i, x_i^-) \quad (6)$$
The objective function may be further written as:
$$L(W) = \sum_{i=1}^{n} \max\{0,\ g + d(W, X_i)\} + \lambda \|W\|_2^2 \quad (7)$$
To minimize the objective function, its partial derivative is taken:
$$\frac{\partial L}{\partial W} = \sum_{i:\ g + d(W, X_i) > 0} \frac{\partial d(W, X_i)}{\partial W} + 2\lambda W \quad (8)$$
and from the definition of $d(W, X_i)$, its gradient is:
$$\frac{\partial d(W, X_i)}{\partial W} = 2\Big[\big(F(x_i^-) - F(x_i^+)\big)^{\top}\frac{\partial F(x_i)}{\partial W} + \big(F(x_i^+) - F(x_i)\big)^{\top}\frac{\partial F(x_i^+)}{\partial W} + \big(F(x_i) - F(x_i^-)\big)^{\top}\frac{\partial F(x_i^-)}{\partial W}\Big] \quad (9)$$
It follows that by inputting the triplet images into the network and performing forward and backward computation, the gradient of the triplet loss function is obtained; a gradient descent algorithm then minimizes the objective function, thereby optimizing the triplet loss and obtaining the optimal CNN network model.
In step 5, the Hamming distance between the binary hash codes of the query image and the test-set images is computed as follows:
Given an image $I$, it is input into the CNN network model optimized in step 4, and the output of the hidden layer, i.e. the hash layer, is extracted as the image feature, denoted $Out(H)$. Binarization with a set threshold $t$ yields the corresponding binary hash code: for each bit $j = 1, \ldots, h$, where $h$ is the number of nodes of the hidden layer, the output binary code is
$$H^j = \begin{cases} 1, & Out^j(H) \ge t \\ 0, & \text{otherwise} \end{cases} \quad (10)$$
Let $\Gamma = \{I_1, I_2, \ldots, I_n\}$ be the test set of images to be searched, with $I_n$ denoting the $n$-th test-set image; the corresponding binary codes are $\Gamma_H = \{H_1, H_2, \ldots, H_n\}$, where $H_i \in \{0,1\}^h$;
For the query image $I_q$ and its corresponding binary code $H_q$, the Hamming distance between $H_q$ and each $H_i \in \Gamma_H$ is computed; if the Hamming distance is smaller than the threshold, the images are regarded as similar, and according to the threshold setting the $m$ most similar images are finally obtained, giving the corresponding candidate image set
$$P = \{I_1^P, I_2^P, \ldots, I_m^P\} \quad (11)$$
In step 5, the Euclidean distance between the deep feature vectors of the query image and the candidate-set images is computed as follows:
For the query image $I_q$ and candidate image set $P$, let $V_q$ denote the feature vector of the query image and $V_i^P$ the feature vector of the $i$-th image in the candidate set; the Euclidean distance between their corresponding feature vectors is
$$s_i = \|V_q - V_i^P\| \quad (12)$$
The smaller the Euclidean distance, the more similar the two images; the $n$ most similar result images are finally obtained from the candidate set in ascending order of Euclidean distance, where $n$ is smaller than $m$.
The beneficial effects of the invention are as follows:
the invention selects the triplet network sharing weight to learn the image characteristics, can fully utilize the supervision information of the images to obtain the depth characteristics of the representative images, and further optimizes the network model by utilizing the triplet loss function, so that the distance between images of the same category is smaller, the distance between images of different categories is larger, and the binary hash code with more differentiation and higher query precision is obtained. When searching the image, firstly, rough searching is carried out by utilizing hash coding to obtain a candidate image set, then fine searching is carried out by utilizing fc7 layer depth characteristics of the image, and finally, n most similar images are obtained in the candidate image set, thereby effectively improving the searching precision.
Drawings
FIG. 1 is a schematic diagram of image feature learning in a deep hash image retrieval method based on a triplet network;
FIG. 2 is a flow chart of a deep hash image retrieval method based on a triplet network in accordance with the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a deep hash image retrieval method based on a triplet network, and the flow of the method is shown in figure 2. The method is implemented according to the following steps:
step 1, as shown in fig. 1, three parallel AlexNet models are adopted as the CNN network model, and a hidden layer, namely a hash layer, is added after the last fully connected layer fc7 of each AlexNet model, with weight parameters shared among the three AlexNet models, as sketched below;
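For illustration, the following is a minimal PyTorch sketch of this architecture. The class name, the 48-bit hash length, the sigmoid activation, and the pretrained initialization are illustrative assumptions, not specified by the patent; weight sharing among the three parallel branches is realized by applying one module to all three triplet images.

```python
import torch
import torch.nn as nn
from torchvision import models

class TripletHashNet(nn.Module):
    """One AlexNet branch with a hash layer after fc7. Applying this single
    module to the anchor, positive, and negative images realizes the three
    'parallel' branches with fully shared weights."""
    def __init__(self, hash_bits=48):  # 48-bit code is an illustrative choice
        super().__init__()
        alexnet = models.alexnet(weights="IMAGENET1K_V1")  # pretrained init (assumption)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep AlexNet up to and including fc7, dropping the fc8 classifier
        self.fc_to_fc7 = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        # hidden (hash) layer added after fc7; sigmoid keeps outputs in (0, 1)
        self.hash_layer = nn.Sequential(nn.Linear(4096, hash_bits), nn.Sigmoid())

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        fc7 = self.fc_to_fc7(x)       # 4096-d deep feature, used for fine search
        h = self.hash_layer(fc7)      # relaxed hash code, used for coarse search
        return fc7, h

net = TripletHashNet()
fc7_a, h_a = net(torch.randn(8, 3, 224, 224))  # anchor batch; same net serves pos/neg
```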
step 2, the CIFAR-10 image dataset is selected and divided into a training set and a test set, the training-set images serving as input to the CNN network model constructed in step 1. CIFAR-10 has 10 classes, with 5,000 training images and 1,000 test images per class. Each input is a triplet comprising an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor (see the sampling sketch below);
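A triplet-sampling sketch on CIFAR-10 might look as follows; random per-anchor selection of the positive and negative images is an assumption, since the patent does not specify a sampling strategy.

```python
import random
from collections import defaultdict
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class CifarTripletDataset(Dataset):
    """Yields (anchor, positive, negative) CIFAR-10 triplets: the positive
    shares the anchor's class, the negative comes from a different class."""
    def __init__(self, root="./data", train=True):
        tfm = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
        self.base = datasets.CIFAR10(root, train=train, download=True, transform=tfm)
        self.by_class = defaultdict(list)          # class label -> image indices
        for idx, label in enumerate(self.base.targets):
            self.by_class[label].append(idx)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        anchor, label = self.base[i]
        pos_idx = random.choice(self.by_class[label])
        neg_label = random.choice([c for c in self.by_class if c != label])
        neg_idx = random.choice(self.by_class[neg_label])
        return anchor, self.base[pos_idx][0], self.base[neg_idx][0]
```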
step 3, the anchor, positive, and negative images of each triplet are input into their respective AlexNet models, and each image yields an fc7-layer deep feature and, from the hash layer, a preliminary hash code representing the image features;
step 4, a triplet loss function is introduced into the CNN network model and minimized over the preliminary hash codes generated in step 3, so that the distance between the anchor and positive images in feature space is smaller than the distance between the anchor and negative images; that is, hash codes generated for images of the same class become more similar while those of different classes differ more, completing the optimization of the CNN network model;
The optimization process is as follows:
Assume that the relative distance relationships between sample images are represented by a series of image triplets $X_i = (x, x^+, x^-)$, where $x$ is the anchor image, $x^+$ is a positive sample, i.e. a sample image of the same class as $x$, and $x^-$ is a negative sample, i.e. a sample image of a different class from $x$. Based on distance in Euclidean space, the similarity of two images $x$ and $x'$ is defined as:
$$S(x, x') = \|F(x) - F(x')\|_2 \quad (1)$$
where $S(x, x')$ is the similarity of the two images and $F(\cdot)$ is the function mapping image data into Euclidean space; for a feature mapping implemented by a deep neural network, $F(x)$ is the network output for image $x$, i.e. its feature representation.
For a triplet during training, the distance between the anchor and positive images in feature space should be smaller than the distance between the anchor and negative images; that is, the output feature representations are expected to satisfy, under the $L_2$ norm:
$$S(x_i, x_i^+) < S(x_i, x_i^-) \quad (2)$$
which is equivalent to
$$S^2(x_i, x_i^+) < S^2(x_i, x_i^-) \quad (3)$$
where $x_i$ is the anchor image, $x_i^+$ is a positive sample image, i.e. a sample image of the same class as $x_i$, $x_i^-$ is a negative sample image, i.e. a sample image of a different class from $x_i$, and $S(x, x')$ denotes the similarity of two images $x$ and $x'$; thus $S(x_i, x_i^+)$ is the similarity of the anchor and positive images and $S(x_i, x_i^-)$ the similarity of the anchor and negative images. The squared form is used because it simplifies the derivative computation in the subsequent optimization. For a given triplet $\{X_i\}$ and a classification problem, it is desirable to maximize the separation between the distances of same-class and different-class image pairs, so the triplet loss function is defined as:
$$loss(x_i, x_i^+, x_i^-) = \max\{0,\ g + S^2(x_i, x_i^+) - S^2(x_i, x_i^-)\} \quad (4)$$
where $loss(x_i, x_i^+, x_i^-)$ is the triplet loss and $g$ is a threshold parameter measuring the separation between same-class and different-class image pairs, with $g = -1$.
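A direct PyTorch transcription of equation (4), applied to the hash-layer outputs of the three branches; averaging over the batch rather than summing is an implementation assumption.

```python
import torch

def triplet_hash_loss(h_a, h_p, h_n, g=-1.0):
    """Equation (4): max{0, g + S^2(x, x+) - S^2(x, x-)} per triplet, where
    S is the Euclidean distance between network outputs (equation (1)).
    g = -1 follows the patent; averaging over the batch is an assumption."""
    d_pos = (h_a - h_p).pow(2).sum(dim=1)   # S^2(x_i, x_i^+)
    d_neg = (h_a - h_n).pow(2).sum(dim=1)   # S^2(x_i, x_i^-)
    return torch.clamp(g + d_pos - d_neg, min=0).mean()
```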
The optimization of the CNN network model, namely the optimization of the triplet loss function, proceeds as follows:
Denote the training set $X = \{X_i\}$ with $X_i = (x_i, x_i^+, x_i^-)$. The optimization of the triplet loss is then translated into the minimization of the following objective function $L$:
$$L(W) = \sum_{i=1}^{n} loss(x_i, x_i^+, x_i^-) + \lambda \|W\|_2^2 \quad (5)$$
where $W$ denotes the neural network parameters, $\lambda$ is a regularization coefficient introduced to improve the generalization ability of the network model, and $n$ is the number of triplets in the training set;
a function $d(W, X_i)$ is introduced to represent the difference between the anchor-positive and anchor-negative pair distances within a triplet, i.e.
$$d(W, X_i) = S^2(x_i, x_i^+) - S^2(x_i, x_i^-) \quad (6)$$
The objective function may be further written as:
$$L(W) = \sum_{i=1}^{n} \max\{0,\ g + d(W, X_i)\} + \lambda \|W\|_2^2 \quad (7)$$
To minimize the objective function, its partial derivative is taken:
$$\frac{\partial L}{\partial W} = \sum_{i:\ g + d(W, X_i) > 0} \frac{\partial d(W, X_i)}{\partial W} + 2\lambda W \quad (8)$$
and from the definition of $d(W, X_i)$, its gradient is:
$$\frac{\partial d(W, X_i)}{\partial W} = 2\Big[\big(F(x_i^-) - F(x_i^+)\big)^{\top}\frac{\partial F(x_i)}{\partial W} + \big(F(x_i^+) - F(x_i)\big)^{\top}\frac{\partial F(x_i^+)}{\partial W} + \big(F(x_i) - F(x_i^-)\big)^{\top}\frac{\partial F(x_i^-)}{\partial W}\Big] \quad (9)$$
It follows that by inputting the triplet images into the network and performing forward and backward computation, the gradient of the triplet loss function is obtained; a gradient descent algorithm then minimizes the objective function, thereby optimizing the triplet loss, obtaining the optimal CNN network model, extracting the most essential deep features of the images, and yielding hash codes with higher discriminability and query precision. A minimal training sketch follows below.
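A minimal training-loop sketch under these equations, using the `triplet_hash_loss` sketch above; autograd supplies the gradients of equations (8)-(9), and the $\lambda\|W\|_2^2$ term is realized as SGD weight decay. All hyperparameter values here are illustrative assumptions.

```python
import torch

def train(net, loader, epochs=10, lr=1e-4, lam=1e-4):
    """Minimal loop for minimizing objective (7): autograd supplies the
    gradients of equations (8)-(9), and the lambda*||W||^2 term is realized
    as SGD weight decay. Hyperparameter values are illustrative."""
    opt = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=lam)
    for _ in range(epochs):
        for a, p, n in loader:
            _, h_a = net(a)               # hash-layer outputs of the triplet
            _, h_p = net(p)
            _, h_n = net(n)
            loss = triplet_hash_loss(h_a, h_p, h_n, g=-1.0)
            opt.zero_grad()
            loss.backward()               # gradient of the triplet loss
            opt.step()                    # gradient descent update
```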
Step 5: the test-set images and the query image, treated as anchor images, are input into the anchor branch of the CNN network model optimized in step 4 to obtain the corresponding hash codes and fc7-layer deep features. The Hamming distance between the binary hash codes of the query image and each test-set image is computed; images whose Hamming distance is below a set threshold are regarded as similar, and the m most similar test-set images form the candidate set. The Euclidean distance between the deep feature vectors of the query image and the candidate images is then computed, and the final retrieval result, namely the n images most similar to the query, is returned in ascending order of Euclidean distance;
the Hamming distance of binary hash codes corresponding to the query image and the test set image is calculated specifically as follows:
firstly, giving an image I in a test set, inputting a query image I into a CNN network model optimized in the step 5, extracting a hidden layer, namely, taking the output of a hash layer as an image characteristic, marking as Out (H), carrying Out binarization by setting a threshold value to obtain a corresponding binary hash code, namely, for each bit j=1..h, wherein H is the node number of the hidden layer, and the output binary code is as follows
Setting Γ= { I 1 ,I 2 ,...,I n Test set of images for retrieval, I n Representing the nth test set image, each corresponding binary code being represented as Γ H ={H 1 ,H 2 ,...,H n }, wherein H i ∈{0,1} h ;
For query image I q And its corresponding binary code H q Calculate H q And H i If H is the Hamming distance of q And H i ∈Γ H If the Hamming distance is smaller than the threshold value, the Hamming distance is considered as a similar image, and a candidate image set with m similar images is finally obtained according to the threshold value setting, so that a corresponding candidate image set is obtained
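A NumPy sketch of the coarse search, assuming the hash-layer outputs of the query and test set have already been collected into arrays; the binarization threshold t = 0.5, the Hamming threshold, and m are illustrative assumptions.

```python
import numpy as np

def coarse_retrieval(out_q, out_db, t=0.5, ham_thresh=8, m=100):
    """Coarse search: binarize hash-layer outputs with threshold t (eq. (10)),
    keep test images whose Hamming distance to the query is below ham_thresh,
    and return the m closest as the candidate set P (eq. (11))."""
    h_q = (out_q >= t).astype(np.uint8)            # query code H_q
    h_db = (out_db >= t).astype(np.uint8)          # test-set codes Gamma_H
    ham = np.count_nonzero(h_db != h_q, axis=1)    # Hamming distances
    keep = np.flatnonzero(ham < ham_thresh)        # similar by threshold
    return keep[np.argsort(ham[keep])][:m]         # indices of candidate set P
```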
The Euclidean distance between the deep feature vectors of the query image and the candidate-set images is computed as follows (a fine-search sketch is given below):
For the query image $I_q$ and candidate image set $P$, let $V_q$ denote the feature vector of the query image and $V_i^P$ the feature vector of the $i$-th image in the candidate set; the Euclidean distance between their corresponding feature vectors is
$$s_i = \|V_q - V_i^P\| \quad (12)$$
The smaller the Euclidean distance, the more similar the two images; the $n$ most similar result images are finally obtained from the candidate set in ascending order of Euclidean distance, where $n$ is smaller than $m$.
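A matching NumPy sketch of the fine search over the candidate set, assuming the fc7 feature vectors have been collected into an array.

```python
import numpy as np

def fine_retrieval(v_q, v_db, candidates, n=10):
    """Fine search: rank the candidate set by Euclidean distance between fc7
    feature vectors (eq. (12)) and return the n most similar images (n < m)."""
    dists = np.linalg.norm(v_db[candidates] - v_q, axis=1)  # s_i = ||V_q - V_i^P||
    return candidates[np.argsort(dists)][:n]                # ascending distance
```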
Claims (7)
1. The deep hash image retrieval method based on the triplet network is characterized by comprising the following steps:
step 1, adopting three parallel AlexNet models as the CNN network model, adding a hidden layer, namely a hash layer, after the last fully connected layer fc7 of each AlexNet model, with weight parameters shared among the three AlexNet models;
step 2, selecting the CIFAR-10 image dataset and dividing it into a training set and a test set, the training-set images serving as input to the CNN network model constructed in step 1; each input is a triplet comprising an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor;
step 3, inputting the anchor, positive, and negative images of each triplet into their respective AlexNet models, each image yielding an fc7-layer deep feature and, from the hash layer, a preliminary hash code representing the image features;
step 4, introducing a triplet loss function into the CNN network model and minimizing it over the preliminary hash codes generated in step 3, so that the distance between the anchor and positive images in feature space is smaller than the distance between the anchor and negative images, thereby optimizing the CNN network model;
and step 5, inputting the test-set images and the query image, treated as anchor images, into the anchor branch of the CNN network model optimized in step 4 to obtain the corresponding hash codes and fc7-layer deep features; computing the Hamming distance between the binary hash codes of the query image and each test-set image, regarding images whose Hamming distance is below a set threshold as similar, and taking the m most similar test-set images as the candidate set; then computing the Euclidean distance between the deep feature vectors of the query image and the candidate images, and returning the final retrieval result, namely the n images most similar to the query, in ascending order of Euclidean distance.
2. The deep hash image retrieval method based on the triplet network according to claim 1, wherein the CIFAR-10 image dataset contains 60,000 images in 10 classes, the training set having 5,000 images per class and the test set 1,000 images per class.
3. The deep hash image retrieval method based on the triplet network according to claim 1, wherein in step 4 the distance between the anchor image and the positive sample image in feature space being smaller than the distance between the anchor image and the negative sample image means:
$$S(x_i, x_i^+) < S(x_i, x_i^-) \quad (2)$$
which is equivalent to
$$S^2(x_i, x_i^+) < S^2(x_i, x_i^-) \quad (3)$$
where $x_i$ is the anchor image, $x_i^+$ is a positive sample image, i.e. a sample image of the same class as $x_i$, $x_i^-$ is a negative sample image, i.e. a sample image of a different class from $x_i$, and $S(x, x')$ denotes the similarity of two images $x$ and $x'$, so that $S(x_i, x_i^+)$ is the similarity of the anchor and positive images and $S(x_i, x_i^-)$ the similarity of the anchor and negative images; accordingly, the triplet loss function is defined as:
$$loss(x_i, x_i^+, x_i^-) = \max\{0,\ g + S^2(x_i, x_i^+) - S^2(x_i, x_i^-)\} \quad (4)$$
where $loss(x_i, x_i^+, x_i^-)$ is the triplet loss and $g$ is a threshold parameter measuring the separation between same-class and different-class image pairs.
4. A deep hash image retrieval method based on a triplet network according to claim 3, characterized in that g= -1.
5. The deep hash image retrieval method based on the triplet network according to claim 3, wherein the optimization of the CNN network model is the optimization of the triplet loss function, the specific process being:
denote the training set $X = \{X_i\}$ with $X_i = (x_i, x_i^+, x_i^-)$; the optimization of the triplet loss function is then translated into the minimization of the following objective function $L$:
$$L(W) = \sum_{i=1}^{n} loss(x_i, x_i^+, x_i^-) + \lambda \|W\|_2^2 \quad (5)$$
wherein $W$ denotes the neural network parameters, $\lambda$ is a regularization coefficient introduced to improve the generalization ability of the network model, and $n$ is the number of triplets in the training set;
a function $d(W, X_i)$ is introduced to represent the difference between the anchor-positive and anchor-negative pair distances within a triplet, i.e.
$$d(W, X_i) = S^2(x_i, x_i^+) - S^2(x_i, x_i^-) \quad (6)$$
the objective function is further written as:
$$L(W) = \sum_{i=1}^{n} \max\{0,\ g + d(W, X_i)\} + \lambda \|W\|_2^2 \quad (7)$$
to minimize the objective function, its partial derivative is:
$$\frac{\partial L}{\partial W} = \sum_{i:\ g + d(W, X_i) > 0} \frac{\partial d(W, X_i)}{\partial W} + 2\lambda W \quad (8)$$
and from the definition of $d(W, X_i)$, its gradient is:
$$\frac{\partial d(W, X_i)}{\partial W} = 2\Big[\big(F(x_i^-) - F(x_i^+)\big)^{\top}\frac{\partial F(x_i)}{\partial W} + \big(F(x_i^+) - F(x_i)\big)^{\top}\frac{\partial F(x_i^+)}{\partial W} + \big(F(x_i) - F(x_i^-)\big)^{\top}\frac{\partial F(x_i^-)}{\partial W}\Big] \quad (9)$$
whereby the triplet images are input into the network, forward and backward computation yields the gradient of the triplet loss function, and a gradient descent algorithm minimizes the objective function, thereby optimizing the triplet loss and obtaining the optimal CNN network model.
6. The deep hash image retrieval method based on the triplet network according to claim 4, wherein computing the Hamming distance between the binary hash codes of the query image and the test-set images in step 5 is specifically:
given an image $I$, it is input into the CNN network model optimized in step 4, and the output of the hidden layer, i.e. the hash layer, is extracted as the image feature, denoted $Out(H)$; binarization with a set threshold $t$ yields the corresponding binary hash code, i.e. for each bit $j = 1, \ldots, h$, where $h$ is the number of nodes of the hidden layer, the output binary code is
$$H^j = \begin{cases} 1, & Out^j(H) \ge t \\ 0, & \text{otherwise} \end{cases} \quad (10)$$
let $\Gamma = \{I_1, I_2, \ldots, I_n\}$ be the test set of images to be searched, with $I_n$ denoting the $n$-th test-set image; each corresponding binary code is represented as $\Gamma_H = \{H_1, H_2, \ldots, H_n\}$, wherein $H_i \in \{0,1\}^h$;
for the query image $I_q$ and its corresponding binary code $H_q$, the Hamming distance between $H_q$ and each $H_i \in \Gamma_H$ is computed; if the Hamming distance is smaller than the threshold, the images are regarded as similar, and according to the threshold setting the $m$ most similar images are finally obtained, giving the corresponding candidate image set
$$P = \{I_1^P, I_2^P, \ldots, I_m^P\} \quad (11)$$
7. The deep hash image retrieval method based on the triplet network according to claim 6, wherein computing the Euclidean distance between the deep feature vectors of the query image and the candidate-set images in step 5 is specifically:
for the query image $I_q$ and candidate image set $P$, let $V_q$ denote the feature vector of the query image and $V_i^P$ the feature vector of the $i$-th image in the candidate set; the Euclidean distance between their corresponding feature vectors is computed as
$$s_i = \|V_q - V_i^P\| \quad (12)$$
the smaller the Euclidean distance, the more similar the two images; the $n$ most similar result images are finally obtained from the candidate set in ascending order of Euclidean distance, wherein $n$ is smaller than $m$.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111448102.6A | 2021-11-30 | 2021-11-30 | Deep hash image retrieval method based on triplet network |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111448102.6A | 2021-11-30 | 2021-11-30 | Deep hash image retrieval method based on triplet network |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114168782A | 2022-03-11 |
| CN114168782B | 2024-03-15 |
Family
ID=80482302

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111448102.6A (Active) | Deep hash image retrieval method based on triplet network | 2021-11-30 | 2021-11-30 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114168782B (en) |
Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114417046B * | 2022-03-31 | 2022-07-12 | 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.) | Training method of feature extraction model, image retrieval method, device and equipment |
Patent Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017092183A1 * | 2015-12-03 | 2017-06-08 | 中山大学 (Sun Yat-sen University) | Image retrieval method based on variable-length deep hash learning |
| CN113190699A * | 2021-05-14 | 2021-07-30 | 华中科技大学 (Huazhong University of Science and Technology) | Remote sensing image retrieval method and device based on category-level semantic hash |
Non-Patent Citations (1)

| Title |
|---|
| Feng Xingjie; Cheng Yiwei. Image retrieval based on deep convolutional neural networks and hashing. Computer Engineering and Design, 2020, (03). * |
Also Published As

| Publication number | Publication date |
|---|---|
| CN114168782A | 2022-03-11 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |