CN112199533B - Unsupervised deep hash model training method and image retrieval method based on node characterization - Google Patents
Unsupervised deep hash model training method and image retrieval method based on node characterization
- Publication number: CN112199533B
- Application number: CN202011100159.2A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/532 — Information retrieval of still image data; query formulation, e.g. graphical querying
- G06F16/583 — Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
Abstract
The invention discloses an unsupervised deep hash model training method and an image retrieval method based on node characterization. In the method, each image is regarded as a node, and the similarity of images is measured by the similarity distance between their initialization features. A self-encoder based on a graph convolutional network is designed to generate node characterization information for each image in an unsupervised manner according to the inter-image similarity; this characterization information then guides the learning of a hash function on a lightweight network, producing semantic hash codes for the images and improving unsupervised hash image retrieval performance. The invention effectively learns the similarity between images, and its image retrieval performance is better than that of current unsupervised hash methods.
Description
Technical Field
The invention belongs to the field of image retrieval, and in particular relates to an unsupervised depth hash model training method and an image retrieval method based on node characterization.
Background
Hash codes offer lightweight storage and efficient XOR-based comparison, which has made image hashing an important technique in the field of image retrieval.
Image-hash-based retrieval methods fall into supervised and unsupervised hash retrieval methods. A supervised hash method incorporates label information of the images, such as class labels, into the learning of the hash function, so semantic similarity between data can be better maintained and high-quality hash codes are generated. An unsupervised hash method needs no label information of the data: it trains the hash function by directly mining the internal connections of the data so as to preserve the original geometric-structure similarity of the data.
Although supervised hash retrieval methods achieve good retrieval results, in practical scenarios image labels are often unavailable, so unsupervised hash retrieval is more widely applicable. However, existing unsupervised hash retrieval methods do not sufficiently learn the similarity between images, which limits the quality of the hash codes and leads to poor image retrieval results.
Disclosure of Invention
In view of the defects and improvement needs of the prior art, the invention provides an unsupervised deep hash model training method and an image retrieval method based on node characterization, which aim to solve the technical problem that existing unsupervised hash retrieval methods learn the similarity between images insufficiently, limiting the quality of the hash codes and degrading image retrieval.
To achieve the above object, according to one aspect of the present invention, there is provided a training method of an unsupervised deep hash model based on node characterization, the unsupervised deep hash model comprising a ResNet-101 network, a GCN-based self-encoder AE, and a ResNet-18 network connected in sequence, the training method comprising the steps of:
(1) Acquire a training set formed by N images, and input the training set into the ResNet-101 network to extract the feature vector of each image from the pooling layer of the network; the feature vectors of all images form a vector set FV:
FV = {v_1, v_2, …, v_N},
wherein N is a positive integer, v_q denotes the q-th feature vector in the vector set FV, and q ∈ [1, N];
(2) Select the feature vectors of M images from the vector set FV obtained in step (1) to form M nodes, and construct a graph structure G = (X, E) from the M nodes, wherein X denotes the set of the M nodes with X = [x_1, x_2, …, x_M], and E denotes the set of edges, an edge e_ij ∈ E joining any two nodes x_i and x_j of the M nodes, where i and j are positive integers and i, j ∈ [1, M];
(3) Acquire the relationship matrix A of the M nodes from the graph structure G constructed in step (2), and iteratively train the GCN-based AE with the relationship matrix A and the M nodes until the AE converges, obtaining for each node x_k its g-dimensional characterization information g(x_k), wherein k ∈ [1, M] and g ∈ {16, 32, 48, 64, 96, 128};
(4) Use the node characterization information g(x_k) obtained in step (3) as label information to iteratively train the ResNet-18 network until it converges, obtaining the g-dimensional semantic hash code b_k corresponding to each node x_k.
Preferably, the encoding network comprises K GCN layers for updating node features, and the output of the encoding network is the g-dimensional characterization information g(x_k) of each node x_k, where K is any natural number, preferably 5.
Preferably, step (3) is specifically as follows:
First, the potential representation X^(l+1) of all M nodes at the (l+1)-th GCN layer of the encoding network is obtained, where l ∈ [1, K]:

X^(l+1) = f^l(Â X^l W^l),

wherein X^l denotes the potential representations of all M nodes at the l-th GCN layer of the encoding network, X^1 = X, W^l denotes the weight of the l-th GCN layer of the encoding network, f^l(·) denotes the nonlinear activation function of the l-th GCN layer of the encoding network, and Â denotes the normalized version of the relationship matrix A;
next, the output X̂^(l+1) of the (l+1)-th GCN layer of the decoding network is generated in the reconstruction space:

X̂^(l+1) = f̂^l(Â X̂^l Ŵ^l),

wherein X̂^l denotes the decoding result of the l-th GCN layer of the decoding network, Ŵ^l denotes the weight of the l-th GCN layer of the decoding network, and f̂^l(·) denotes the nonlinear activation function of the l-th GCN layer of the decoding network;
next, the loss function value L_1 is computed from the final output X̂ of the decoding network and the set X. The above process is repeated until the loss function value L_1 reaches its minimum, yielding the trained AE; the output of the K-th GCN layer of the encoding network of the AE at that moment is taken as the g-dimensional characterization information g(x_k) of each node x_k.
Preferably, the normalized version Â of the relationship matrix A equals:

Â = D^(-1/2) Ã D^(-1/2), with Ã = A + I_C,

wherein I_C is an identity matrix and D is a diagonal matrix satisfying D_ii = Σ_j Ã_ij.
The loss function value L_1 equals:

L_1 = ||X − X̂||_2^2,

wherein ||·||_2 denotes the two-norm.
Preferably, step (4) is specifically as follows:
First, the g-dimensional characterization information g(x_k) of node x_k obtained in step (3) is input into the ResNet-18 network; at the last fully connected layer of the ResNet-18 network, a mean square error function L_fc is used so that the output result fc(x_k) approximates the characterization information g(x_k) of node x_k;
the last activation layer of the ResNet-18 network then activates the output result fc(x_k), giving the activation result fch(x_k) = Tanh(fc(x_k)), wherein Tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x));
then, the loss function value L_2 of the ResNet-18 network is calculated from the mean square error function and the binary cross entropy loss function;
the above process is then repeated until the loss function value L_2 reaches its minimum, yielding the trained ResNet-18 network; at that moment the binary cross entropy loss function L_fch is used to binarize the activation result fch(x_k), generating the g-dimensional semantic hash code of each node x_k ∈ X, i.e., the g-dimensional semantic hash code corresponding to each image in step (1).
Preferably, the mean square error function is

L_fc = ||fc(x_k) − g(x_k)||_2^2,

and the binary cross entropy loss function is:

L_fch = −Σ_{i=1}^{g} [ b_k^i · log(fch^i(x_k)) + (1 − b_k^i) · log(1 − fch^i(x_k)) ],

wherein the coefficient b_k = 0.5 × (sign(fch(x_k)) + 1), sign(·) is the sign function, b_k^i denotes the i-th bit of b_k, and fch^i(x_k) denotes the i-th bit of fch(x_k).
Preferably, the loss function value L_2 of the ResNet-18 network equals:

L_2 = μ·L_fc + ω·L_fch,

where μ and ω are hyper-parameters satisfying μ + ω = 1.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) Because the invention uses steps (2) and (3) in the model training stage — each image is first regarded as a node, the nodes are organized into a graph structure, and the GCN then learns the association relationships among different nodes and fuses them into the characterization information of the nodes — it solves the technical problem that existing unsupervised hash retrieval methods learn the similarity between images insufficiently, limiting hash code quality and degrading image retrieval.
(2) Because the invention adopts step (4) in the model training stage, node characterization information is used as label information to guide the training of the hash function; and since the ResNet-18 network is lightweight, the semantic similarity between hash codes is guaranteed to maintain the correlation between nodes, and high-quality hash codes can be generated efficiently.
According to another aspect of the present invention, there is provided an image retrieval method including the steps of:
(1) Obtaining images to be retrieved from a user, and obtaining each contrast image from a database;
(2) Generate the g-dimensional semantic hash codes corresponding to the image to be retrieved and to each contrast image respectively, using the unsupervised deep hash model obtained by training with the above training method based on node characterization;
(3) Perform an exclusive OR operation between the g-dimensional semantic hash code of the image to be retrieved and the g-dimensional semantic hash code of each contrast image, so as to calculate the Hamming distance between the two codes;
(4) From all the Hamming distances obtained in step (3), select the contrast images whose Hamming distance falls within the specified Hamming radius and output them as the final image retrieval result.
Preferably, the specified hamming radius range is {1,2,3}, preferably 2.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
The invention combines a GCN and an AutoEncoder to generate latent characterization information of images in an unsupervised manner, and uses a lightweight ResNet-18 network to fit the node characterization information and generate semantic hash codes of the images, thereby improving unsupervised hash image retrieval performance and solving the problem that existing schemes cannot fully learn the similarity between images.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the training method of the unsupervised deep hash model based on node characterization of the present invention;
FIG. 2 is a flow chart of a method of training an unsupervised deep hash model based on node characterization of the present invention;
FIG. 3 is a flow chart of an image retrieval method provided by the invention;
FIG. 4 is a search result on a CIFAR-10 dataset using a 48-bit hash code length with a Hamming radius of 2 in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
The invention is realized on the basis of a graph convolutional network (GCN) and a self-encoder (AutoEncoder). By organizing the images into a graph structure, using the GCN to integrate inter-image similarity into the image features, and then generating the characterization information of the images with the AutoEncoder in an unsupervised manner, the image retrieval performance of the model is improved.
In the experiments of the invention, tests on several image datasets show that the advantages of the invention are more obvious on multi-label image datasets. The reason is that each image in a multi-label dataset carries richer semantic information; the GCN fully captures and learns the semantic similarity among images, generating more accurate node characterizations and thus high-quality semantic hash codes, which further improves the image retrieval performance of the model.
As shown in fig. 1 and 2, the present invention provides a training method of an unsupervised deep hash model based on node characterization, the unsupervised deep hash model comprising a ResNet-101 network, a self-encoder (AutoEncoder, AE) based on a graph convolutional network (Graph Convolutional Network, GCN), and a ResNet-18 network connected in sequence, the training method comprising the steps of:
(1) Acquire a training set formed by N images, and input the training set into the ResNet-101 network to extract the feature vector of each image from the pooling layer of the network; the feature vectors of all images form a vector set FV:
FV = {v_1, v_2, …, v_N},
wherein N is a positive integer, v_q denotes the q-th feature vector in the vector set FV, and q ∈ [1, N].
(2) Select the feature vectors of M images from the vector set FV obtained in step (1) to form M nodes, and construct a graph structure G = (X, E) from the M nodes, wherein X denotes the set of the M nodes with X = [x_1, x_2, …, x_M], and E denotes the set of edges, an edge e_ij ∈ E joining any two nodes x_i and x_j of the M nodes, where i and j are positive integers and i, j ∈ [1, M];
(3) Acquire the relationship matrix A of the M nodes from the graph structure G constructed in step (2), and iteratively train the GCN-based AE with the relationship matrix A and the M nodes until the AE converges, obtaining for each node x_k its g-dimensional characterization information g(x_k), wherein k ∈ [1, M] and g ∈ {16, 32, 48, 64, 96, 128};
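The patent measures image similarity by a similarity distance between initialization features but does not fix the measure. A minimal sketch of building the relationship matrix A, assuming cosine similarity and an illustrative threshold of 0.7 (both are assumptions, not the patent's prescription):

```python
import torch

def build_adjacency(x: torch.Tensor, threshold: float = 0.7) -> torch.Tensor:
    """x: (M, d) node features -> (M, M) binary relationship matrix A."""
    xn = torch.nn.functional.normalize(x, dim=1)
    sim = xn @ xn.t()                  # pairwise cosine similarity
    a = (sim >= threshold).float()     # edge where similarity is high enough
    a.fill_diagonal_(0.0)              # self-loops are added later via A + I
    return a

x = torch.randn(5, 16)
a = build_adjacency(x)
print(a.shape)  # torch.Size([5, 5])
```

The matrix is symmetric by construction, matching the undirected graph structure G = (X, E).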
The AE is composed of an encoding network and a decoding network. The encoding network comprises K GCN layers for updating node features, and its output is the g-dimensional characterization information g(x_k) of each node x_k, where K is any natural number, preferably 5.
Step (3) is specifically as follows. First, the potential representation X^(l+1) of all M nodes at the (l+1)-th GCN layer of the encoding network is obtained, where l ∈ [1, K]:

X^(l+1) = f^l(Â X^l W^l),

wherein X^l denotes the potential representations of all M nodes at the l-th GCN layer of the encoding network, X^1 = X, W^l denotes the weight of the l-th GCN layer of the encoding network, f^l(·) denotes the nonlinear activation function of the l-th GCN layer of the encoding network, and Â denotes the normalized version of the relationship matrix A, calculated as:

Â = D^(-1/2) Ã D^(-1/2), with Ã = A + I_C,

wherein I_C is an identity matrix and D is a diagonal matrix satisfying D_ii = Σ_j Ã_ij.
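The normalization of the relationship matrix can be sketched directly; this follows the standard GCN form Â = D^(-1/2)(A + I)D^(-1/2), where D is the degree matrix of A + I:

```python
import torch

def normalize_adjacency(a: torch.Tensor) -> torch.Tensor:
    """a: (M, M) relationship matrix -> normalized A_hat."""
    a_tilde = a + torch.eye(a.size(0))     # A~ = A + I_C
    d = a_tilde.sum(dim=1)                 # degree D_ii = sum_j A~_ij
    d_inv_sqrt = torch.diag(d.pow(-0.5))   # D^(-1/2)
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

a = torch.tensor([[0., 1.], [1., 0.]])
a_hat = normalize_adjacency(a)
print(a_hat)  # tensor([[0.5000, 0.5000], [0.5000, 0.5000]])
```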
Next, in order for the input of the encoding network to approximate the output of the decoding network (which also uses K GCN layers), so as to obtain high-quality g-dimensional characterization information g(x_k), the encoding network is inverted and the output X̂^(l+1) of the (l+1)-th GCN layer of the decoding network is generated in the reconstruction space:

X̂^(l+1) = f̂^l(Â X̂^l Ŵ^l),

wherein X̂^l denotes the decoding result of the l-th GCN layer of the decoding network, Ŵ^l denotes the weight of the l-th GCN layer of the decoding network, and f̂^l(·) denotes the nonlinear activation function of the l-th GCN layer of the decoding network.
Next, the loss function value L_1 is computed from the final output X̂ of the decoding network and the set X:

L_1 = ||X − X̂||_2^2,

wherein ||·||_2 denotes the two-norm. The above process is repeated until the loss function value L_1 reaches its minimum, yielding the trained AE; the output of the K-th GCN layer of the encoding network of the AE at that moment is taken as the g-dimensional characterization information g(x_k) of each node x_k.
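A minimal sketch of the GCN-based self-encoder described above. The layer widths, the use of 2 GCN layers per side (the patent prefers K = 5), and the ReLU activations are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: f(A_hat @ X @ W)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.w = nn.Linear(d_in, d_out, bias=False)   # weight W^l

    def forward(self, a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.w(a_hat @ x))

class GCNAutoEncoder(nn.Module):
    def __init__(self, d: int, g: int):
        super().__init__()
        self.enc1, self.enc2 = GCNLayer(d, 64), GCNLayer(64, g)
        self.dec1, self.dec2 = GCNLayer(g, 64), GCNLayer(64, d)

    def forward(self, a_hat, x):
        z = self.enc2(a_hat, self.enc1(a_hat, x))      # g-dim characterizations g(x_k)
        x_rec = self.dec2(a_hat, self.dec1(a_hat, z))  # reconstruction X_hat
        return z, x_rec

m, d, g = 5, 32, 16
x = torch.randn(m, d)
a_hat = torch.eye(m)                    # placeholder normalized adjacency
ae = GCNAutoEncoder(d, g)
z, x_rec = ae(a_hat, x)
loss_l1 = (x - x_rec).pow(2).sum()      # reconstruction loss L_1
print(z.shape, x_rec.shape)  # torch.Size([5, 16]) torch.Size([5, 32])
```

Training would minimize `loss_l1` over the M nodes until convergence, then keep the encoder output `z` as the node characterizations.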
The benefit of steps (2) and (3) is that each image is first regarded as a node and the nodes are organized into a graph structure; the GCN then learns the association relationships among different nodes and integrates them into the characterization information of the nodes, solving the technical problem that existing unsupervised hash retrieval methods learn the similarity between images insufficiently, limiting hash code quality and degrading image retrieval.
(4) Use the node characterization information g(x_k) obtained in step (3) as label information to iteratively train the ResNet-18 network until it converges, obtaining the g-dimensional semantic hash code b_k corresponding to each node x_k.
Step (4) is specifically as follows. First, the g-dimensional characterization information g(x_k) of node x_k obtained in step (3) is input into the ResNet-18 network; at the last fully connected (FC) layer of the ResNet-18 network, the mean square error function

L_fc = ||fc(x_k) − g(x_k)||_2^2

is used so that the output result fc(x_k) of the fully connected layer approximates the characterization information g(x_k) of node x_k. The final activation layer (Full connection hash, FCH) of the ResNet-18 network then activates the output result fc(x_k), giving the activation result fch(x_k) = Tanh(fc(x_k)), wherein Tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)). Then, the loss function value L_2 of the ResNet-18 network is calculated from the mean square error function L_fc and the binary cross entropy loss function

L_fch = −Σ_{i=1}^{g} [ b_k^i · log(fch^i(x_k)) + (1 − b_k^i) · log(1 − fch^i(x_k)) ],

wherein the coefficient b_k = 0.5 × (sign(fch(x_k)) + 1), sign(·) is the sign function, b_k^i denotes the i-th bit of b_k, and fch^i(x_k) denotes the i-th bit of fch(x_k). The above process is iterated until the loss function value L_2 reaches its minimum, yielding the trained ResNet-18 network; at that moment the binary cross entropy loss function L_fch is used to binarize the activation result fch(x_k), generating the g-dimensional semantic hash code of each node x_k ∈ X, i.e., the g-dimensional semantic hash code corresponding to each image in step (1). The loss function value L_2 of the ResNet-18 network equals:

L_2 = μ·L_fc + ω·L_fch,

where μ and ω are hyper-parameters satisfying μ + ω = 1.
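The step (4) objective can be sketched as follows. Shifting the Tanh output into (0, 1) before taking logarithms in the binary cross entropy term is an assumption made so the loss is well defined, and the equal weights μ = ω = 0.5 are illustrative:

```python
import torch

def hash_losses(fc_out: torch.Tensor, g_rep: torch.Tensor,
                mu: float = 0.5, omega: float = 0.5) -> torch.Tensor:
    """fc_out: (M, g) FC-layer outputs fc(x_k); g_rep: (M, g) targets g(x_k)."""
    l_fc = (fc_out - g_rep).pow(2).mean()          # mean square error L_fc
    fch = torch.tanh(fc_out)                       # activation fch(x_k)
    b = 0.5 * (torch.sign(fch) + 1.0)              # binarized codes b_k
    p = ((fch + 1.0) / 2.0).clamp(1e-6, 1 - 1e-6)  # shift into (0, 1) for the logs
    l_fch = -(b * p.log() + (1 - b) * (1 - p).log()).mean()
    return mu * l_fc + omega * l_fch               # L_2 = mu*L_fc + omega*L_fch

fc_out = torch.randn(4, 16)
g_rep = torch.randn(4, 16)
loss = hash_losses(fc_out, g_rep)
print(loss.dim())  # 0 (scalar loss)
```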
The advantage of this step is that node characterization information is used as label information to guide the training of the hash function; since the ResNet-18 network is lightweight, the semantic similarity among hash codes is guaranteed to maintain the correlation among nodes, and high-quality hash codes are generated efficiently.
As shown in fig. 3, the present invention further provides an image retrieval method, which includes the following steps:
(1) Obtaining images to be retrieved from a user, and obtaining each contrast image from a database;
(2) Generate the g-dimensional semantic hash codes corresponding to the image to be retrieved and to each contrast image respectively, using the unsupervised deep hash model obtained by training with the above training method based on node characterization;
(3) Perform an exclusive OR operation between the g-dimensional semantic hash code of the image to be retrieved and the g-dimensional semantic hash code of each contrast image, so as to calculate the Hamming distance between the two codes;
(4) From all the Hamming distances obtained in step (3), select the contrast images whose Hamming distance falls within the specified Hamming radius and output them as the final image retrieval result.
Specifically, the specified hamming radius range is {1,2,3}, preferably 2.
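Steps (3) and (4) of the retrieval method can be sketched as an XOR-and-count over binary codes:

```python
import numpy as np

def retrieve(query: np.ndarray, db: np.ndarray, radius: int = 2) -> np.ndarray:
    """query: (g,) 0/1 code; db: (n, g) 0/1 codes -> indices within the radius."""
    dists = np.logical_xor(query, db).sum(axis=1)  # Hamming distance via XOR
    return np.flatnonzero(dists <= radius)         # keep images within the radius

query = np.array([1, 0, 1, 1, 0, 0, 1, 0])
db = np.array([[1, 0, 1, 1, 0, 0, 1, 0],   # distance 0
               [1, 0, 1, 0, 0, 0, 1, 1],   # distance 2
               [0, 1, 0, 0, 1, 1, 0, 1]])  # distance 8
result = retrieve(query, db)
print(result)  # [0 1]
```

With the preferred radius of 2, only the first two database codes are returned.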
In summary, the invention combines GCN and AutoEncoder to generate potential characterization information of images in an unsupervised mode, uses lightweight ResNet-18 network to fit characterization information of nodes and generate semantic hash codes of the images, thereby improving unsupervised hash image retrieval performance and further solving the problem that the existing scheme cannot fully learn the similarity between images.
Experimental results
The experimental environment of the invention: the CPU is 56 Intel Xeon(R) cores @ 2.4 GHz, the GPU is 8 NVIDIA Tesla P40 cards, the memory is 256 GB DDR4, and the hard disk capacity is 6 TB; the algorithm is implemented in PyTorch under a Linux operating system. The specific parameter settings are as follows: the batch size is 32 and the learning rate is 0.001.
To demonstrate the effectiveness of the GCN in the present invention and the superiority of its retrieval results, related tests were conducted on the CIFAR-10 and MS-COCO datasets. In the node characterization learning stage, different feature extraction networks (VGG and ResNet-101) were used, and the retrieval performance mAP of the model was recorded both with and without the GCN. "−" indicates that the GCN was not used, "+" indicates that it was used, and the test results are given in Table 1.
TABLE 1 mAP values with/without GCN according to the invention
As can be seen from the results in Table 1, the invention achieves the highest mAP value using the combination of ResNet-101 and GCN. Good results are also obtained with the combination of VGG and GCN. However, without the GCN, neither VGG nor ResNet-101 alone achieves satisfactory results. These results indicate that the GCN plays a critical role in improving hashing performance. It is further observed that, under the same conditions, ResNet-101 performs better than VGG, showing that a network with stronger feature extraction capability is more helpful in improving hash performance.
FIG. 4 shows the retrieval results on the CIFAR-10 dataset using a 48-bit hash code length with a Hamming radius of 2. Four images were randomly selected as query images: an automobile, a bird, a horse, and a boat. The returned retrieval results still contain automobiles, birds, horses, and boats, which reflects that the invention indeed learns the semantic similarity between images and achieves a good retrieval effect.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (6)
1. A training method of an unsupervised deep hash model based on node characterization, the unsupervised deep hash model comprising a ResNet-101 network, a GCN-based self-encoder AE, and a ResNet-18 network connected in sequence, the training method comprising the steps of:
(1) Acquire a training set formed by N images, and input the training set into the ResNet-101 network to extract the feature vector of each image from the pooling layer of the network; the feature vectors of all images form a vector set FV:
FV = {v_1, v_2, …, v_N},
wherein N is a positive integer, v_q denotes the q-th feature vector in the vector set FV, and q ∈ [1, N];
(2) Select the feature vectors of M images from the vector set FV obtained in step (1) to form M nodes, and construct a graph structure G = (X, E) from the M nodes, wherein X denotes the set of the M nodes with X = [x_1, x_2, …, x_M], and E denotes the set of edges, an edge e_ij ∈ E joining any two nodes x_i and x_j of the M nodes, where i and j are positive integers and i, j ∈ [1, M];
(3) Acquire the relationship matrix A of the M nodes from the graph structure G constructed in step (2), and iteratively train the GCN-based AE with the relationship matrix A and the M nodes until the AE converges, obtaining for each node x_k its g-dimensional characterization information g(x_k), wherein k ∈ [1, M] and g ∈ {16, 32, 48, 64, 96, 128}; step (3) is specifically:
first, the potential representation X^(l+1) of all M nodes at the (l+1)-th GCN layer of the encoding network is obtained, where l ∈ [1, K]:
X^(l+1) = f^l(Â X^l W^l),
wherein X^l denotes the potential representations of all M nodes at the l-th GCN layer of the encoding network, X^1 = X, W^l denotes the weight of the l-th GCN layer of the encoding network, f^l(·) denotes the nonlinear activation function of the l-th GCN layer of the encoding network, and Â denotes the normalized version of the relationship matrix A;
next, the output X̂^(l+1) of the (l+1)-th GCN layer of the decoding network is generated in the reconstruction space:
X̂^(l+1) = f̂^l(Â X̂^l Ŵ^l),
wherein X̂^l denotes the decoding result of the l-th GCN layer of the decoding network, Ŵ^l denotes the weight of the l-th GCN layer of the decoding network, and f̂^l(·) denotes the nonlinear activation function of the l-th GCN layer of the decoding network;
next, the loss function value L_1 is computed from the final output X̂ of the decoding network and the set X; the above process is repeated until the loss function value L_1 reaches its minimum, yielding the trained AE, and the output of the K-th GCN layer of the encoding network of the AE at that moment is taken as the g-dimensional characterization information g(x_k) of each node x_k;
(4) Use the node characterization information g(x_k) obtained in step (3) as label information to iteratively train the ResNet-18 network until it converges, obtaining the g-dimensional semantic hash code b_k corresponding to each node x_k; step (4) is specifically:
first, the g-dimensional characterization information g(x_k) of node x_k obtained in step (3) is input into the ResNet-18 network; at the last fully connected layer of the ResNet-18 network, the mean square error function L_fc is used so that the output result fc(x_k) approximates the characterization information g(x_k) of node x_k; the mean square error function is
L_fc = ||fc(x_k) − g(x_k)||_2^2,
and the binary cross entropy loss function is:
L_fch = −Σ_{i=1}^{g} [ b_k^i · log(fch^i(x_k)) + (1 − b_k^i) · log(1 − fch^i(x_k)) ],
wherein the coefficient b_k = 0.5 × (sign(fch(x_k)) + 1), sign(·) is the sign function, b_k^i denotes the i-th bit of b_k, and fch^i(x_k) denotes the i-th bit of fch(x_k);
the last activation layer of the ResNet-18 network then activates the output result fc(x_k), giving the activation result fch(x_k) = Tanh(fc(x_k)), wherein Tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x));
then, the loss function value L_2 of the ResNet-18 network is calculated from the mean square error function and the binary cross entropy loss function;
the above process is then repeated until the loss function value L_2 reaches its minimum, yielding the trained ResNet-18 network; at that moment the binary cross entropy loss function L_fch is used to binarize the activation result fch(x_k), generating the g-dimensional semantic hash code of each node x_k ∈ X, i.e., the g-dimensional semantic hash code corresponding to each image in step (1).
2. Training method according to claim 1, characterized in that the coding network comprises a K-layer GCN for updating node characteristics, the output of the coding network being node x k G-dimensional characterization information g (X) k ) Wherein K is any natural number.
3. The training method according to claim 2, wherein the normalized version Â of the relationship matrix A equals Â = D̃^(−1/2) Ã D̃^(−1/2), where Ã = A + I_C, I_C is an identity matrix, and D̃ is a diagonal matrix satisfying D̃_ii = Σ_j Ã_ij.
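The normalization described in this claim can be sketched as follows, assuming the standard GCN form in which self-loops are added via I_C and the degree matrix is taken from Ã:

```python
import numpy as np

def normalize_adjacency(A):
    # A_tilde = A + I_C adds self-loops; D_tilde is the diagonal degree
    # matrix of A_tilde; the result is D_tilde^{-1/2} A_tilde D_tilde^{-1/2}
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```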
The loss function value L_1 of the coding network is calculated using the two-norm ‖·‖₂.
4. The training method according to claim 3, wherein the loss function value L_2 of the ResNet-18 network equals:
L_2 = μ·L_fc + ω·L_fch,
where μ and ω are hyperparameters satisfying μ + ω = 1.
5. An image retrieval method, characterized by comprising the steps of:
(1) Obtaining an image to be retrieved from a user, and obtaining each comparison image from a database;
(2) Generating, using the unsupervised deep hash model obtained by training with the node-characterization-based unsupervised deep hash model training method according to any one of claims 1 to 4, the g-dimensional semantic hash codes respectively corresponding to the image to be retrieved and to each comparison image;
(3) Performing an exclusive-OR operation between the g-dimensional semantic hash code corresponding to the image to be retrieved and the g-dimensional semantic hash code corresponding to each comparison image, so as to calculate the Hamming distance between them;
(4) From all the Hamming distances obtained in step (3), selecting the comparison images whose Hamming distance lies within the specified Hamming radius, and outputting them as the final image retrieval result.
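Steps (3) and (4) above can be sketched as follows, representing each g-dimensional semantic hash code as a list of bits; the exclusive-OR reduces to a bit-wise inequality count, and the radius filter keeps only comparison images within the specified Hamming radius:

```python
def hamming_distance(code_a, code_b):
    # XOR the two g-dimensional codes bit-wise and count the differing bits
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, db_codes, radius=3):
    # keep the indices of comparison images whose Hamming distance to the
    # query falls within the specified Hamming radius
    return [i for i, code in enumerate(db_codes)
            if hamming_distance(query_code, code) <= radius]
```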
6. The image retrieval method according to claim 5, wherein the specified Hamming radius range is {1, 2, 3}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011100159.2A CN112199533B (en) | 2020-10-15 | 2020-10-15 | Unsupervised deep hash model training method and image retrieval method based on node characterization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112199533A CN112199533A (en) | 2021-01-08 |
CN112199533B true CN112199533B (en) | 2024-02-06 |
Family
ID=74009069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011100159.2A Active CN112199533B (en) | 2020-10-15 | 2020-10-15 | Unsupervised deep hash model training method and image retrieval method based on node characterization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112199533B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113537495A (en) * | 2021-08-05 | 2021-10-22 | 南方电网数字电网研究院有限公司 | Model training system, method and device based on federal learning and computer equipment |
CN115017366B (en) * | 2022-07-11 | 2024-04-02 | 中国科学技术大学 | Unsupervised video hash retrieval method based on multi-granularity contextualization and multi-structure preservation |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111078911A (en) * | 2019-12-13 | 2020-04-28 | 宁波大学 | Unsupervised hashing method based on self-encoder |
CN111753190A (en) * | 2020-05-29 | 2020-10-09 | 中山大学 | Meta learning-based unsupervised cross-modal Hash retrieval method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110837836B (en) | Semi-supervised semantic segmentation method based on maximized confidence | |
US8676725B1 (en) | Method and system for entropy-based semantic hashing | |
CN108920720A (en) | The large-scale image search method accelerated based on depth Hash and GPU | |
CN112199533B (en) | Unsupervised deep hash model training method and image retrieval method based on node characterization | |
CN112925962B (en) | Hash coding-based cross-modal data retrieval method, system, device and medium | |
CN112732864B (en) | Document retrieval method based on dense pseudo query vector representation | |
CN109446414B (en) | Software information site rapid label recommendation method based on neural network classification | |
CN106033426A (en) | A latent semantic min-Hash-based image retrieval method | |
CN111125469B (en) | User clustering method and device of social network and computer equipment | |
CN110941734A (en) | Depth unsupervised image retrieval method based on sparse graph structure | |
CN107180079B (en) | Image retrieval method based on convolutional neural network and tree and hash combined index | |
CN112861976B (en) | Sensitive image identification method based on twin graph convolution hash network | |
CN113312505A (en) | Cross-modal retrieval method and system based on discrete online hash learning | |
CN112417289A (en) | Information intelligent recommendation method based on deep clustering | |
CN110598022A (en) | Image retrieval system and method based on robust deep hash network | |
CN114328988A (en) | Multimedia data feature extraction method, multimedia data retrieval method and device | |
CN108805280B (en) | Image retrieval method and device | |
CN116049467A (en) | Non-supervision image retrieval method and system based on label visual joint perception | |
CN107451617B (en) | Graph transduction semi-supervised classification method | |
CN114996493A (en) | Electric power scene image data screening method based on data elimination and redundancy elimination | |
Zhang et al. | Semantic hierarchy preserving deep hashing for large-scale image retrieval | |
CN113298892A (en) | Image coding method and device, and storage medium | |
CN110659375A (en) | Hash model training method, similar object retrieval method and device | |
CN110674333A (en) | Large-scale image high-speed retrieval method based on multi-view enhanced depth hashing | |
CN116433922A (en) | Training method of graph representation learning system, graph data node classification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||