CN112559810B - Method and device for generating hash code by utilizing multi-layer feature fusion - Google Patents


Info

Publication number
CN112559810B
CN112559810B
Authority
CN
China
Prior art keywords
layer
text
loss function
module
image
Prior art date
Legal status
Active
Application number
CN202011533344.0A
Other languages
Chinese (zh)
Other versions
CN112559810A (en)
Inventor
马然
余海波
苏敏
安平
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202011533344.0A priority Critical patent/CN112559810B/en
Publication of CN112559810A publication Critical patent/CN112559810A/en
Application granted granted Critical
Publication of CN112559810B publication Critical patent/CN112559810B/en

Classifications

    • G06F16/9014 Information retrieval; indexing; data structures therefor; storage structures: hash tables
    • G06F16/90335 Information retrieval; querying: query processing
    • G06N3/045 Neural networks; architecture: combinations of networks
    • G06N3/08 Neural networks: learning methods

Abstract

The invention discloses a method and a device for generating hash codes by utilizing multi-layer feature fusion, comprising the following steps: establishing a similarity matrix of the image-text pairs; obtaining the features of different layers from the outputs of different residual blocks, converting them into feature maps with consistent channel number and size, fusing them, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization; generating a corresponding multi-scale Bag-of-Words (BoW) model for each text by using a multi-scale fusion module, obtaining features of different scales through a convolutional layer, fusing them, and finally obtaining the hash codes corresponding to the texts through a fully-connected layer; designing a loss function; training the model; and inputting a sample into the trained model to obtain the corresponding hash code. With the method and the device, the generated hash codes are more discriminative, and the mean average precision of retrieval can be effectively improved in cross-modal retrieval.

Description

Method and device for generating hash code by utilizing multi-layer feature fusion
Technical Field
The invention relates to the technical field of image retrieval, in particular to a method and a device for generating a hash code by utilizing multi-layer feature fusion.
Background
With the rapid development of networks, more and more data of different modalities, such as images and texts, appear on the Internet. Traditional single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has therefore been proposed. The goal of cross-modal retrieval is to use a query from one modality (e.g., an image) to find semantically similar instances in another modality (e.g., texts). However, measuring the similarity between data of different modalities is very challenging because of the heterogeneity between modalities and the semantic gap between low-level features and high-level semantics. A common way to bridge this gap is to map the data of the different modalities into a common subspace and then measure their similarity in that common space.
Hash-based cross-modal retrieval maps high-dimensional data of different modalities into a low-dimensional common Hamming space through a series of learned hash functions, and the learned hash codes preserve the semantic information of the original data. Compact hash codes have a much smaller storage cost than high-dimensional image features, and the Hamming distance between hash codes can be computed with bitwise XOR operations, which greatly increases retrieval speed. Traditional cross-modal hashing algorithms extract hand-crafted features from the data of each modality and then use the extracted features to generate the hash codes. Feature extraction and hash learning are thus two independent processes with no information feedback between them, so the extracted features cannot adapt well to hash learning; the performance of the model is limited by the expressive power of the hand-crafted features, and a semantic gap easily arises when such a retrieval system performs cross-modal retrieval.
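As a plain illustration of this bitwise computation (a sketch, not part of the patent), the Hamming distance between two hash codes packed into integers can be computed with XOR and a popcount:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary hash codes packed into integers."""
    # XOR leaves a 1 bit wherever the two codes differ; counting the 1 bits
    # gives the number of differing positions.
    return bin(code_a ^ code_b).count("1")

# Example: the 8-bit codes 10110100 and 10011100 differ in 3 bit positions.
assert hamming_distance(0b10110100, 0b10011100) == 3
```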
In recent years, deep convolutional neural networks (DCNNs) have exhibited strong feature extraction capability in fields such as image recognition and object detection, and some works have therefore combined DCNNs with hashing to propose cross-modal hash retrieval algorithms based on deep learning. One representative work is the Deep Cross-Modal Hashing (DCMH) method proposed in Jiang, Qing-Yuan, et al., "Deep Cross-Modal Hashing," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, which extracts features of images and texts with DCNNs, maps the features into a common Hamming space, and integrates feature extraction and hash learning into an end-to-end trainable framework through a designed loss function, greatly improving performance over traditional algorithms. Another is the Self-Supervised Adversarial Hashing (SSAH) method of Li, Chao, et al., "Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. The Deep Visual-Semantic Hashing (DVSH) method of Cao, Yue, et al., "Deep Visual-Semantic Hashing for Cross-Modal Retrieval," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, introduces a Long Short-Term Memory network (LSTM) to capture the intrinsic association between images and texts. However, these deep-learning-based methods represent the data of each modality with features extracted from a single layer of the network, usually a high layer, such as the output of the fc8 layer of the VGG network of Simonyan, Karen, and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition." Such high-layer features discard the low-layer spatial and detail information of the data, which limits the discriminability of the generated hash codes.
There is therefore an urgent need for hash codes that are more discriminative and yield higher cross-modal retrieval accuracy.
Disclosure of Invention
To address the problems in the prior art, the invention provides a method and a device for generating a hash code by utilizing multi-layer feature fusion. Multi-layer feature fusion yields more robust features, the generated hash codes are more discriminative, and the mean average precision of retrieval can be effectively improved in cross-modal retrieval.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention provides a method for generating a hash code by utilizing multilayer feature fusion, which comprises the following steps:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
s12: designing an image network model;
obtaining the characteristics of different layers through the output of different residual blocks, converting the characteristics of the different layers into characteristic graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to the images through global pooling, full connection and discretization;
s13: designing a text network model;
generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer;
s14: designing a loss function by using the similarity matrix S in the S11;
s15: training a model;
for the image network, randomly selecting two images and sending the two images into the image network, utilizing the loss function in the S14 to carry out constraint, using SGD to train the images, and simultaneously fixing text network parameters;
for the text network, randomly selecting two texts to be sent into the text network, constraining by using the loss function in the S14, training the texts by using the SGD, and fixing image network parameters;
s16: obtaining a hash code;
and inputting the sample into the model obtained by the training of the S15 to obtain a corresponding hash code.
Preferably, the S12 further includes:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers, convolutional layers are used to make their channel numbers consistent with that of the (N-1)-th layer, and pooling layers are then used to downsample them so that their feature map sizes are consistent with that of the (N-1)-th layer;
s123: for the N-th layer, a convolutional layer is used to make its channel number consistent with that of the (N-1)-th layer, and a deconvolution layer is then used to upsample it so that its feature map size is consistent with that of the (N-1)-th layer;
s124: and adding and fusing the features of the N different layers processed in the S122 and S123, and then obtaining the hash code corresponding to the image through the global pooling layer and the full-connection layer and discretizing.
Preferably, the S13 further includes:
s131: the input vector is regarded as a feature vector with length N and width 1, where N is the number of selected words in the data set; after multi-scale fusion it can be regarded as a feature map with length N, width R and C channels, where R is the number of different scales adopted, and the width dimension corresponds to semantic information at the different scales;
s132: sending the feature map obtained in S131 into a residual block to obtain a feature map with length 1, width R and C channels, thereby obtaining the global information of the text;
s133: fusing the R feature vectors corresponding to different semantic information obtained in S132 by addition, and then obtaining the hash code corresponding to the text through a fully-connected layer and discretization.
Preferably, the S14 further includes:
the loss function includes: inter-modality similarity loss function, intra-modality similarity loss function, quantization loss function in the hash process.
Preferably, the S14 further comprises:
s141: the inter-modal similarity loss function:
the similarity between F and G is measured using a pair-wise likelihood function:
Figure BDA0002852577090000051
wherein the content of the first and second substances,
Figure BDA0002852577090000052
the inter-modal similarity loss function is then:
Figure BDA0002852577090000053
s142: the intra-modal similarity loss function:
for the image modality, the intra-class similarity loss function is:

$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: the quantization loss function in the hash process is:
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
preferably, obtaining the hash code is further represented by:
$$b_{x}=\mathrm{sign}\bigl(F_{I}(x_{i};\theta_{x})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{T}(t_{j};\theta_{y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
the invention also provides a device for generating the hash code by utilizing the multilayer feature fusion, which comprises the following steps: the image-text pair similarity matrix building module, the image network model designing module, the text network model designing module, the loss function designing module, the model training module and the hash code obtaining module are used for building a similarity matrix of image-text pairs; wherein the content of the first and second substances,
the similarity matrix establishing module of the image-text pairs is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers through the output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining features of different scales through the convolution layer and fusing the features, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using SGD to train the loss function and fixing text network parameters at the same time; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in the loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
the hash code obtaining module is used for inputting the samples into the model obtained by the training of the model training module to obtain the corresponding hash codes.
Preferably, the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then carrying out down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code acquisition module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjustment module and the Nth layer feature map adjustment module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
Preferably, the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing an input vector into a feature map with the length of N, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in a data set, and R is the number of adopted different scales;
the residual block is used for converting the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain the global information of the text;
and the hash code acquisition module corresponding to the text is used for fusing the R feature vectors corresponding to different semantic information and obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
Preferably, the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
compared with the prior art, the invention has the following advantages:
(1) according to the method and the device for generating the hash code by utilizing multi-layer feature fusion provided by the invention, a more robust feature is obtained for each instance by fusing the features of different layers; this feature contains both high-layer semantic information and low-layer spatial information, making full use of the detail information, the abstract information and the expression capability of the different layers, so that a more discriminative hash code is generated for each instance;
(2) according to the method and the device for generating the hash code by utilizing the multi-layer feature fusion, the features of different layers of the image and the text are extracted, so that the hash code which is generated by fusing the features of different layers and used for cross-modal retrieval is generated, and the average accuracy of retrieval can be effectively improved when the hash code is used for cross-modal retrieval;
(3) according to the method and the device for generating the hash code by utilizing multi-layer feature fusion provided by the invention, through three loss functions (an inter-modal similarity loss function, an intra-modal similarity loss function and a quantization loss function in the hash process), the inter-modal similarity and the intra-modal semantic similarity are preserved simultaneously, the information loss caused by discretization is reduced, and the obtained hash codes achieve higher retrieval accuracy.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flowchart of a method for generating a hash code using multi-layer feature fusion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of image feature fusion according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of text feature fusion according to an embodiment of the present invention;
FIG. 4a is a diagram of an input image-text pair according to an embodiment of the invention;
FIG. 4b is a diagram illustrating the hash result corresponding to FIG. 4a;
FIG. 4c is a diagram illustrating an input image according to an embodiment of the present invention;
FIG. 4d is a diagram illustrating the hash result corresponding to FIG. 4c;
FIG. 4e is a diagram illustrating an input text according to an embodiment of the present invention;
FIG. 4f is a diagram illustrating the hash result corresponding to FIG. 4e.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
Fig. 1 is a flowchart illustrating a method for generating a hash code by using multi-layer feature fusion according to an embodiment of the present invention.
Referring to fig. 1, the method for generating a hash code by using multi-layer feature fusion of the present embodiment includes:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij of S represents the similarity between the i-th image and the j-th text: if the image and the text are similar, S_ij is 1, otherwise it is 0. Since every instance in the multi-label data set of this embodiment belongs to at least one semantic label, an image and a text are defined to be similar if they share at least one label, and dissimilar otherwise.
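A minimal NumPy sketch of this labeling rule, assuming each instance's labels are given as a binary multi-hot vector (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def build_similarity_matrix(image_labels: np.ndarray,
                            text_labels: np.ndarray) -> np.ndarray:
    """S[i, j] = 1 if the i-th image and the j-th text share a label, else 0.

    image_labels: (n_images, n_classes) binary multi-hot matrix.
    text_labels:  (n_texts, n_classes) binary multi-hot matrix.
    """
    shared = image_labels @ text_labels.T   # counts of shared labels
    return (shared > 0).astype(np.float32)  # "at least one label" rule
```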
S12: designing an image network model;
the method comprises the steps of obtaining features of different layers through output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to images through global pooling, full connection and discretization. In the embodiment, ResNet34 is used as a basic network to extract image features, and ResNet34 takes 4 residual blocks as an example to obtain four-layer features F1, F2, F3 and F4.
S13: designing a text network model;
and generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer.
S14: designing a loss function by using the similarity matrix in the S11;
s15: training a model;
in this embodiment, an alternate learning strategy is adopted to learn the parameters in the network, that is, one of the parameters is updated, and the other parameters are controlled to be unchanged.
For an image network, randomly selecting two images and sending the two images into the image network, constraining by using a loss function in S14, training the images by using a Stochastic Gradient Descent optimization algorithm (SGD), and fixing text network parameters;
for the text network, two texts are randomly selected and sent into the text network, the loss function in S14 is utilized for constraint, SGD is used for training the texts, and meanwhile, image network parameters are fixed;
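The alternating scheme can be sketched as the following simplified PyTorch-style loop; `loss_fn` stands for the combined loss of S14, mini-batches replace the pairwise sampling described above, and both networks are assumed to return their continuous hash outputs:

```python
import torch

def train_alternating(img_net, txt_net, loader, loss_fn, epochs=100, lr=0.01):
    """Alternately update one network with SGD while the other is held fixed."""
    opt_img = torch.optim.SGD(img_net.parameters(), lr=lr)
    opt_txt = torch.optim.SGD(txt_net.parameters(), lr=lr)

    def set_frozen(net, frozen):
        for p in net.parameters():
            p.requires_grad_(not frozen)

    for _ in range(epochs):
        for images, texts, sim in loader:      # sim: a block of the matrix S
            # Update the image network; text network parameters fixed.
            set_frozen(txt_net, True)
            loss = loss_fn(img_net(images), txt_net(texts), sim)
            opt_img.zero_grad(); loss.backward(); opt_img.step()
            set_frozen(txt_net, False)
            # Update the text network; image network parameters fixed.
            set_frozen(img_net, True)
            loss = loss_fn(img_net(images), txt_net(texts), sim)
            opt_txt.zero_grad(); loss.backward(); opt_txt.step()
            set_frozen(img_net, False)
```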
s16: obtaining a hash code;
and inputting the samples into the model trained by the S15 to obtain corresponding hash codes.
In a preferred embodiment, as shown in fig. 2, S12 further includes:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers (here the first two layers F1 and F2), convolutional layers are used to make their channel numbers consistent with that of the (N-1)-th layer (F3), and pooling layers are then used to downsample them so that their feature map sizes are consistent with that of the (N-1)-th layer (F3);
s123: for the N-th layer (F4), a convolutional layer is used to make its channel number consistent with that of the (N-1)-th layer (F3), and a deconvolution layer is then used to upsample it so that its feature map size is consistent with that of the (N-1)-th layer (F3);
s124: the features of the N different layers processed in S122 and S123 are added and fused, and the hash code corresponding to the image is then obtained through the global pooling layer and the fully-connected layer followed by discretization.
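The following PyTorch sketch illustrates S121-S124 for this ResNet34 example. The 1x1 convolutions, average pooling, 64-bit code length and the 224x224 input that yields the 56/28/14/7 stage sizes are illustrative assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Fuse ResNet34 stage outputs F1..F4 to the shape of F3, then hash."""

    def __init__(self, hash_bits: int = 64):
        super().__init__()
        # 1x1 convolutions align the channel counts with F3 (256 channels).
        self.align1 = nn.Conv2d(64, 256, kernel_size=1)
        self.align2 = nn.Conv2d(128, 256, kernel_size=1)
        self.align4 = nn.Conv2d(512, 256, kernel_size=1)
        self.pool1 = nn.AvgPool2d(4)   # downsample F1: 56x56 -> 14x14
        self.pool2 = nn.AvgPool2d(2)   # downsample F2: 28x28 -> 14x14
        # Deconvolution upsamples F4: 7x7 -> 14x14.
        self.up4 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.fc = nn.Linear(256, hash_bits)

    def forward(self, f1, f2, f3, f4):
        fused = (self.pool1(self.align1(f1)) + self.pool2(self.align2(f2))
                 + f3 + self.up4(self.align4(f4)))   # additive fusion (S124)
        pooled = fused.mean(dim=(2, 3))              # global average pooling
        return self.fc(pooled)  # continuous code F; sign() applied at retrieval
```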
In a preferred embodiment, as shown in fig. 3, S13 further includes:
s131: the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in the data set, and after the input vector is subjected to multi-scale fusion, the input vector can be regarded as a feature map with the length of N, the width of R and the number of channels of C, wherein R is the number of adopted different scales;
s132: converting the feature map obtained in the step S131 into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain global information of the text;
s133: and fusing the R feature vectors obtained in the step S132 in an addition mode, and then obtaining the hash code corresponding to the text through a fully-connected layer and discretization.
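A rough PyTorch sketch of S131-S133 follows. The patent text does not fully specify the multi-scale construction, so here each scale is modeled, as an assumption, by average-pooling the BoW histogram with a different window, and a single convolution stands in for the residual block of S132:

```python
import torch
import torch.nn as nn

class TextMultiScaleHash(nn.Module):
    """Multi-scale BoW views -> conv features -> additively fused hash code."""

    def __init__(self, vocab_size: int, scales=(1, 2, 4),
                 channels: int = 512, hash_bits: int = 64):
        super().__init__()
        self.scales = scales  # R = len(scales) different scales
        # Collapses each (vocab_size x 1) view to a channels-dim vector,
        # standing in for the residual block of S132 (global information).
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(vocab_size, 1)),
            nn.ReLU(),
        )
        self.fc = nn.Linear(channels, hash_bits)

    def forward(self, bow):               # bow: (batch, vocab_size)
        n = bow.size(1)
        views = []
        for s in self.scales:
            # Smooth the histogram at scale s, keeping length n (assumption).
            v = nn.functional.avg_pool1d(bow.unsqueeze(1), kernel_size=s,
                                         stride=1, padding=s // 2)
            views.append(v[..., :n].squeeze(1))
        x = torch.stack(views, dim=-1).unsqueeze(1)  # (batch, 1, n, R)
        x = self.conv(x)                             # (batch, C, 1, R)
        x = x.squeeze(2).sum(dim=-1)                 # add-fuse the R vectors
        return self.fc(x)   # continuous code G; sign() applied at retrieval
```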
In the preferred embodiment, the loss function in S14 includes: an inter-modal (image and text) similarity loss function, an intra-modal (image and image, or text and text) similarity loss function, and a quantization loss in the hashing process. Using these loss functions makes the Hamming distance between semantically similar images and texts smaller and that between dissimilar images and texts larger. The hash codes corresponding to the images and the texts (B_x and B_y) are discrete values, and optimizing them directly would make the model untrainable with SGD. Therefore, the outputs F and G of the last fully-connected layers of the image and text networks are treated as continuous-valued hash codes and used in place of B_x and B_y during training, together with the other parameters of the networks. In the other stages, F and G are discretized to obtain B_x and B_y.
Further, S14 includes:
s141: inter-modal similarity loss function:
to preserve the similarity between modalities, the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
s142: the intra-modal similarity loss function:
to preserve semantic similarity within a modality, for an image modality, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: in order to reduce the information loss caused by discretization, the quantization loss function in the hash process is:

$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
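Under the reconstruction above, the combined loss can be sketched in PyTorch as follows; the numerically stable softplus(x) replaces log(1 + e^x), and the weights alpha and beta are assumed trade-off hyper-parameters:

```python
import torch

def neg_log_likelihood(theta, S):
    """-sum( S_ij * theta_ij - log(1 + exp(theta_ij)) ), computed stably."""
    return -(S * theta - torch.nn.functional.softplus(theta)).sum()

def total_loss(F, G, Bx, By, S, alpha=1.0, beta=1.0):
    """F, G: (n, bits) continuous outputs; Bx, By: signed codes; S: (n, n)."""
    theta_fg = 0.5 * F @ G.t()   # inter-modal pairwise similarities
    theta_ff = 0.5 * F @ F.t()   # intra-modal similarities (images)
    theta_gg = 0.5 * G @ G.t()   # intra-modal similarities (texts)
    j1 = neg_log_likelihood(theta_fg, S)                  # J1
    j23 = (neg_log_likelihood(theta_ff, S)
           + neg_log_likelihood(theta_gg, S))             # J2 + J3
    j4 = (Bx - F).pow(2).sum() + (By - G).pow(2).sum()    # J4
    return j1 + alpha * j23 + beta * j4
```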
in the preferred embodiment, the input at S16 needs to contain both images and text with the same semantic meaning during training, and the corresponding hash code can be generated even if the sample contains only one of the modalities at S16, as shown in fig. 4a-4 f. If using FI(xi;θx) And FT(tj;θy) Representing an image network and a text network, respectively, then this step can be represented as:
bx=sign(FI(xi;θx)) (7)
by=sign(FT(tj;θy))
(8)
wherein b isxAnd byAre respectively BxAnd ByThe element in (1), sign (x), is a sign function, and the expression is as follows:
Figure BDA0002852577090000131
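A minimal sketch of equations (7) and (8); the network handle is assumed to return the continuous output of its last fully-connected layer:

```python
import torch

@torch.no_grad()
def generate_hash_codes(modality_net, batch):
    """b = sign(F(x; theta)) for either the image or the text network."""
    f = modality_net(batch)   # continuous outputs of the last FC layer
    b = torch.sign(f)         # elements mapped to {-1, 0, +1}
    b[b == 0] = 1             # sign(0) -> +1, matching the definition above
    return b
```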
table 1 shows the hash code of the present invention compared with the Average accuracy (MAP) of the prior art deep cross-modal hash (DCMH) and the prior art self-supervised cross-modal hash (SSAH) over three widely used data sets MIRFLICKR-25K, NUS-WIDE and IAPR TC-12. For convenience, the case of image query text is represented by I2T, and the case of text query image is represented by T2I. Using hamming sorting widely used in retrieval as an evaluation criterion, the hamming sorting calculates hamming distances of query objects and objects in the database from the generated hash codes and sorts them in a distance increasing manner. The Average accuracy (MAP) is widely used to measure the accuracy of hamming sorting, and a higher MAP indicates better model performance.
TABLE 1
[Table 1: MAP comparison of the proposed method with DCMH and SSAH on MIRFLICKR-25K, NUS-WIDE and IAPR TC-12 for the I2T and T2I tasks; the values are provided as an image in the original patent.]
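For reference, MAP under Hamming ranking can be computed as in the following NumPy sketch (an illustration of the evaluation protocol, not code from the patent):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP under Hamming ranking for codes with entries in {-1, +1}.

    query_codes: (q, bits); db_codes: (d, bits);
    relevance[i, j] = 1 if database item j is relevant to query i.
    """
    aps = []
    for q, rel in zip(query_codes, relevance):
        # For +/-1 codes, Hamming distance = (bits - inner product) / 2.
        dist = (db_codes.shape[1] - db_codes @ q) / 2
        order = np.argsort(dist, kind="stable")   # increasing distance
        rel_sorted = rel[order]
        hits = np.cumsum(rel_sorted)
        ranks = np.arange(1, len(rel_sorted) + 1)
        prec_at_hits = (hits / ranks)[rel_sorted > 0]
        if prec_at_hits.size:
            aps.append(prec_at_hits.mean())
    return float(np.mean(aps))
```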
As shown in Table 1, the MAP results of the method of the invention are significantly higher than those of the comparative methods on the MIRFLICKR-25K and NUS-WIDE data sets; on the IAPR TC-12 data set, the MAP results of the method are slightly higher than those of the comparative methods in the I2T case and only slightly lower in the T2I case.
In one embodiment, an apparatus for generating a hash code using multi-layer feature fusion is further provided, which comprises: an image-text pair similarity matrix establishing module, an image network model design module, a text network model design module, a loss function design module, a model training module and a hash code obtaining module; wherein,
the image-text pair similarity matrix establishing module is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers from the outputs of different residual blocks, converting the features of the different layers into feature maps with consistent channel number and size, then fusing them, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining characteristics of different scales through the convolution layer and fusing the characteristics, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using the SGD to train the loss function and fixing text network parameters; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in a loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
and the hash code obtaining module is used for inputting the sample into the model obtained by the training of the model training module to obtain the corresponding hash code.
In a preferred embodiment, the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then conducting down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code acquisition module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjustment module and the Nth layer feature map adjustment module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
In a preferred embodiment, the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing the input vector into a feature map with the length of N, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in the data set, and R is the number of adopted different scales;
the residual block is used for converting the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain the global information of the text;
and the hash code acquisition module corresponding to the text is used for fusing the R characteristic vectors obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
In a preferred embodiment, the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
the embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and not to limit the invention. Any modifications and variations within the scope of the description, which may occur to those skilled in the art, are intended to be within the scope of the invention.

Claims (6)

1. A method for generating a hash code by utilizing multi-layer feature fusion is characterized by comprising the following steps:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij of S represents the similarity between the i-th image and the j-th text; if the image and the text are similar, S_ij is 1, otherwise it is 0;
s12: designing an image network model;
obtaining the characteristics of different layers through the output of different residual blocks, converting the characteristics of the different layers into characteristic graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to the images through global pooling, full connection and discretization;
s13: designing a text network model;
generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer;
s14: designing a loss function by using the similarity matrix S in the S11;
s15: training a model;
for the image network, randomly selecting two images and sending the two images into the image network, utilizing the loss function in the S14 to carry out constraint, using SGD to train the images, and simultaneously fixing text network parameters;
for the text network, randomly selecting two texts to be sent into the text network, constraining by using the loss function in the S14, training the texts by using the SGD, and fixing image network parameters;
s16: obtaining a hash code;
inputting a sample into the model obtained by the training of the S15 to obtain a corresponding hash code;
the S14 further includes:
the loss function includes: an inter-modality similarity loss function, an intra-modality similarity loss function, and a quantization loss function in a hash process;
the S14 is further:
s141: the inter-modal similarity loss function:
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

then the inter-modal similarity loss function is:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
S142: the intra-modal similarity loss function:
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: the quantization loss function in the hash process is:
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters;
obtaining the hash code is further represented as:
$$b_{x}=\mathrm{sign}\bigl(F_{X}(x_{i};\theta_{X})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{Y}(y_{j};\theta_{Y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
2. the method for generating hash codes using multi-layer feature fusion according to claim 1, wherein said S12 further comprises:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers, the channel numbers of the layers are consistent with those of the (N-1) th layer by utilizing the convolution layer, and then the front N-2 layers are downsampled by utilizing the pooling layer, so that the sizes of the characteristic graphs of the layers are consistent with those of the (N-1) th layer;
s123: for the Nth layer, the number of channels of the layer is made to be consistent with that of the (N-1) th layer by using the convolution layer, then the layer is up-sampled by using the deconvolution layer, and the size of the characteristic diagram of the layer is made to be consistent with that of the (N-1) th layer;
s124: and adding and fusing the features of the N different layers processed in the S122 and S123, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretizing.
3. The method for generating hash codes using multi-layer feature fusion according to claim 1, wherein said S13 further comprises:
s131: the input vector is regarded as a feature vector with the length of L and the width of 1, wherein L is the number of selected words in a data set, and after the input vector is subjected to multi-scale fusion, the input vector can be regarded as a feature map with the length of L, the width of R and the number of channels of C, wherein R is the number of adopted different scales, and the dimension corresponds to semantic information under different scales;
s132: sending the feature map obtained in the step S131 into a residual block to obtain a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain global information of the text;
s133: and fusing the R feature vectors corresponding to different semantic information obtained in the step S132 in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
4. An apparatus for generating a hash code using multi-layer feature fusion, comprising: the image-text pair similarity matrix building module, the image network model designing module, the text network model designing module, the loss function designing module, the model training module and the hash code obtaining module are used for building a similarity matrix of image-text pairs; wherein the content of the first and second substances,
the similarity matrix establishing module of the image-text pairs is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij represents the similarity between the i-th image and the j-th text; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers through the output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining features of different scales through the convolution layer and fusing the features, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using SGD to train the loss function and fixing text network parameters at the same time; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in the loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
the hash code obtaining module is used for inputting a sample into a model obtained by training of the model training module to obtain a corresponding hash code;
the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters;
obtaining the hash code is further represented as:
$$b_{x}=\mathrm{sign}\bigl(F_{X}(x_{i};\theta_{X})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{Y}(y_{j};\theta_{Y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
5. The apparatus for generating hash codes using multi-layer feature fusion according to claim 4, wherein the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then carrying out down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code obtaining module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjusting module and the Nth layer feature map adjusting module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
6. The apparatus for generating hash codes using multi-layer feature fusion according to claim 4, wherein the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing an input vector into a feature map with the length of L, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of L and the width of 1, wherein L is the number of selected words in a data set, and R is the number of adopted different scales;
the residual block is used for acquiring text global information and changing the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C;
and the hash code acquisition module corresponding to the text is used for fusing the R feature vectors corresponding to different semantic information and obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
CN202011533344.0A 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion Active CN112559810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011533344.0A CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011533344.0A CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Publications (2)

Publication Number Publication Date
CN112559810A CN112559810A (en) 2021-03-26
CN112559810B true CN112559810B (en) 2022-04-08

Family

ID=75030845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011533344.0A Active CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Country Status (1)

Country Link
CN (1) CN112559810B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115410717A * 2022-09-15 2022-11-29 Beijing Jingdong Tuoxian Technology Co., Ltd. Model training method, data retrieval method, image data retrieval method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101472451B1 * 2010-11-04 2014-12-18 Electronics and Telecommunications Research Institute System and method for managing digital contents
CN104346440B * 2014-10-10 2017-06-23 Zhejiang University A cross-media hash indexing method based on neural networks
CN109271486B * 2018-09-19 2021-11-26 Jiujiang University Similarity-preserving cross-modal hash retrieval method
CN110059198B * 2019-04-08 2021-04-13 Zhejiang University Discrete hash retrieval method for cross-modal data based on similarity preservation
CN111753189A * 2020-05-29 2020-10-09 Sun Yat-sen University Common representation learning method for few-shot cross-modal hash retrieval

Also Published As

Publication number Publication date
CN112559810A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
JP6629942B2 (en) Hierarchical automatic document classification and metadata identification using machine learning and fuzzy matching
Arevalo et al. Gated multimodal units for information fusion
CN110059217B (en) Image text cross-media retrieval method for two-stage network
US8150170B2 (en) Statistical approach to large-scale image annotation
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN111126396B (en) Image recognition method, device, computer equipment and storage medium
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN111461174B (en) Multi-mode label recommendation model construction method and device based on multi-level attention mechanism
WO2021098585A1 (en) Image search based on combined local and global information
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN112163114B (en) Image retrieval method based on feature fusion
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN115017911A (en) Cross-modal processing for vision and language
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN114491115B (en) Multi-model fusion integrated image retrieval method based on deep hash
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN112559810B (en) Method and device for generating hash code by utilizing multi-layer feature fusion
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
CN115080699A (en) Cross-modal retrieval method based on modal specific adaptive scaling and attention network
Lakshmi An efficient telugu word image retrieval system using deep cluster
CN112784838A (en) Hamming OCR recognition method based on locality sensitive hashing network
Kabir et al. Content-Based Image Retrieval Using AutoEmbedder
CN116630726B (en) Multi-mode-based bird classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant