CN112559810B - Method and device for generating hash code by utilizing multi-layer feature fusion - Google Patents


Info

Publication number
CN112559810B
CN112559810B
Authority
CN
China
Prior art keywords
layer
text
loss function
module
image
Prior art date
Legal status
Active
Application number
CN202011533344.0A
Other languages
Chinese (zh)
Other versions
CN112559810A (en)
Inventor
马然
余海波
苏敏
安平
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202011533344.0A priority Critical patent/CN112559810B/en
Publication of CN112559810A publication Critical patent/CN112559810A/en
Application granted granted Critical
Publication of CN112559810B publication Critical patent/CN112559810B/en

Classifications

    • G06F16/9014 Information retrieval; indexing; data structures therefor; storage structures: hash tables
    • G06F16/90335 Information retrieval; querying: query processing
    • G06N3/045 Neural networks; architecture: combinations of networks
    • G06N3/08 Neural networks: learning methods

Abstract

The invention discloses a method and a device for generating hash codes by utilizing multi-layer feature fusion, comprising the following steps: establishing a similarity matrix of the image-text pairs; obtaining the features of different layers from the outputs of different residual blocks, converting them into feature maps with consistent channel number and size, fusing them, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization; generating a corresponding multi-scale Bag-of-Words (BoW) model for each text by using a multi-scale fusion module, obtaining features of different scales through a convolutional layer, fusing them, and finally obtaining the hash codes corresponding to the texts through a fully-connected layer; designing a loss function; training the model; and inputting a sample into the trained model to obtain the corresponding hash code. With the method and the device, the generated hash codes are more discriminative, and the mean average precision of retrieval can be effectively improved in cross-modal retrieval.

Description

Method and device for generating hash code by utilizing multi-layer feature fusion
Technical Field
The invention relates to the technical field of image retrieval, in particular to a method and a device for generating a hash code by utilizing multi-layer feature fusion.
Background
With the rapid development of networks, more and more data of different modalities, such as images and texts, appear on the Internet. Traditional single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has therefore been proposed. The goal of cross-modal retrieval is to use a query from one modality (e.g., an image) to find semantically similar instances in another modality (e.g., texts). However, measuring the similarity between data of different modalities is very challenging because of the heterogeneity between modalities and the semantic gap between low-level features and high-level semantics. A common way to bridge this gap is to map the data of the different modalities into a common subspace and then measure their similarity in that common space.
Hash-based cross-modal retrieval maps high-dimensional data of different modalities into a low-dimensional common Hamming space through a series of learned hash functions, and the learned hash codes preserve the semantic information of the original data. Compact hash codes have a much smaller storage cost than high-dimensional image features, and the Hamming distance between hash codes can be computed with bitwise XOR operations, which greatly increases retrieval speed. Traditional cross-modal hashing algorithms extract hand-crafted features from the data of each modality and then use the extracted features to generate the hash codes. Feature extraction and hash learning are thus two independent processes with no information feedback between them, so the extracted features cannot adapt well to hash learning; the performance of the model is limited by the expressive power of the hand-crafted features, and a semantic gap easily arises when such a retrieval system performs cross-modal retrieval.
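As a plain illustration of this bitwise computation (a sketch, not part of the patent), the Hamming distance between two hash codes packed into integers can be computed with XOR and a popcount:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary hash codes packed into integers."""
    # XOR leaves a 1 bit wherever the two codes differ; counting the 1 bits
    # gives the number of differing positions.
    return bin(code_a ^ code_b).count("1")

# Example: the 8-bit codes 10110100 and 10011100 differ in 3 bit positions.
assert hamming_distance(0b10110100, 0b10011100) == 3
```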
In recent years, deep convolutional neural networks (DCNNs) have exhibited strong feature extraction capability in fields such as image recognition and object detection, and some works have therefore combined DCNNs with hashing to propose cross-modal hash retrieval algorithms based on deep learning. One representative work is the Deep Cross-Modal Hashing (DCMH) method proposed in Jiang, Qing-Yuan, et al., "Deep Cross-Modal Hashing," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, which extracts features of images and texts with DCNNs, maps the features into a common Hamming space, and integrates feature extraction and hash learning into an end-to-end trainable framework through a designed loss function, greatly improving performance over traditional algorithms. Another is the Self-Supervised Adversarial Hashing (SSAH) method of Li, Chao, et al., "Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. The Deep Visual-Semantic Hashing (DVSH) method of Cao, Yue, et al., "Deep Visual-Semantic Hashing for Cross-Modal Retrieval," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, introduces a Long Short-Term Memory network (LSTM) to capture the intrinsic association between images and texts. However, these deep-learning-based methods represent the data of each modality with features extracted from a single layer of the network, usually a high layer, such as the output of the fc8 layer of the VGG network of Simonyan, Karen, and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition." Such high-layer features discard the low-layer spatial and detail information of the data, which limits the discriminability of the generated hash codes.
There is therefore an urgent need for hash codes that are more discriminative and yield higher cross-modal retrieval accuracy.
Disclosure of Invention
To address the problems in the prior art, the invention provides a method and a device for generating a hash code by utilizing multi-layer feature fusion. Multi-layer feature fusion yields more robust features, the generated hash codes are more discriminative, and the mean average precision of retrieval can be effectively improved in cross-modal retrieval.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention provides a method for generating a hash code by utilizing multilayer feature fusion, which comprises the following steps:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
s12: designing an image network model;
obtaining the characteristics of different layers through the output of different residual blocks, converting the characteristics of the different layers into characteristic graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to the images through global pooling, full connection and discretization;
s13: designing a text network model;
generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer;
s14: designing a loss function by using the similarity matrix S in the S11;
s15: training a model;
for the image network, randomly selecting two images and sending the two images into the image network, utilizing the loss function in the S14 to carry out constraint, using SGD to train the images, and simultaneously fixing text network parameters;
for the text network, randomly selecting two texts to be sent into the text network, constraining by using the loss function in the S14, training the texts by using the SGD, and fixing image network parameters;
s16: obtaining a hash code;
and inputting the sample into the model obtained by the training of the S15 to obtain a corresponding hash code.
Preferably, the S12 further includes:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers, convolutional layers are used to make their channel numbers consistent with that of the (N-1)-th layer, and pooling layers are then used to downsample them so that their feature map sizes are consistent with that of the (N-1)-th layer;
s123: for the N-th layer, a convolutional layer is used to make its channel number consistent with that of the (N-1)-th layer, and a deconvolution layer is then used to upsample it so that its feature map size is consistent with that of the (N-1)-th layer;
s124: and adding and fusing the features of the N different layers processed in the S122 and S123, and then obtaining the hash code corresponding to the image through the global pooling layer and the full-connection layer and discretizing.
Preferably, the S13 further includes:
s131: the input vector is regarded as a feature vector with length N and width 1, where N is the number of selected words in the data set; after multi-scale fusion it can be regarded as a feature map with length N, width R and C channels, where R is the number of different scales adopted, and the width dimension corresponds to semantic information at the different scales;
s132: sending the feature map obtained in S131 into a residual block to obtain a feature map with length 1, width R and C channels, thereby obtaining the global information of the text;
s133: fusing the R feature vectors corresponding to different semantic information obtained in S132 by addition, and then obtaining the hash code corresponding to the text through a fully-connected layer and discretization.
Preferably, the S14 further includes:
the loss function includes: inter-modality similarity loss function, intra-modality similarity loss function, quantization loss function in the hash process.
Preferably, the S14 further comprises:
s141: the inter-modal similarity loss function:
the similarity between F and G is measured using a pair-wise likelihood function:
Figure BDA0002852577090000051
wherein the content of the first and second substances,
Figure BDA0002852577090000052
the inter-modal similarity loss function is then:
Figure BDA0002852577090000053
s142: the intra-modal similarity loss function:
for the image modality, the intra-class similarity loss function is:

$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: the quantization loss function in the hash process is:
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
preferably, obtaining the hash code is further represented by:
$$b_{x}=\mathrm{sign}\bigl(F_{I}(x_{i};\theta_{x})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{T}(t_{j};\theta_{y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
the invention also provides a device for generating the hash code by utilizing the multilayer feature fusion, which comprises the following steps: the image-text pair similarity matrix building module, the image network model designing module, the text network model designing module, the loss function designing module, the model training module and the hash code obtaining module are used for building a similarity matrix of image-text pairs; wherein the content of the first and second substances,
the similarity matrix establishing module of the image-text pairs is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers through the output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining features of different scales through the convolution layer and fusing the features, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using SGD to train the loss function and fixing text network parameters at the same time; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in the loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
the hash code obtaining module is used for inputting the samples into the model obtained by the training of the model training module to obtain the corresponding hash codes.
Preferably, the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then carrying out down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code acquisition module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjustment module and the Nth layer feature map adjustment module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
Preferably, the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing an input vector into a feature map with the length of N, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in a data set, and R is the number of adopted different scales;
the residual block is used for converting the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain the global information of the text;
and the hash code acquisition module corresponding to the text is used for fusing the R feature vectors corresponding to different semantic information and obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
Preferably, the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
compared with the prior art, the invention has the following advantages:
(1) according to the method and the device for generating the hash code by utilizing multi-layer feature fusion provided by the invention, a more robust feature is obtained for each instance by fusing the features of different layers; this feature contains both high-layer semantic information and low-layer spatial information, making full use of the detail information, the abstract information and the expression capability of the different layers, so that a more discriminative hash code is generated for each instance;
(2) according to the method and the device for generating the hash code by utilizing the multi-layer feature fusion, the features of different layers of the image and the text are extracted, so that the hash code which is generated by fusing the features of different layers and used for cross-modal retrieval is generated, and the average accuracy of retrieval can be effectively improved when the hash code is used for cross-modal retrieval;
(3) according to the method and the device for generating the hash code by utilizing multi-layer feature fusion provided by the invention, through three loss functions (an inter-modal similarity loss function, an intra-modal similarity loss function and a quantization loss function in the hash process), the inter-modal similarity and the intra-modal semantic similarity are preserved simultaneously, the information loss caused by discretization is reduced, and the obtained hash codes achieve higher retrieval accuracy.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flowchart of a method for generating a hash code using multi-layer feature fusion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of image feature fusion according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of text feature fusion according to an embodiment of the present invention;
FIG. 4a is a diagram of an input image-text pair according to an embodiment of the invention;
FIG. 4b is a diagram illustrating the hash result corresponding to FIG. 4a;
FIG. 4c is a diagram illustrating an input image according to an embodiment of the present invention;
FIG. 4d is a diagram illustrating the hash result corresponding to FIG. 4c;
FIG. 4e is a diagram illustrating an input text according to an embodiment of the present invention;
FIG. 4f is a diagram illustrating the hash result corresponding to FIG. 4e.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
Fig. 1 is a flowchart illustrating a method for generating a hash code by using multi-layer feature fusion according to an embodiment of the present invention.
Referring to fig. 1, the method for generating a hash code by using multi-layer feature fusion of the present embodiment includes:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij of S represents the similarity between the i-th image and the j-th text: if the image and the text are similar, S_ij is 1, otherwise it is 0. Since every instance in the multi-label data set of this embodiment belongs to at least one semantic label, an image and a text are defined to be similar if they share at least one label, and dissimilar otherwise.
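A minimal NumPy sketch of this labeling rule, assuming each instance's labels are given as a binary multi-hot vector (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def build_similarity_matrix(image_labels: np.ndarray,
                            text_labels: np.ndarray) -> np.ndarray:
    """S[i, j] = 1 if the i-th image and the j-th text share a label, else 0.

    image_labels: (n_images, n_classes) binary multi-hot matrix.
    text_labels:  (n_texts, n_classes) binary multi-hot matrix.
    """
    shared = image_labels @ text_labels.T   # counts of shared labels
    return (shared > 0).astype(np.float32)  # "at least one label" rule
```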
S12: designing an image network model;
the method comprises the steps of obtaining features of different layers through output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to images through global pooling, full connection and discretization. In the embodiment, ResNet34 is used as a basic network to extract image features, and ResNet34 takes 4 residual blocks as an example to obtain four-layer features F1, F2, F3 and F4.
S13: designing a text network model;
and generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer.
S14: designing a loss function by using the similarity matrix in the S11;
s15: training a model;
in this embodiment, an alternate learning strategy is adopted to learn the parameters in the network, that is, one of the parameters is updated, and the other parameters are controlled to be unchanged.
For an image network, randomly selecting two images and sending the two images into the image network, constraining by using a loss function in S14, training the images by using a Stochastic Gradient Descent optimization algorithm (SGD), and fixing text network parameters;
for the text network, two texts are randomly selected and sent into the text network, the loss function in S14 is utilized for constraint, SGD is used for training the texts, and meanwhile, image network parameters are fixed;
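The alternating scheme can be sketched as the following simplified PyTorch-style loop; `loss_fn` stands for the combined loss of S14, mini-batches replace the pairwise sampling described above, and both networks are assumed to return their continuous hash outputs:

```python
import torch

def train_alternating(img_net, txt_net, loader, loss_fn, epochs=100, lr=0.01):
    """Alternately update one network with SGD while the other is held fixed."""
    opt_img = torch.optim.SGD(img_net.parameters(), lr=lr)
    opt_txt = torch.optim.SGD(txt_net.parameters(), lr=lr)

    def set_frozen(net, frozen):
        for p in net.parameters():
            p.requires_grad_(not frozen)

    for _ in range(epochs):
        for images, texts, sim in loader:      # sim: a block of the matrix S
            # Update the image network; text network parameters fixed.
            set_frozen(txt_net, True)
            loss = loss_fn(img_net(images), txt_net(texts), sim)
            opt_img.zero_grad(); loss.backward(); opt_img.step()
            set_frozen(txt_net, False)
            # Update the text network; image network parameters fixed.
            set_frozen(img_net, True)
            loss = loss_fn(img_net(images), txt_net(texts), sim)
            opt_txt.zero_grad(); loss.backward(); opt_txt.step()
            set_frozen(img_net, False)
```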
s16: obtaining a hash code;
and inputting the samples into the model trained by the S15 to obtain corresponding hash codes.
In a preferred embodiment, as shown in fig. 2, S12 further includes:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers (here the first two layers F1 and F2), convolutional layers are used to make their channel numbers consistent with that of the (N-1)-th layer (F3), and pooling layers are then used to downsample them so that their feature map sizes are consistent with that of the (N-1)-th layer (F3);
s123: for the N-th layer (F4), a convolutional layer is used to make its channel number consistent with that of the (N-1)-th layer (F3), and a deconvolution layer is then used to upsample it so that its feature map size is consistent with that of the (N-1)-th layer (F3);
s124: the features of the N different layers processed in S122 and S123 are added and fused, and the hash code corresponding to the image is then obtained through the global pooling layer and the fully-connected layer followed by discretization.
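The following PyTorch sketch illustrates S121-S124 for this ResNet34 example. The 1x1 convolutions, average pooling, 64-bit code length and the 224x224 input that yields the 56/28/14/7 stage sizes are illustrative assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Fuse ResNet34 stage outputs F1..F4 to the shape of F3, then hash."""

    def __init__(self, hash_bits: int = 64):
        super().__init__()
        # 1x1 convolutions align the channel counts with F3 (256 channels).
        self.align1 = nn.Conv2d(64, 256, kernel_size=1)
        self.align2 = nn.Conv2d(128, 256, kernel_size=1)
        self.align4 = nn.Conv2d(512, 256, kernel_size=1)
        self.pool1 = nn.AvgPool2d(4)   # downsample F1: 56x56 -> 14x14
        self.pool2 = nn.AvgPool2d(2)   # downsample F2: 28x28 -> 14x14
        # Deconvolution upsamples F4: 7x7 -> 14x14.
        self.up4 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.fc = nn.Linear(256, hash_bits)

    def forward(self, f1, f2, f3, f4):
        fused = (self.pool1(self.align1(f1)) + self.pool2(self.align2(f2))
                 + f3 + self.up4(self.align4(f4)))   # additive fusion (S124)
        pooled = fused.mean(dim=(2, 3))              # global average pooling
        return self.fc(pooled)  # continuous code F; sign() applied at retrieval
```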
In a preferred embodiment, as shown in fig. 3, S13 further includes:
s131: the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in the data set, and after the input vector is subjected to multi-scale fusion, the input vector can be regarded as a feature map with the length of N, the width of R and the number of channels of C, wherein R is the number of adopted different scales;
s132: converting the feature map obtained in the step S131 into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain global information of the text;
s133: and fusing the R feature vectors obtained in the step S132 in an addition mode, and then obtaining the hash code corresponding to the text through a fully-connected layer and discretization.
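A rough PyTorch sketch of S131-S133 follows. The patent text does not fully specify the multi-scale construction, so here each scale is modeled, as an assumption, by average-pooling the BoW histogram with a different window, and a single convolution stands in for the residual block of S132:

```python
import torch
import torch.nn as nn

class TextMultiScaleHash(nn.Module):
    """Multi-scale BoW views -> conv features -> additively fused hash code."""

    def __init__(self, vocab_size: int, scales=(1, 2, 4),
                 channels: int = 512, hash_bits: int = 64):
        super().__init__()
        self.scales = scales  # R = len(scales) different scales
        # Collapses each (vocab_size x 1) view to a channels-dim vector,
        # standing in for the residual block of S132 (global information).
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(vocab_size, 1)),
            nn.ReLU(),
        )
        self.fc = nn.Linear(channels, hash_bits)

    def forward(self, bow):               # bow: (batch, vocab_size)
        n = bow.size(1)
        views = []
        for s in self.scales:
            # Smooth the histogram at scale s, keeping length n (assumption).
            v = nn.functional.avg_pool1d(bow.unsqueeze(1), kernel_size=s,
                                         stride=1, padding=s // 2)
            views.append(v[..., :n].squeeze(1))
        x = torch.stack(views, dim=-1).unsqueeze(1)  # (batch, 1, n, R)
        x = self.conv(x)                             # (batch, C, 1, R)
        x = x.squeeze(2).sum(dim=-1)                 # add-fuse the R vectors
        return self.fc(x)   # continuous code G; sign() applied at retrieval
```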
In the preferred embodiment, the loss function in S14 includes: an inter-modal (image and text) similarity loss function, an intra-modal (image and image, or text and text) similarity loss function, and a quantization loss in the hashing process. Using these loss functions makes the Hamming distance between semantically similar images and texts smaller and that between dissimilar images and texts larger. The hash codes corresponding to the images and the texts (B_x and B_y) are discrete values, and optimizing them directly would make the model untrainable with SGD. Therefore, the outputs F and G of the last fully-connected layers of the image and text networks are treated as continuous-valued hash codes and used in place of B_x and B_y during training, together with the other parameters of the networks. In the other stages, F and G are discretized to obtain B_x and B_y.
Further, S14 includes:
s141: inter-modal similarity loss function:
to preserve the similarity between modalities, the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
s142: the intra-modal similarity loss function:
to preserve semantic similarity within a modality, for an image modality, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: in order to reduce the information loss caused by discretization, the quantization loss function in the hash process is:

$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
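Under the reconstruction above, the combined loss can be sketched in PyTorch as follows; the numerically stable softplus(x) replaces log(1 + e^x), and the weights alpha and beta are assumed trade-off hyper-parameters:

```python
import torch

def neg_log_likelihood(theta, S):
    """-sum( S_ij * theta_ij - log(1 + exp(theta_ij)) ), computed stably."""
    return -(S * theta - torch.nn.functional.softplus(theta)).sum()

def total_loss(F, G, Bx, By, S, alpha=1.0, beta=1.0):
    """F, G: (n, bits) continuous outputs; Bx, By: signed codes; S: (n, n)."""
    theta_fg = 0.5 * F @ G.t()   # inter-modal pairwise similarities
    theta_ff = 0.5 * F @ F.t()   # intra-modal similarities (images)
    theta_gg = 0.5 * G @ G.t()   # intra-modal similarities (texts)
    j1 = neg_log_likelihood(theta_fg, S)                  # J1
    j23 = (neg_log_likelihood(theta_ff, S)
           + neg_log_likelihood(theta_gg, S))             # J2 + J3
    j4 = (Bx - F).pow(2).sum() + (By - G).pow(2).sum()    # J4
    return j1 + alpha * j23 + beta * j4
```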
in the preferred embodiment, the input at S16 needs to contain both images and text with the same semantic meaning during training, and the corresponding hash code can be generated even if the sample contains only one of the modalities at S16, as shown in fig. 4a-4 f. If using FI(xi;θx) And FT(tj;θy) Representing an image network and a text network, respectively, then this step can be represented as:
bx=sign(FI(xi;θx)) (7)
by=sign(FT(tj;θy))
(8)
wherein b isxAnd byAre respectively BxAnd ByThe element in (1), sign (x), is a sign function, and the expression is as follows:
Figure BDA0002852577090000131
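A minimal sketch of equations (7) and (8); the network handle is assumed to return the continuous output of its last fully-connected layer:

```python
import torch

@torch.no_grad()
def generate_hash_codes(modality_net, batch):
    """b = sign(F(x; theta)) for either the image or the text network."""
    f = modality_net(batch)   # continuous outputs of the last FC layer
    b = torch.sign(f)         # elements mapped to {-1, 0, +1}
    b[b == 0] = 1             # sign(0) -> +1, matching the definition above
    return b
```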
table 1 shows the hash code of the present invention compared with the Average accuracy (MAP) of the prior art deep cross-modal hash (DCMH) and the prior art self-supervised cross-modal hash (SSAH) over three widely used data sets MIRFLICKR-25K, NUS-WIDE and IAPR TC-12. For convenience, the case of image query text is represented by I2T, and the case of text query image is represented by T2I. Using hamming sorting widely used in retrieval as an evaluation criterion, the hamming sorting calculates hamming distances of query objects and objects in the database from the generated hash codes and sorts them in a distance increasing manner. The Average accuracy (MAP) is widely used to measure the accuracy of hamming sorting, and a higher MAP indicates better model performance.
TABLE 1
[Table 1: MAP comparison of the proposed method with DCMH and SSAH on MIRFLICKR-25K, NUS-WIDE and IAPR TC-12 for the I2T and T2I tasks; the values are provided as an image in the original patent.]
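For reference, MAP under Hamming ranking can be computed as in the following NumPy sketch (an illustration of the evaluation protocol, not code from the patent):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP under Hamming ranking for codes with entries in {-1, +1}.

    query_codes: (q, bits); db_codes: (d, bits);
    relevance[i, j] = 1 if database item j is relevant to query i.
    """
    aps = []
    for q, rel in zip(query_codes, relevance):
        # For +/-1 codes, Hamming distance = (bits - inner product) / 2.
        dist = (db_codes.shape[1] - db_codes @ q) / 2
        order = np.argsort(dist, kind="stable")   # increasing distance
        rel_sorted = rel[order]
        hits = np.cumsum(rel_sorted)
        ranks = np.arange(1, len(rel_sorted) + 1)
        prec_at_hits = (hits / ranks)[rel_sorted > 0]
        if prec_at_hits.size:
            aps.append(prec_at_hits.mean())
    return float(np.mean(aps))
```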
As shown in Table 1, the MAP results of the method of the invention are significantly higher than those of the comparative methods on the MIRFLICKR-25K and NUS-WIDE data sets; on the IAPR TC-12 data set, the MAP results of the method are slightly higher than those of the comparative methods in the I2T case and only slightly lower in the T2I case.
In one embodiment, an apparatus for generating a hash code using multi-layer feature fusion is further provided, which comprises: an image-text pair similarity matrix establishing module, an image network model design module, a text network model design module, a loss function design module, a model training module and a hash code obtaining module; wherein,
the image-text pair similarity matrix establishing module is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers from the outputs of different residual blocks, converting the features of the different layers into feature maps with consistent channel number and size, then fusing them, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining characteristics of different scales through the convolution layer and fusing the characteristics, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using the SGD to train the loss function and fixing text network parameters; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in a loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
and the hash code obtaining module is used for inputting the sample into the model obtained by the training of the model training module to obtain the corresponding hash code.
In a preferred embodiment, the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then conducting down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code acquisition module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjustment module and the Nth layer feature map adjustment module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
In a preferred embodiment, the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing the input vector into a feature map with the length of N, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of N and the width of 1, wherein N is the number of selected words in the data set, and R is the number of adopted different scales;
the residual block is used for converting the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain the global information of the text;
and the hash code acquisition module corresponding to the text is used for fusing the R characteristic vectors obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
In a preferred embodiment, the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters.
the embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and not to limit the invention. Any modifications and variations within the scope of the description, which may occur to those skilled in the art, are intended to be within the scope of the invention.

Claims (6)

1. A method for generating a hash code by utilizing multi-layer feature fusion is characterized by comprising the following steps:
s11: establishing a similarity matrix of the image-text pairs;
establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij of S represents the similarity between the i-th image and the j-th text; if the image and the text are similar, S_ij is 1, otherwise it is 0;
s12: designing an image network model;
obtaining the characteristics of different layers through the output of different residual blocks, converting the characteristics of the different layers into characteristic graphs with consistent channel number and size, then fusing, and finally obtaining hash codes corresponding to the images through global pooling, full connection and discretization;
s13: designing a text network model;
generating a corresponding multi-scale Bag-of-words model for each text by using a multi-scale fusion module, then obtaining features of different scales through a convolutional layer, fusing the features, and finally obtaining a hash code corresponding to the text through a full connection layer;
s14: designing a loss function by using the similarity matrix S in the S11;
s15: training a model;
for the image network, randomly selecting two images and sending the two images into the image network, utilizing the loss function in the S14 to carry out constraint, using SGD to train the images, and simultaneously fixing text network parameters;
for the text network, randomly selecting two texts to be sent into the text network, constraining by using the loss function in the S14, training the texts by using the SGD, and fixing image network parameters;
s16: obtaining a hash code;
inputting a sample into the model obtained by the training of the S15 to obtain a corresponding hash code;
the S14 further includes:
the loss function includes: an inter-modality similarity loss function, an intra-modality similarity loss function, and a quantization loss function in a hash process;
the S14 is further:
s141: the inter-modal similarity loss function:
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

then the inter-modal similarity loss function is:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
S142: the intra-modal similarity loss function:
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
s143: the quantization loss function in the hash process is:
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters;
obtaining the hash code is further represented as:
$$b_{x}=\mathrm{sign}\bigl(F_{X}(x_{i};\theta_{X})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{Y}(y_{j};\theta_{Y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
2. the method for generating hash codes using multi-layer feature fusion according to claim 1, wherein said S12 further comprises:
s121: obtaining the characteristics of N different layers through the output of different residual blocks;
s122: for the first N-2 layers, the channel numbers of the layers are consistent with those of the (N-1) th layer by utilizing the convolution layer, and then the front N-2 layers are downsampled by utilizing the pooling layer, so that the sizes of the characteristic graphs of the layers are consistent with those of the (N-1) th layer;
s123: for the Nth layer, the number of channels of the layer is made to be consistent with that of the (N-1) th layer by using the convolution layer, then the layer is up-sampled by using the deconvolution layer, and the size of the characteristic diagram of the layer is made to be consistent with that of the (N-1) th layer;
s124: and adding and fusing the features of the N different layers processed in the S122 and S123, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretizing.
3. The method for generating hash codes using multi-layer feature fusion according to claim 1, wherein said S13 further comprises:
s131: the input vector is regarded as a feature vector with the length of L and the width of 1, wherein L is the number of selected words in a data set, and after the input vector is subjected to multi-scale fusion, the input vector can be regarded as a feature map with the length of L, the width of R and the number of channels of C, wherein R is the number of adopted different scales, and the dimension corresponds to semantic information under different scales;
s132: sending the feature map obtained in the step S131 into a residual block to obtain a feature map with the length of 1, the width of R and the number of channels of C, so as to obtain global information of the text;
s133: and fusing the R feature vectors corresponding to different semantic information obtained in the step S132 in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
4. An apparatus for generating a hash code using multi-layer feature fusion, comprising: the image-text pair similarity matrix building module, the image network model designing module, the text network model designing module, the loss function designing module, the model training module and the hash code obtaining module are used for building a similarity matrix of image-text pairs; wherein the content of the first and second substances,
the similarity matrix establishing module of the image-text pairs is used for establishing a similarity matrix S of the image-text pairs according to the label information in the data set, where the element S_ij represents the similarity between the i-th image and the j-th text; if the image and the text are similar, S_ij is 1, otherwise it is 0;
the image network model design module is used for acquiring the features of different layers through the output of different residual blocks, converting the features of the different layers into feature graphs with consistent channel number and size, then fusing, and finally obtaining the hash codes corresponding to the images through global pooling, full connection and discretization;
the text network model design module is used for generating a corresponding multi-scale Bag-of-words model for each text by using the multi-scale fusion module, then obtaining features of different scales through the convolution layer and fusing the features, and finally obtaining a hash code corresponding to the text through the full connection layer;
the loss function design module is used for designing a loss function;
the model training module is used for randomly selecting two images for the image network and sending the two images into the image network, utilizing the loss function in the loss function design module to carry out constraint, using SGD to train the loss function and fixing text network parameters at the same time; for a text network, randomly selecting two texts to be sent into the text network, utilizing a loss function in the loss function design module to carry out constraint, using SGD to train the texts, and simultaneously fixing image network parameters;
the hash code obtaining module is used for inputting a sample into a model obtained by training of the model training module to obtain a corresponding hash code;
the loss function design module further comprises: an inter-modal similarity loss function module, an intra-modal similarity loss function module and a quantization loss function module; wherein,
the inter-modal similarity loss function module is used for designing an inter-modal similarity loss function;
the similarity between F and G is measured using a pair-wise likelihood function:
$$p(S_{ij}\mid F_{*i},G_{*j})=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}$$

wherein $\Theta_{ij}=\frac{1}{2}F_{*i}^{\top}G_{*j}$, $F_{*i}$ and $G_{*j}$ are the i-th column of F and the j-th column of G, and $\sigma(\cdot)$ is the sigmoid function;

the inter-modal similarity loss function is then:

$$J_{1}=-\sum_{i,j=1}^{n}\left(S_{ij}\Theta_{ij}-\log\left(1+e^{\Theta_{ij}}\right)\right)$$
the intra-modal similarity loss function module is used for designing an intra-modal similarity loss function;
for image modalities, the intra-class similarity loss function is:
$$J_{2}=-\sum_{i,j=1}^{n}\left(S_{ij}\Phi_{ij}-\log\left(1+e^{\Phi_{ij}}\right)\right)$$

wherein $\Phi_{ij}=\frac{1}{2}F_{*i}^{\top}F_{*j}$;

for the text modality, the intra-class similarity loss function is:

$$J_{3}=-\sum_{i,j=1}^{n}\left(S_{ij}\Psi_{ij}-\log\left(1+e^{\Psi_{ij}}\right)\right)$$

wherein $\Psi_{ij}=\frac{1}{2}G_{*i}^{\top}G_{*j}$;
the quantization loss function module designs a quantization loss function in the hash process;
$$J_{4}=\|B_{x}-F\|_{F}^{2}+\|B_{y}-G\|_{F}^{2}$$

the final loss function is:

$$J=J_{1}+\alpha\left(J_{2}+J_{3}\right)+\beta J_{4}$$

wherein $\alpha$ and $\beta$ are trade-off hyper-parameters;
obtaining the hash code is further represented as:
$$b_{x}=\mathrm{sign}\bigl(F_{X}(x_{i};\theta_{X})\bigr)$$

$$b_{y}=\mathrm{sign}\bigl(F_{Y}(y_{j};\theta_{Y})\bigr)$$

wherein $b_{x}$ and $b_{y}$ are the elements of the hash code $B_{x}$ corresponding to the images and of the hash code $B_{y}$ corresponding to the texts respectively, and $\mathrm{sign}(x)$ is the sign function:

$$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\-1, & x<0\end{cases}$$
5. The apparatus for generating hash codes using multi-layer feature fusion according to claim 4, wherein the image network model design module further comprises: a multi-residual module, a front N-2 layer characteristic diagram adjusting module, an N-th layer characteristic diagram adjusting module and a hash code obtaining module corresponding to the image; wherein,
the multi-residual module is used for acquiring the characteristics of N different layers;
the front N-2-layer characteristic diagram adjusting module is used for enabling the channel number of the layers to be consistent with that of the (N-1) th layer by utilizing the convolution layer for the front N-2 layers, and then carrying out down-sampling on the front N-2 layers by utilizing the pooling layer to enable the size of the characteristic diagrams of the layers to be consistent with that of the (N-1) th layer;
the N layer characteristic diagram adjusting module is used for enabling the number of channels of the N layer to be consistent with that of the N-1 layer by utilizing the convolution layer, then utilizing the deconvolution layer to perform upsampling on the N layer, and enabling the size of the characteristic diagram of the N layer to be consistent with that of the N-1 layer;
and the hash code obtaining module corresponding to the image is used for adding and fusing the features of the N different layers processed by the front N-2 layer feature map adjusting module and the Nth layer feature map adjusting module, and then obtaining the hash code corresponding to the image through a global pooling layer and a full connection layer and discretization.
6. The apparatus for generating hash codes using multi-layer feature fusion according to claim 4, wherein the text network model design module comprises: a multi-scale fusion module, a residual block and a hash code obtaining module corresponding to the text; wherein,
the multi-scale fusion module is used for changing an input vector into a feature map with the length of L, the width of R and the number of channels of C when the input vector is regarded as a feature vector with the length of L and the width of 1, wherein L is the number of selected words in a data set, and R is the number of adopted different scales;
the residual block is used for acquiring text global information and changing the feature map obtained by the multi-scale fusion module into a feature map with the length of 1, the width of R and the number of channels of C;
and the hash code acquisition module corresponding to the text is used for fusing the R feature vectors corresponding to different semantic information and obtained by the residual block in an addition mode, and then obtaining the hash code corresponding to the text through a full connection layer and discretization.
CN202011533344.0A 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion Active CN112559810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011533344.0A CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011533344.0A CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Publications (2)

Publication Number Publication Date
CN112559810A CN112559810A (en) 2021-03-26
CN112559810B true CN112559810B (en) 2022-04-08

Family

ID=75030845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011533344.0A Active CN112559810B (en) 2020-12-23 2020-12-23 Method and device for generating hash code by utilizing multi-layer feature fusion

Country Status (1)

Country Link
CN (1) CN112559810B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115410717A * 2022-09-15 2022-11-29 Beijing Jingdong Tuoxian Technology Co., Ltd. Model training method, data retrieval method, image data retrieval method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101472451B1 * 2010-11-04 2014-12-18 Electronics and Telecommunications Research Institute System and method for managing digital contents
CN104346440B * 2014-10-10 2017-06-23 Zhejiang University A cross-media hash indexing method based on neural networks
CN109271486B * 2018-09-19 2021-11-26 Jiujiang University Similarity-preserving cross-modal hash retrieval method
CN110059198B * 2019-04-08 2021-04-13 Zhejiang University Discrete hash retrieval method for cross-modal data based on similarity preservation
CN111753189A * 2020-05-29 2020-10-09 Sun Yat-sen University Common representation learning method for few-shot cross-modal hash retrieval

Also Published As

Publication number Publication date
CN112559810A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
JP6629942B2 (en) Hierarchical automatic document classification and metadata identification using machine learning and fuzzy matching
Arevalo et al. Gated multimodal units for information fusion
CN110059217B (en) Image text cross-media retrieval method for two-stage network
US8150170B2 (en) Statistical approach to large-scale image annotation
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN111126396B (en) Image recognition method, device, computer equipment and storage medium
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN111461174B (en) Multi-mode label recommendation model construction method and device based on multi-level attention mechanism
WO2021098585A1 (en) Image search based on combined local and global information
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN112163114B (en) Image retrieval method based on feature fusion
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN115017911A (en) Cross-modal processing for vision and language
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN114491115B (en) Multi-model fusion integrated image retrieval method based on deep hash
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN112559810B (en) Method and device for generating hash code by utilizing multi-layer feature fusion
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
CN115080699A (en) Cross-modal retrieval method based on modal specific adaptive scaling and attention network
Lakshmi An efficient telugu word image retrieval system using deep cluster
CN112784838A (en) Hamming OCR recognition method based on locality sensitive hashing network
Kabir et al. Content-Based Image Retrieval Using AutoEmbedder
CN116630726B (en) Multi-mode-based bird classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant