CN117909476A - Cross-media retrieval method and model training method based on multi-bit hash code - Google Patents

Cross-media retrieval method and model training method based on multi-bit hash code

Info

Publication number
CN117909476A
Authority
CN
China
Prior art keywords
hash
image
text
features
feature
Prior art date
Legal status
Pending
Application number
CN202410079935.7A
Other languages
Chinese (zh)
Inventor
张正
吴清鹏
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202410079935.7A
Publication of CN117909476A
Legal status: Pending


Abstract

The application discloses a cross-media retrieval method based on multi-bit hash codes and a model training method. The method fully captures the global features and local semantic features of various media information by constructing a multi-bit hash code network model, and hierarchically aligns the global features and local semantic features of the various media information, so that the fine-grained, semantically richer local features are fully exploited to align heterogeneous multimedia data, thereby effectively reducing the heterogeneity and semantic gap among different media. Meanwhile, the application uses the aligned features to collaboratively and simultaneously generate a plurality of high-quality hash codes of different lengths, which solves the problem that fixed-length hash codes severely limit the flexibility, accuracy and scalability of cross-media retrieval, and improves the accuracy and flexibility of cross-media retrieval.

Description

Cross-media retrieval method and model training method based on multi-bit hash code
Technical Field
The application relates to the technical field of information, in particular to a multi-bit hash code-based cross-media retrieval method and a model training method.
Background
Cross-media retrieval is an important similarity search technique that refers to retrieving, from one media representation (e.g., images), data that is semantically related to a given query item of another media (e.g., text). In particular, among cross-media retrieval methods, the cross-media hash method is one of the efficient retrieval methods; it converts the original data into compact binary code representations, thereby enabling quick retrieval. However, the cross-media hash method currently faces two problems. One is the huge data volume and complex heterogeneity of multimedia data, where heterogeneity means that different media data lie in differently distributed "heterogeneous" feature spaces, which greatly affects the accuracy of cross-media retrieval. The other is that almost all existing cross-media hash methods can only learn single-bit hash codes, i.e., hash codes of one fixed length such as 32 bits, which also affects the accuracy of cross-media retrieval.
Disclosure of Invention
The application aims to solve the technical problem of providing a cross-media retrieval method and a model training method based on a multi-bit hash code aiming at the defects of the prior art.
In order to solve the above technical problems, a first aspect of the present application provides a training method of a multi-bit hash code network model, where the training method of the multi-bit hash code network model specifically includes:
Acquiring a training sample set, and constructing a semantic similarity matrix by utilizing semantic tags of training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs;
Inputting each image-text pair in the training batch into an initial hash model, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples through the initial hash model;
Aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, learning text semantic features of the text local feature sequence and image semantic features of the image local feature sequence according to shared concept embeddings, and constructing a cross-modal contrast loss term based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features;
Determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features;
determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features;
Constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch;
and updating parameters of the initial hash model based on the cross-modal contrast loss term and the hash loss term to obtain a multi-bit hash code network model.
In the training method of the multi-bit hash code network model, the multi-bit hash code network model comprises a feature extraction module, a cross-media contrast type alignment module and a multi-hash collaborative learning module, wherein the feature extraction module is connected with the cross-media contrast type alignment module, and the cross-media contrast type alignment module is connected with the multi-hash collaborative learning module; the feature extraction module is used for extracting text global features, text local feature sequences, image global features and image local feature sequences; the cross-media contrast type alignment module is used for determining text fusion features and image fusion features according to the text global features, the image global features, the text local feature sequences and the image local feature sequences; the multi-hash collaborative learning module is used for determining at least two first hash features according to the text fusion features, determining at least two first hash codes based on the at least two first hash features, determining at least two second hash features according to the image fusion features, and determining at least two second hash codes based on the at least two second hash features.
In the training method of the multi-bit hash code network model, the cross-media contrast type alignment module comprises a residual MLP unit, a local learning unit, a pooling layer and an adder, wherein the residual MLP unit is connected with the adder, the local learning unit is connected with the adder through the pooling layer, the local learning unit comprises a cross attention layer and a Transformer layer which are sequentially connected, and the query vectors of the cross attention layer are the shared concept embeddings.
The training method of the multi-bit hash code network model, wherein the construction of the cross-mode contrast loss item based on the aligned text global feature, the aligned image global feature, the text semantic feature and the image semantic feature comprises the following specific steps:
For each image-text pair in a training batch, determining a first similarity between the aligned text global feature of the image-text pair and each aligned image global feature in the training batch, and a second similarity between the aligned image global feature of the image-text pair and each aligned text global feature in the training batch;
determining a global contrast loss term according to all the determined first similarities and all the determined second similarities;
Determining, for each image-text pair, a third similarity between each text semantic feature in the text semantic feature sequence of the image-text pair and each image semantic feature in the image semantic feature sequence of the image-text pair, and a fourth similarity between each image semantic feature in the image semantic feature sequence of the image-text pair and each text semantic feature in the text semantic feature sequence of the image-text pair;
determining a local contrast loss term according to all the determined third similarity and all the determined fourth similarity;
And determining a cross-modal contrast loss term according to the global contrast loss term and the local contrast loss term.
The method for training the multi-bit hash code network model, wherein constructing a hash loss term according to at least two first hash features, at least two second hash features and a semantic similarity matrix specifically comprises:
for each image-text pair in a training batch, determining a first hash feature inner product between each first hash feature of the image-text pair and the corresponding first hash feature of each image-text pair in the training batch, and a second hash feature inner product between each second hash feature of the image-text pair and the corresponding second hash feature of each image-text pair in the training batch; and determining a third hash feature inner product between each first hash feature of the image-text pair and the corresponding second hash feature of each image-text pair in the training batch, and a fourth hash feature inner product between each second hash feature of the image-text pair and the corresponding first hash feature of each image-text pair in the training batch;
Determining an intra-media loss term according to all the determined first hash feature inner products, all the determined second hash feature inner products and the semantic similarity matrix;
determining an inter-media loss term according to all the determined third hash feature inner products, all the determined fourth hash feature inner products and the semantic similarity matrix;
for each image-text pair in the training batch, determining a hash feature $b_{i,w}$ according to each first hash feature and its corresponding second hash feature, and determining a hash quantization term according to all the hash features $b_{i,w}$, all the first hash features and all the second hash features;
and determining a hash loss term according to the intra-media loss term, the inter-media loss term and the hash quantization term.
The method for training the multi-bit hash code network model, wherein before updating the parameters of the initial hash model based on the cross-modal contrast loss item and the hash loss item to obtain the multi-bit hash code network model, the method further comprises:
For each image-text pair in the training batch, determining text-assisted hash features based on the text fusion features, and determining image-assisted hash features based on the image fusion features;
learning a text hash code according to the text auxiliary hash feature, learning an image hash code according to the image auxiliary hash feature, and determining an auxiliary hash code according to the text hash code and the image hash code;
Mapping at least two first hash features and at least two second hash features to a Hamming space where the text auxiliary hash features are located to obtain at least two first hash codes and at least two second hash codes;
Determining an auxiliary hash loss item according to the text auxiliary hash characteristic and the image auxiliary hash characteristic of the image-text pairs in the training batch, and determining a reconstruction loss item according to the auxiliary hash code, the at least two first hash codes and the at least two second hash codes;
and determining a correction loss term according to the auxiliary hash loss term and the reconstruction loss term, and correcting the hash loss term based on the correction loss term.
The second aspect of the embodiment of the application provides a multi-bit hash code-based cross-media retrieval method, wherein the multi-bit hash code-based cross-media retrieval method specifically comprises the following steps:
Acquiring a query sample, and determining a query hash code sequence of the query sample through the multi-bit hash code network model, wherein the query hash code sequence comprises at least two hash codes;
acquiring a target data set corresponding to the query sample, and determining a hash code sequence set corresponding to the target data set through the multi-bit hash code network model;
And determining target data corresponding to the query sample in the target data set according to the query hash code sequence and the hash code sequence set.
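For illustration only (not part of the claims), the following minimal Python sketch shows one way the retrieval step could be realized: the query hash code sequence is compared against the hash code sequence set by Hamming distance, and the distances obtained from the different code lengths are combined. The code lengths, the {-1, +1} code convention and the length-normalized aggregation are assumptions, not requirements of the method.

```python
import numpy as np

def hamming_distance(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one query code (K,) in {-1, +1} and
    database codes (N, K) in {-1, +1}: (K - <q, b>) / 2."""
    return (query_code.shape[0] - db_codes @ query_code) / 2

def retrieve(query_codes: dict, db_codes: dict, top_k: int = 10) -> np.ndarray:
    """query_codes / db_codes map a code length (e.g. 16, 32, 64) to the
    query code (K,) / the database codes (N, K).  Distances from the
    different lengths are normalized by the length and summed; a shorter
    code could instead be used alone for a faster, coarser search."""
    total = None
    for k, q in query_codes.items():
        d = hamming_distance(q, db_codes[k]) / k       # normalize per code length
        total = d if total is None else total + d
    return np.argsort(total)[:top_k]                   # indices of the nearest items

# toy usage with random codes
rng = np.random.default_rng(0)
db = {k: np.sign(rng.standard_normal((1000, k))) for k in (16, 32, 64)}
query = {k: np.sign(rng.standard_normal(k)) for k in (16, 32, 64)}
print(retrieve(query, db, top_k=5))
```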
A third aspect of the present application provides a training device for a multi-bit hash code network model, where the device specifically includes:
The acquisition module is used for acquiring a training sample set and constructing a semantic similarity matrix by utilizing semantic tags of all training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs;
The initial hash model is used for receiving each image-text pair in the training batch, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples; aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, learning text semantic features of the text local feature sequence and image semantic features of the image local feature sequence according to shared concept embeddings, and constructing a cross-modal contrast loss term based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features; determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features; determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features; and constructing a hash loss term according to the semantic similarity matrix and the at least two first hash features and at least two second hash features of each image-text pair in the training batch;
And the parameter updating module is used for updating the parameters of the initial hash model based on the cross-modal contrast loss term and the hash loss term so as to obtain a multi-bit hash code network model.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium, where the computer readable storage medium stores one or more programs executable by one or more processors to implement steps in a method for training a multi-bit hash code network model as described in any one of the above.
A fifth aspect of an embodiment of the present application provides a terminal device, including: a processor and a memory;
The memory has stored thereon a computer readable program executable by the processor;
The processor, when executing the computer readable program, implements the steps in the training method of the multi-bit hash code network model as described in any one of the above.
The beneficial effects are that: compared with the prior art, the method and the device have the advantages that the global features and the local semantic tokens of various media information are obtained by constructing the multi-bit hash code network model, and the heterogeneity and the semantic gap among different media are effectively reduced by respectively carrying out hierarchical alignment on the global features and the local semantic tokens of the various media information. Meanwhile, a plurality of high-quality hash codes with different lengths are generated simultaneously through the global features and the local semantic tokens of all the media information after alignment, so that the efficiency and the accuracy of cross-media retrieval are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a network model structure diagram of a training method of a multi-bit hash code network model according to an embodiment of the present application.
Fig. 2 is a flowchart of an embodiment of a training method of a multi-bit hash code network model according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a training device for a multi-bit hash code network model according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The application provides a multi-bit hash code-based cross-media retrieval method and a model training method, which are used for making the purposes, the technical scheme and the effects of the application clearer and more definite, and the application is further described in detail below by referring to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence number and the size of each step in this embodiment do not mean the sequence of execution, and the execution sequence of each process is determined by the function and the internal logic of each process, and should not be construed as limiting the implementation process of the embodiment of the present application.
Research shows that with the rapid development of intelligent devices and social media, multimedia data presents an unprecedented explosive growth trend. In addition, multimedia data is collected from different sources and represented in different forms, such as text, pictures and video, which constitute heterogeneous multimedia data. The huge data volume and complex heterogeneity of multimedia data, where heterogeneity refers to the presence of different media data in differently distributed "heterogeneous" feature spaces, i.e. inconsistent feature representations and distributions, have a huge impact on the efficiency and accuracy of cross-media retrieval.
Cross-media retrieval is an important similarity search technique that refers to retrieving, from one media representation (e.g., images), data that is semantically related to a given query item of another media (e.g., text), that is, for example, retrieving images semantically related to a given text. In particular, among cross-media retrieval methods, the cross-media hash method is one of the efficient retrieval methods; it converts the original data into compact binary code representations, thereby enabling quick retrieval. However, one of the problems with the cross-media hash method today is how to mitigate the heterogeneity among various media data so as to improve the accuracy of cross-media retrieval. In addition, another problem is how to simultaneously generate high-quality hash codes of various lengths through a unified architecture, so as to improve the accuracy and efficiency of cross-media retrieval and to flexibly meet the different requirements of different platform systems on retrieval performance and retrieval efficiency.
Currently, many hash-based cross-media retrieval methods have been proposed to support fast similarity searches across different media data. Existing cross-media retrieval methods can be divided into different sub-categories from different angles. Typically, the methods can be classified into unsupervised and supervised methods according to whether supervision information is utilized, into shallow and deep methods according to whether they are based on deep learning, and into single-bit and multi-bit methods according to the number of hash code lengths the model generates.
For the unsupervised and supervised cross-media hash methods, the existing cross-media hash methods can be roughly divided into two types, namely an unsupervised cross-media hash method and a supervised cross-media hash method, according to whether the manually noted semantic tags are involved or not.
The unsupervised cross-media hash method focuses on preserving semantic similarity by exploring the correlation between cross-media data pairs without using supervision information (i.e., pair-wise similarity matrix or semantic tags). The unsupervised approach can be easily extended to handle cases where cross-media retrieval tasks lack tag information, or where tag information is expensive to obtain.
Notably, the unsupervised cross-media hash method focuses on capturing correlations between multiple media data embedded in the original space, without considering advanced semantic supervision to learn discriminative binary hash codes. In contrast, supervised cross-media hashing adequately captures the value of the supervision information, leveraging semantic tags to enhance the discriminatory power of the learned hash codes. Therefore, supervised cross-media hashing can often achieve more superior retrieval performance.
For shallow and deep cross-media hash methods, existing cross-media hash methods can be divided into two categories, namely shallow cross-media hashing and deep cross-media hashing, depending on whether deep learning techniques are utilized. Shallow cross-media hash methods typically adopt a two-stage learning paradigm, namely feature extraction followed by hash code learning. However, because the feature extraction and hash learning processes are separate from each other, the two stages may be poorly adapted to each other, i.e., the extracted features are not the optimal features for learning hash codes.
In recent years, due to the strong feature representation capability and nonlinear modeling capability of deep networks, many deep cross-media hash methods based on multi-layer neural networks have been proposed, which generate information-rich deep features in an end-to-end training manner and learn high-quality hash codes based on these features. Experience has shown that the end-to-end deep learning architecture is more suitable for learning discriminative hash codes than traditional shallow cross-media hash methods using hand-crafted features.
For single-bit and multi-bit cross-media hash methods, almost all existing cross-media hash methods can only learn single-bit hash codes, i.e., hash codes of one fixed length, e.g., 32 bits. When hash codes of different lengths are required, e.g., 16 bits, 32 bits, 64 bits, etc., these single-bit methods have to modify the dimension of the output layer and then retrain the whole learning network to generate each high-quality hash code, which consumes a lot of time and resources. To mitigate heterogeneity between different media, most deep cross-media hash methods maintain pairwise cross-media similarity mainly by using only global embeddings. However, using only the global representation is not sufficient for discriminative hash learning. Intuitively, a target object that carries important semantics may only occupy a small portion of the original data and may easily be ignored by global features. Therefore, these frameworks inevitably produce sub-optimal hash codes. Moreover, the flexibility and scalability of this learning paradigm in practice are very limited, since efficiency and performance must be balanced in a practical system. In general, short hash codes enable faster retrieval but suffer from information loss, whereas long hash codes achieve better retrieval performance but are less computationally efficient than short hash codes. Thus, it is extremely difficult to determine the desired hash code length that meets the performance and storage requirements of real scenes, which inevitably results in tedious adjustment of the code length by retraining the network.
In order to solve the above problems, in an embodiment of the present application, a training sample set is acquired and a semantic similarity matrix is constructed by using the semantic tags of the training samples in the training sample set, where the training sample set includes a plurality of training batches, and each training batch includes a plurality of image-text pairs; each image-text pair in the training batch is input into an initial hash model, and the text global features, text local feature sequences, image global features and image local feature sequences of the training samples are determined through the initial hash model; the text global feature is aligned with the image global feature to obtain an aligned text global feature and an aligned image global feature, text semantic features of the text local feature sequence and image semantic features of the image local feature sequence are learned according to shared concept embeddings, and a cross-modal contrast loss term is constructed based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features; text fusion features are determined according to the aligned text global features and the text semantic features, and image fusion features are determined according to the aligned image global features and the image semantic features; at least two first hash features are determined according to the text fusion features, and at least two second hash features are determined according to the image fusion features; a hash loss term is constructed according to the semantic similarity matrix and the at least two first hash features and at least two second hash features of each image-text pair in the training batch; and the parameters of the initial hash model are updated based on the cross-modal contrast loss term and the hash loss term to obtain a multi-bit hash code network model. According to the application, the global features and the local semantic tokens of various media information are obtained by constructing the multi-bit hash code network model, and the heterogeneity and the semantic gap between different media are effectively reduced by hierarchically aligning the global features and the local semantic tokens of the various media information respectively. Meanwhile, a plurality of high-quality hash codes with different lengths are generated simultaneously from the aligned global features and local semantic tokens of each media, so that the efficiency and accuracy of cross-media retrieval are improved.
The application will be further described by the description of embodiments with reference to the accompanying drawings.
In order to illustrate a specific implementation process of the training method of the multi-bit hash code network model provided by the embodiment of the present application, the multi-bit hash code network model is first described herein. The multi-bit hash network model is used to obtain a plurality of different length hash codes, e.g., 16 bits, 32 bits, 64 bits, etc., of media information.
Specifically, as shown in fig. 1, the multi-bit hash code network model includes a feature extraction module, a cross-media contrast type alignment module and a multi-hash collaborative learning module, where the feature extraction module is connected with the cross-media contrast type alignment module, and the cross-media contrast type alignment module is connected with the multi-hash collaborative learning module, where the feature extraction module is used to extract a text global feature, a text local feature sequence, an image global feature and an image local feature sequence; the cross-media contrast type alignment module is used for determining text fusion features and image fusion features according to the text global features, the image global features, the text local feature sequences and the image local feature sequences; the multi-hash collaborative learning module is used for determining at least two first hash features according to the text fusion features, determining at least two first hash codes based on the at least two first hash features, determining at least two second hash features according to the image fusion features, and determining at least two second hash codes based on the at least two second hash features.
Specifically, the feature extraction module may include a plurality of Transformer encoders for encoding media information to obtain global features and local features of each media. In the embodiment of the present application, the feature extraction module may include two paths of Transformer encoders, where one path of Transformer encoder is used to extract the text global feature and the text local feature sequence, and the other path of Transformer encoder is used to extract the image global feature and the image local feature sequence.
The cross-media contrast type alignment module comprises a residual MLP unit, a local learning unit, a pooling layer and an adder, wherein the residual MLP unit is connected with the adder, the local learning unit is connected with the adder through the pooling layer, the local learning unit comprises a cross attention layer and a Transformer layer which are sequentially connected, and the query vectors of the cross attention layer are the shared concept embeddings. The residual MLP unit comprises weight-shared residual multilayer perceptron blocks (ResMLP), where a residual multilayer perceptron block is formed by stacking one or more identical blocks, and each block comprises a multilayer perceptron with a residual connection. In this embodiment, the residual multilayer perceptron block consists of a stack of 2 identical blocks.
In the embodiment of the application, the cross-media contrast type alignment module comprises two paths, each consisting of a residual MLP unit, a local learning unit, a pooling layer and an adder, wherein the two paths have the same composition structure: one path is used for processing the text global feature and the text local feature sequence to determine the text fusion feature, and the other path is used for processing the image global feature and the image local feature sequence to determine the image fusion feature.
The multi-hash collaborative learning module comprises a hash linear projection layer, wherein the hash linear projection layer comprises a plurality of hash encoders, and the hash encoders are used for encoding the text fusion features and the image fusion features so as to obtain a plurality of hash codes with different lengths and high quality. In the embodiment of the application, the multi-hash collaborative learning module comprises two paths of hash linear projection layers, wherein one path is used for determining at least two first hash codes according to text fusion characteristics, and the other path is used for determining at least two second hash codes according to image fusion characteristics.
As shown in fig. 2, the training method of the multi-bit hash code network model provided in this embodiment specifically includes steps S10-S70.
S10, acquiring a training sample set, and constructing a semantic similarity matrix by utilizing semantic tags of training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs.
Specifically, an image-text pair includes an image and a text that are semantically related; for example, if image A shows a red-and-yellow truck driving on a street and text B is "a truck driving on a street", then image A and text B may form an image-text pair. In this embodiment, the i-th image-text pair may be represented as $o_i = (v_i, t_i, l_i)$, where $v_i$ represents the image of the i-th image-text pair and $d_v$ represents the dimension of the image, $t_i$ represents the text of the i-th image-text pair and $d_t$ represents the dimension of the text, $l_i \in \{0,1\}^{1\times C}$ represents the semantic label of the i-th image-text pair, and C represents the number of semantic class labels of the training sample set. A semantic label is a set of semantic class labels annotated according to the semantics of the image-text pair; an image-text pair may be a single-label sample or a multi-label sample, that is, the semantic label may include one semantic class label or a plurality of semantic class labels. In the embodiment of the application, each image-text pair has a plurality of semantic category labels, i.e., each image-text pair is a multi-label sample.
The training sample set comprises a plurality of training batches, wherein each training batch is used as the training data of one training cycle of the initial hash model for training the initial hash model, so as to obtain the multi-bit hash code network model. In other words, the image-text pairs are the training samples for training the initial hash model. For ease of illustration, in an embodiment of the present application, the training sample set is represented as $O = \{o_i\}_{i=1}^{N}$, where N represents the number of training samples in the training sample set.
The semantic similarity matrix is constructed according to semantic tags of the image-text pairs and is used for reflecting similarity of each image-text pair of the training sample set in semantic space. Specifically, for N training samples in the training sample set, an n×n semantic similarity matrix S ij may be generated, where a value of S ij is 0 or 1, S ij =1 indicates that the ith image-text pair o i and the jth image-text pair o j in the training sample set have at least one same semantic class label, and S ij =0 indicates that the ith image-text pair o i and the jth image-text pair o j in the training sample set do not have the same semantic class label.
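As a concrete illustration, the following minimal sketch (variable names are illustrative) constructs the N×N semantic similarity matrix from multi-hot semantic labels as defined above:

```python
import numpy as np

def build_similarity_matrix(labels: np.ndarray) -> np.ndarray:
    """labels: (N, C) multi-hot semantic labels with entries in {0, 1}.
    S[i, j] = 1 if pairs i and j share at least one semantic class label,
    otherwise 0, as defined for the semantic similarity matrix above."""
    overlap = labels @ labels.T            # (N, N) counts of shared class labels
    return (overlap > 0).astype(np.float32)

# toy usage: 4 samples, 3 semantic classes
L = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=np.float32)
S = build_similarity_matrix(L)   # S[0, 3] = 1 (share class 2), S[0, 1] = 0
```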
S20, inputting each image-text pair in the training batch into an initial hash model, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples through the initial hash model.
Specifically, the initial hash model has the same model structure as the multi-bit hash code network model and differs only in the model parameters: the initial hash model is the initial network model before training, and the multi-bit hash code network model is the network model obtained after training with the training sample set. That is, training the initial hash model with the training sample set yields the trained multi-bit hash code network model.
The text global feature is a global feature obtained by extracting features of the whole text global information of the text. The text local feature sequence comprises a plurality of text local features, wherein the text local features are local region features obtained by dividing a text into a plurality of phrases and extracting features of each phrase. For example, the text a is divided into 5 phrases, and feature extraction is performed on the 5 phrases respectively, so that 5 text local features can be obtained, and the 5 text local features can form a text local feature sequence of the text.
The image global feature is a global feature obtained by extracting features of global information of the whole image of the image, such as features of shape, texture, color and the like. The image local feature sequence comprises a plurality of image local features, wherein the image local features are local region features obtained by dividing an image into a plurality of image blocks and extracting features of each image block.
In the embodiment of the application, the text is divided into a plurality of phrases and combined with the position codes obtained by encoding the text position information in advance; feature extraction is performed on the global information of the whole text through one path of Transformer encoder (for example, a GPT-2 encoder) of the feature extraction module to obtain the text global feature, and feature extraction is performed on each phrase to obtain a plurality of text local features, so that the plurality of text local features are combined into the text local feature sequence.
Similarly, the image is divided into 3×3 image blocks and combined with the position codes obtained by encoding the image position information in advance; feature extraction is performed on the global information of the whole image through the other path of Transformer encoder (for example, a ViT encoder) to obtain the image global feature, and feature extraction is performed on each image block to obtain a plurality of image local features, so that the plurality of image local features are combined into the image local feature sequence.
For ease of illustration, in an embodiment of the present application, the text features $F_i^t$ and the image features $F_i^v$ of the i-th image-text pair in the training sample set can be expressed as:

$$F_i^t = \left\{ f_i^{t,g},\ F_i^{t,l} \right\}, \qquad F_i^v = \left\{ f_i^{v,g},\ F_i^{v,l} \right\}$$

where $f_i^{t,g} \in \mathbb{R}^{1\times d}$ and $f_i^{v,g} \in \mathbb{R}^{1\times d}$ respectively represent the text global feature and the image global feature of the i-th image-text pair in the training sample set, $\mathbb{R}^{1\times d}$ represents the feature dimension of the global features, $F_i^{t,l} \in \mathbb{R}^{L_t\times d}$ and $F_i^{v,l} \in \mathbb{R}^{L_v\times d}$ respectively represent the text local feature sequence and the image local feature sequence of the i-th image-text pair in the training sample set, $L_t$ represents the number of local tokens of the text local feature sequence, $L_v$ represents the number of local tokens of the image local feature sequence, $\mathbb{R}^{L_t\times d}$ represents the feature dimension of the text local feature sequence, and $\mathbb{R}^{L_v\times d}$ represents the feature dimension of the image local feature sequence.
S30, aligning the text global features and the image global features to obtain aligned text global features and aligned image global features, learning text semantic features of the text local feature sequence and image semantic features of the image local feature sequence according to shared concept embeddings, and constructing a cross-modal contrast loss term based on the aligned text global features, the aligned image global features, the text semantic features and the image semantic features.

Specifically, aligning the text global features means mapping the text global features to a spatial representation of the same dimension, and aligning the image global features means mapping the image global features to a spatial representation of the same dimension. In this embodiment, the text global feature $f_i^{t,g}$ is input into one path of residual MLP unit and mapped to the same dimensional spatial representation through the residual multilayer perceptron block (ResMLP) in that residual MLP unit, so as to obtain the aligned text global feature; the image global feature $f_i^{v,g}$ is input into the other path of residual MLP unit and mapped to the same dimensional spatial representation through the weight-shared residual multilayer perceptron block (ResMLP) in the other residual MLP unit, so as to obtain the aligned image global feature. The aligned text global feature and the aligned image global feature can be expressed as:

$$e_i^{t,g} = \mathrm{ResMLP}\big(f_i^{t,g};\ \theta_{res}\big), \qquad e_i^{v,g} = \mathrm{ResMLP}\big(f_i^{v,g};\ \theta_{res}\big)$$

where $e_i^{t,g} \in \mathbb{R}^{d}$ represents the aligned text global feature, $e_i^{v,g} \in \mathbb{R}^{d}$ represents the aligned image global feature, $f_i^{t,g}$ represents the text global feature, $f_i^{v,g}$ represents the image global feature, $\mathrm{ResMLP}$ represents the mapping function, $\theta_{res}$ is a trainable weight coefficient, and $\mathbb{R}^{d}$ represents the feature dimension of the aligned text global feature or the aligned image global feature.
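A minimal PyTorch-style sketch of a residual MLP unit consistent with the description above; the hidden width, activation, normalization and input projection are assumptions, and the same (weight-shared) unit would be applied to both the text and the image global features:

```python
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    """One multilayer perceptron with a residual connection."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.mlp(x))

class ResMLPUnit(nn.Module):
    """Stack of identical residual MLP blocks (2 in the embodiment) that maps
    a global feature to the shared d-dimensional aligned space."""
    def __init__(self, in_dim: int, out_dim: int, num_blocks: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.blocks = nn.Sequential(*[ResMLPBlock(out_dim) for _ in range(num_blocks)])

    def forward(self, f_global):                  # (B, in_dim) global feature
        return self.blocks(self.proj(f_global))   # (B, out_dim) aligned global feature
```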
The text semantic features are fine-grained, fixed-length semantic token representations obtained by converting, from a concept-based perspective, the text local features in the text local feature sequence. The image semantic features are fine-grained, fixed-length semantic token representations obtained by converting, from a concept-based perspective, the image local features in the image local feature sequence.

In an embodiment of the application, the text semantic features and the image semantic features are acquired by using the local learning units. In other words, a local learning unit selectively aggregates the low-level, variable-length text local features or image local features to obtain a plurality of fine-grained, fixed-length text semantic features or image semantic features.
Specifically, P shared concept embeddings $Q \in \mathbb{R}^{P\times d}$ are preset, where P represents the number of shared concept embeddings and d represents the embedding dimension of the shared concept embeddings. The shared concept embeddings Q are input into the cross attention layer of one path of local learning unit as the query vectors, with the text local feature sequence as the keys and values, to obtain a text attention map; the shared concept embeddings Q are input into the cross attention layer of the other path of local learning unit as the query vectors, with the image local feature sequence as the keys and values, to obtain an image attention map.
Wherein the text attention map and the image attention map may be expressed as:
$$A_i^{t} = \mathrm{Softmax}\!\left(\frac{Q W \big(F_i^{t,l}\big)^{\top}}{\sqrt{d}}\right), \qquad A_i^{v} = \mathrm{Softmax}\!\left(\frac{Q W \big(F_i^{v,l}\big)^{\top}}{\sqrt{d}}\right)$$

where $A_i^{t} \in \mathbb{R}^{P\times L_t}$ represents the text attention map, $A_i^{v} \in \mathbb{R}^{P\times L_v}$ represents the image attention map, $F_i^{t,l}$ represents the text local feature sequence, $F_i^{v,l}$ represents the image local feature sequence, $Q$ represents the shared concept embeddings, $d$ represents the embedding dimension of the shared concept embeddings $Q$, $W \in \mathbb{R}^{d\times d}$ represents a learnable matrix, $P$ represents the number of shared concept embeddings, $L_t$ represents the number of local tokens of the text local feature sequence, $L_v$ represents the number of local tokens of the image local feature sequence, $\mathbb{R}^{P\times L_t}$ and $\mathbb{R}^{P\times L_v}$ respectively represent the dimensions of the text attention map and the image attention map, and $\mathrm{Softmax}$ represents the Softmax activation function.
Further, from the text attention map and the text local feature sequence, a text coarse semantic representation may be obtained. From the image attention map and the image local feature sequence, a coarse semantic representation of the image may be obtained. Wherein, the text coarse semantics and the image coarse semantics can be expressed as:
$$Z_i^{t} = A_i^{t}\, F_i^{t,l}\, W', \qquad Z_i^{v} = A_i^{v}\, F_i^{v,l}\, W'$$

where $Z_i^{t} \in \mathbb{R}^{P\times d}$ represents the text coarse semantic representation, $A_i^{t}$ represents the text attention map, $F_i^{t,l}$ represents the text local feature sequence, $Z_i^{v} \in \mathbb{R}^{P\times d}$ represents the image coarse semantic representation, $A_i^{v}$ represents the image attention map, $F_i^{v,l}$ represents the image local feature sequence, $W' \in \mathbb{R}^{d\times d}$ is a learnable matrix, $P$ represents the number of shared concept embeddings, $d$ represents the embedding dimension of the shared concept embeddings, and $\mathbb{R}^{P\times d}$ represents the dimension of the text coarse semantic representation or the image coarse semantic representation.
The text coarse semantic representation is input into the Transformer layer of one path of local learning unit to obtain the text semantic features, and all the text semantic features form a text semantic feature sequence; the image coarse semantic representation is input into the Transformer layer of the other path of local learning unit to obtain the image semantic features, and all the image semantic features form an image semantic feature sequence. The Transformer layer of the local learning unit comprises one Transformer encoder block, which is used for acquiring fine text semantic features by capturing the correlations between the text coarse semantic tokens, and for acquiring fine image semantic features by capturing the correlations between the image coarse semantic tokens. The text semantic feature sequence and the image semantic feature sequence can be expressed as:

$$E_i^{t} = \mathrm{Transformer}\big(Z_i^{t};\ \theta_{trans}\big), \qquad E_i^{v} = \mathrm{Transformer}\big(Z_i^{v};\ \theta_{trans}\big)$$

where $E_i^{t} \in \mathbb{R}^{P\times d}$ represents the text semantic feature sequence, $E_i^{v} \in \mathbb{R}^{P\times d}$ represents the image semantic feature sequence, $Z_i^{t}$ represents the text coarse semantic representation, $Z_i^{v}$ represents the image coarse semantic representation, $\theta_{trans}$ is a trainable weight coefficient, $P$ represents the number of shared concept embeddings, $d$ represents the embedding dimension of the shared concept embeddings, and $\mathbb{R}^{P\times d}$ represents the dimension of the text semantic feature sequence or the image semantic feature sequence.
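A minimal PyTorch-style sketch of one local learning unit, in which the P shared concept embeddings act as queries of a cross attention over the local feature sequence and the result is refined by one Transformer encoder block; the head count and other hyper-parameters are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalLearningUnit(nn.Module):
    def __init__(self, dim: int, num_concepts: int = 8):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(num_concepts, dim))  # shared concept embeddings Q
        self.W = nn.Linear(dim, dim, bias=False)       # learnable projection for the attention
        self.Wp = nn.Linear(dim, dim, bias=False)      # learnable projection for coarse semantics
        self.transformer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                      batch_first=True)

    def forward(self, local_feats):                    # (B, L, dim) local feature sequence
        B, L, d = local_feats.shape
        q = self.W(self.concepts)                      # (P, dim) projected queries
        attn = F.softmax(q @ local_feats.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, P, L) attention map
        coarse = self.Wp(attn @ local_feats)           # (B, P, dim) coarse semantic representation
        return self.transformer(coarse)                # (B, P, dim) semantic feature sequence
```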
In this embodiment, through the cross-media contrast type alignment module, the residual MLP units obtain the aligned text global features and aligned image global features, so that the global features of different media are hierarchically aligned and redundant feature representations are reduced; by introducing a cross attention mechanism and a Transformer layer, important context information is retained and the low-level, variable-length text local features and image local features are selectively aggregated into fine-grained, fixed-length text semantic features and image semantic features, so that the local semantic tokens of different media are hierarchically aligned. This effectively bridges media heterogeneity and reduces the heterogeneity and semantic gap among different media.
In one implementation manner of this embodiment, the construction of the cross-modal contrast loss term based on the aligned text global feature, the aligned image global feature, the text semantic feature, and the image semantic feature specifically includes:
S31, for each image-text pair in a training batch, determining a first similarity between the aligned text global feature $e_i^{t,g}$ of the image-text pair and each aligned image global feature $e_c^{v,g}$ in the training batch, and a second similarity between the aligned image global feature $e_i^{v,g}$ of the image-text pair and each aligned text global feature $e_c^{t,g}$ in the training batch;
S32, determining a global contrast loss term according to all the determined first similarities and all the determined second similarities;
S33, determining a third similarity between each text semantic feature in the text semantic feature sequence of each image-text pair and each image semantic feature in the image semantic feature sequence of the image-text pair, and a fourth similarity between each image semantic feature in the image semantic feature sequence of the image-text pair and each text semantic feature in the text semantic feature sequence of the image-text pair;
S34, determining a local contrast loss term according to all the determined third similarity and all the determined fourth similarity;
S35, determining a cross-modal contrast loss term according to the global contrast loss term and the local contrast loss term.
Specifically, in step S31, the first similarity and the second similarity are used to determine a similarity of the global features of the image and the text of each image-text pair in the training batch.
For ease of illustration, in embodiments of the present application, where M image-text pairs are included in a training batch, for the ith text in each image-text pair in the training batch, the ith image that is semantically similar to it may be considered a positive sample of contrast learning, and the dissimilar image may be considered a negative sample of contrast learning, i.e., matching images related to text semantics according to text. Similarly, for the ith image, the ith text that is semantically similar to it can be considered as a positive sample of contrast learning, and dissimilar text can be considered as a negative sample of contrast learning, i.e., text related to the image semantics according to image matching.
Based on this, the i-th aligned text global feature $e_i^{t,g}$ of the image-text pair and the c-th aligned image global feature $e_c^{v,g}$ in the training batch are input into the global contrast alignment module for global contrast alignment to determine the first similarity, and the i-th aligned image global feature $e_i^{v,g}$ of the image-text pair and the c-th aligned text global feature $e_c^{t,g}$ in the training batch are input into the global contrast alignment module for global contrast alignment to determine the second similarity. The first similarity and the second similarity may be expressed as:

$$s_i^{t\to v} = -\log \frac{\exp\!\big(\mathrm{sim}(e_i^{t,g},\, e_i^{v,g})/\tau_1\big)}{\sum_{c=1}^{M} \exp\!\big(\mathrm{sim}(e_i^{t,g},\, e_c^{v,g})/\tau_1\big)}, \qquad s_i^{v\to t} = -\log \frac{\exp\!\big(\mathrm{sim}(e_i^{v,g},\, e_i^{t,g})/\tau_1\big)}{\sum_{c=1}^{M} \exp\!\big(\mathrm{sim}(e_i^{v,g},\, e_c^{t,g})/\tau_1\big)}$$

where $s_i^{t\to v}$ represents the first similarity, $s_i^{v\to t}$ represents the second similarity, $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity function (e.g., cosine similarity), $\tau_1$ is a temperature hyper-parameter, $M$ represents the number of training samples included in the training batch, $e_i^{t,g}$ represents the i-th aligned text global feature, $e_c^{v,g}$ represents the c-th aligned image global feature, $e_i^{v,g}$ represents the i-th aligned image global feature, and $e_c^{t,g}$ represents the c-th aligned text global feature.
In step S32, after the first similarity and the second similarity are obtained, a global contrast loss term may be calculated according to the first similarity and the second similarity, where the global contrast loss term may be determined by summing the first similarity and the second similarity, or may be obtained by weighting the first similarity and the second similarity. In the embodiment of the present application, the global contrast loss term may be expressed as:
$$\mathcal{L}_{glo} = \frac{1}{2M}\sum_{i=1}^{M}\big(s_i^{t\to v} + s_i^{v\to t}\big)$$

where $\mathcal{L}_{glo}$ represents the global contrast loss term, $M$ represents the number of training samples included in the training batch, $s_i^{t\to v}$ represents the first similarity, and $s_i^{v\to t}$ represents the second similarity.
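Under the formulation above, the global contrastive alignment can be sketched as follows; the use of cosine similarity and the cross-entropy form of the InfoNCE objective are assumptions consistent with, but not prescribed by, the description:

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(e_t, e_v, tau: float = 0.07):
    """e_t, e_v: (M, d) aligned text / image global features of one batch.
    For the i-th text the i-th image is the positive; all other images in
    the batch are negatives (and symmetrically for images)."""
    e_t = F.normalize(e_t, dim=-1)
    e_v = F.normalize(e_v, dim=-1)
    logits = e_t @ e_v.t() / tau                        # (M, M) pairwise similarities
    targets = torch.arange(e_t.size(0), device=e_t.device)
    loss_t2v = F.cross_entropy(logits, targets)         # first-similarity terms
    loss_v2t = F.cross_entropy(logits.t(), targets)     # second-similarity terms
    return 0.5 * (loss_t2v + loss_v2t)
```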
In step S33, the third similarity and the fourth similarity are used to determine similarity of the local features of the image and text for each image-text pair in the training set.
Specifically, the positive sample of contrast learning in the local contrast alignment module consists of the semantic tags corresponding to the images and text in the image-text pairs in the training batch, and the negative sample consists of the semantic tags of the images and the rest of the text in the image-text pairs. For example, the image in the ith image-text pair in the training batch, and the p-th semantic tag of the text in the ith image-text pair may constitute a positive sample, or the text in the ith image-text pair in the training batch, and the p-th semantic tag of the image in the ith image-text pair may constitute a positive sample.
In the embodiment of the application, the text semantic feature sequence contains a plurality of text semantic features. The p-th text semantic feature $E_{i,p}^{t}$ in the text semantic feature sequence of the image-text pair and the j-th image semantic feature $E_{i,j}^{v}$ in the image semantic feature sequence of the image-text pair are subjected to local contrast alignment, and the third similarity is determined according to contrast learning; the image semantic feature sequence contains a plurality of image semantic features, the p-th image semantic feature $E_{i,p}^{v}$ in the image semantic feature sequence of the image-text pair and the j-th text semantic feature $E_{i,j}^{t}$ in the text semantic feature sequence of the image-text pair are subjected to local contrast alignment, and the fourth similarity is determined according to contrast learning. The third similarity and the fourth similarity can be expressed as:

$$s_{i,p}^{t\to v} = -\log \frac{\exp\!\big(\mathrm{sim}(E_{i,p}^{t},\, E_{i,p}^{v})/\tau_2\big)}{\sum_{j=1}^{P} \exp\!\big(\mathrm{sim}(E_{i,p}^{t},\, E_{i,j}^{v})/\tau_2\big)}, \qquad s_{i,p}^{v\to t} = -\log \frac{\exp\!\big(\mathrm{sim}(E_{i,p}^{v},\, E_{i,p}^{t})/\tau_2\big)}{\sum_{j=1}^{P} \exp\!\big(\mathrm{sim}(E_{i,p}^{v},\, E_{i,j}^{t})/\tau_2\big)}$$

where $s_{i,p}^{t\to v}$ represents the third similarity, $s_{i,p}^{v\to t}$ represents the fourth similarity, $E_{i,p}^{t}$ represents the p-th text semantic feature in the text semantic feature sequence $E_i^{t}$, $E_{i,p}^{v}$ represents the p-th image semantic feature in the image semantic feature sequence $E_i^{v}$, $E_{i,j}^{v}$ represents the j-th image semantic feature in the image semantic feature sequence $E_i^{v}$, $E_{i,j}^{t}$ represents the j-th text semantic feature in the text semantic feature sequence $E_i^{t}$, $P$ represents the number of image semantic features or the number of text semantic features, and $\tau_2$ is a temperature hyper-parameter.
In step S34, the local contrast loss term may be determined by summing the third similarity and the fourth similarity, or may be obtained by weighting the third similarity and the fourth similarity. In an embodiment of the present application, the local contrast loss term may be expressed as:

$$\mathcal{L}_{loc} = \frac{1}{2MP}\sum_{i=1}^{M}\sum_{p=1}^{P}\big(s_{i,p}^{t\to v} + s_{i,p}^{v\to t}\big)$$

where $\mathcal{L}_{loc}$ represents the local contrast loss term, $M$ represents the number of training samples included in the training batch, $P$ represents the number of image semantic features or the number of text semantic features, $s_{i,p}^{t\to v}$ represents the third similarity, and $s_{i,p}^{v\to t}$ represents the fourth similarity.
In step S35, the cross-modal contrast loss term may be determined by summing the global contrast loss term and the local contrast loss term, or may be obtained by weighting the global contrast loss term and the local contrast loss term. Wherein the cross-modal contrast loss term may be expressed as:
$$\mathcal{L}_{cmc} = \mathcal{L}_{glo} + \alpha\,\mathcal{L}_{loc}$$

where $\mathcal{L}_{cmc}$ represents the cross-modal contrast loss term, $\mathcal{L}_{glo}$ represents the global contrast loss term, $\mathcal{L}_{loc}$ represents the local contrast loss term, and $\alpha$ represents a balance hyper-parameter.
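Continuing the earlier sketch (and reusing its global_contrastive_loss), a hedged illustration of the local contrast loss over the P concept-level semantic features and of the combined cross-modal contrast loss with the balance hyper-parameter α; the cosine-similarity and cross-entropy form is again an assumption:

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(E_t, E_v, tau: float = 0.07):
    """E_t, E_v: (M, P, d) text / image semantic feature sequences.
    For pair i, the p-th text concept and the p-th image concept are
    positives; the other P-1 concepts of the same pair are negatives."""
    E_t = F.normalize(E_t, dim=-1)
    E_v = F.normalize(E_v, dim=-1)
    logits = E_t @ E_v.transpose(1, 2) / tau            # (M, P, P) concept-level similarities
    M, P, _ = logits.shape
    targets = torch.arange(P, device=logits.device).expand(M, P)
    loss_t2v = F.cross_entropy(logits.reshape(M * P, P), targets.reshape(-1))
    loss_v2t = F.cross_entropy(logits.transpose(1, 2).reshape(M * P, P), targets.reshape(-1))
    return 0.5 * (loss_t2v + loss_v2t)

def cross_modal_contrast_loss(e_t, e_v, E_t, E_v, alpha: float = 0.5):
    # global_contrastive_loss is defined in the earlier sketch
    return global_contrastive_loss(e_t, e_v) + alpha * local_contrastive_loss(E_t, E_v)
```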
S40, determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features.
Specifically, the text fusion feature is a feature that fuses the global feature and the local features of the text, and the image fusion feature is a feature that fuses the global feature and the local features of the image. In the embodiment of the application, the text fusion feature can be obtained by inputting the text semantic feature sequence into the pooling layer of one path of the cross-media contrast type alignment module for a pooling operation, and inputting the pooled text semantic features together with the aligned text global feature into the adder of that path. Similarly, the image fusion feature can be obtained by inputting the image semantic feature sequence into the pooling layer of the other path for a pooling operation, and inputting the pooled image semantic features together with the aligned image global feature into the adder of the other path. The text fusion feature and the image fusion feature can be expressed as:

$$\hat{f}^{t}=\tilde{f}^{t}+\mathrm{GAP}\!\left(C^{t}\right),\qquad \hat{f}^{v}=\tilde{f}^{v}+\mathrm{GAP}\!\left(C^{v}\right)$$

wherein $\hat{f}^{t}$ denotes the text fusion feature, $\hat{f}^{v}$ denotes the image fusion feature, $\tilde{f}^{t}$ denotes the aligned text global feature, $\tilde{f}^{v}$ denotes the aligned image global feature, $C^{t}$ denotes the text semantic feature sequence, $C^{v}$ denotes the image semantic feature sequence, and GAP denotes global average pooling.
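A minimal sketch of this fusion step follows, assuming the semantic feature sequences are stored as (M, P, D) tensors; the function name fuse_features is illustrative only.

```python
import torch

def fuse_features(aligned_global, semantic_tokens):
    """Fusion feature = aligned global feature + GAP over the semantic sequence.

    aligned_global:  (M, D) aligned global features
    semantic_tokens: (M, P, D) local semantic feature sequence
    """
    pooled = semantic_tokens.mean(dim=1)  # global average pooling over the P tokens
    return aligned_global + pooled        # the adder of the alignment module

# usage (illustrative):
# text_fused = fuse_features(text_global_aligned, text_semantic_tokens)
# image_fused = fuse_features(image_global_aligned, image_semantic_tokens)
```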
S50, determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features.
Specifically, the first hash feature is a text hash feature and the second hash feature is an image hash feature. In the embodiment of the application, the text fusion feature is input into the hash linear projection layer of one path of the multi-hash collaborative learning module and encoded by W hash encoders to obtain W text hash features, namely the first hash features; the image fusion feature is input into the hash linear projection layer of the other path and encoded by W hash encoders to obtain W image hash features, namely the second hash features. The first hash feature and the second hash feature can be expressed as:

$$h^{t}_{w}=\mathrm{Encoder}_{w}\!\left(\hat{f}^{t};\theta_{enc,w}\right)\in\mathbb{R}^{K_{w}},\qquad h^{v}_{w}=\mathrm{Encoder}_{w}\!\left(\hat{f}^{v};\theta_{enc,w}\right)\in\mathbb{R}^{K_{w}},\qquad w=1,\dots,W$$

wherein $h^{t}_{w}$ denotes the w-th first hash feature, $h^{v}_{w}$ denotes the w-th second hash feature, $\hat{f}^{t}$ denotes the text fusion feature, $\hat{f}^{v}$ denotes the image fusion feature, $\mathrm{Encoder}_{w}$ denotes the w-th encoding function, $\theta_{enc,w}$ is its trainable weight parameter, $K_{w}$ denotes the w-th hash code length, and $\mathbb{R}^{K_{w}}$ denotes the dimension of the w-th hash feature.
According to the embodiment, through the multi-hash collaborative learning module, hash features of different lengths of different media can be synchronously learned according to text fusion features and image fusion features, so that hash codes of different lengths of different media can be generated simultaneously, and therefore efficiency and accuracy of cross-media retrieval are improved.
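As a sketch, the W hash encoders of the multi-hash collaborative learning module can be realised as parallel linear projections, one per target code length; the class name MultiHashHead, the tanh relaxation and the example code lengths are assumptions.

```python
import torch
import torch.nn as nn

class MultiHashHead(nn.Module):
    """Projects a fused feature into W continuous hash features,
    one for each requested code length K_w (e.g. 16, 32 and 64 bits)."""

    def __init__(self, feat_dim, code_lengths=(16, 32, 64)):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Linear(feat_dim, k) for k in code_lengths])

    def forward(self, fused):  # fused: (M, feat_dim)
        # tanh keeps the relaxed codes in (-1, 1) before sign() binarisation
        return [torch.tanh(enc(fused)) for enc in self.encoders]

# usage (illustrative): one head per medium, applied to the fused features
# text_hash_feats = MultiHashHead(512)(text_fused)    # list of W tensors
# image_hash_feats = MultiHashHead(512)(image_fused)
```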
S60, constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch.
Specifically, in order to better preserve the pairwise semantic similarity between the image and the text in an image-text pair and to learn the first hash features and the second hash features of the two media, the multi-hash collaborative learning module jointly considers intra-media similarity preservation, inter-media similarity preservation, and unified hash code learning for the text hash codes and image hash codes of different lengths. Intra-media similarity preservation means that semantically similar data within each medium should be projected close to each other in Hamming space, while semantically dissimilar data should be far apart. Inter-media similarity preservation refers to preserving the semantic similarity between media. Unified hash code learning means learning a consistent hash code representation across media.
In one implementation manner of this embodiment, the constructing a hash loss term according to at least two first hash features, at least two second hash features, and a semantic similarity matrix specifically includes:
S61, for each image-text pair in the training batch, determining a first hash feature inner product between each first hash feature $h^{t}_{i,w}$ of the image-text pair and the first hash feature $h^{t}_{j,w}$ of each image-text pair in the training set, and a second hash feature inner product between each second hash feature $h^{v}_{i,w}$ of the image-text pair and the second hash feature $h^{v}_{j,w}$ of each image-text pair in the training set; determining a third hash feature inner product between each first hash feature $h^{t}_{i,w}$ of the image-text pair and the second hash feature $h^{v}_{j,w}$ of each image-text pair in the training set, and a fourth hash feature inner product between each second hash feature $h^{v}_{i,w}$ of the image-text pair and the first hash feature $h^{t}_{j,w}$ of each image-text pair in the training set;

S62, determining an intra-media loss term according to all the determined first hash feature inner products, all the determined second hash feature inner products and the semantic similarity matrix;

S63, determining an inter-media loss term according to all the determined third hash feature inner products, all the determined fourth hash feature inner products and the semantic similarity matrix;

S64, for each image-text pair in the training batch, determining a hash feature $b_{i,w}$ according to each first hash feature $h^{t}_{i,w}$ and its corresponding second hash feature $h^{v}_{i,w}$, and determining a hash quantization term according to all hash features $b_{i,w}$, all first hash features $h^{t}_{i,w}$ and all second hash features $h^{v}_{i,w}$;

S65, determining a hash loss term according to the intra-media loss term, the inter-media loss term and the hash quantization term.
Specifically, in step S61, the first hash feature inner product is a hash feature inner product within the text medium, and the second hash feature inner product is a hash feature inner product within the image medium.

In order to achieve similarity preservation within the text medium, a hash feature inner product within the text, i.e. the first hash feature inner product, can be obtained from the text hash features; in other words, the first hash feature inner product is obtained from the i-th first hash feature $h^{t}_{i,w}$ of the image-text pair and the j-th first hash feature $h^{t}_{j,w}$ of each image-text pair in the training set. In order to achieve similarity preservation within the image medium, a hash feature inner product within the image, i.e. the second hash feature inner product, can be obtained from the image hash features; in other words, the second hash feature inner product is obtained from the i-th second hash feature $h^{v}_{i,w}$ and the j-th second hash feature $h^{v}_{j,w}$. The first hash feature inner product and the second hash feature inner product can be expressed as:

$$\Theta^{t}_{i,j,w}=\tfrac{1}{2}\,(h^{t}_{i,w})^{\top}h^{t}_{j,w},\qquad \Theta^{v}_{i,j,w}=\tfrac{1}{2}\,(h^{v}_{i,w})^{\top}h^{v}_{j,w}$$

wherein $\Theta^{t}_{i,j,w}$ denotes the first hash feature inner product, $\Theta^{v}_{i,j,w}$ denotes the second hash feature inner product, $h^{t}_{i,w}$ denotes the i-th first hash feature, $h^{t}_{j,w}$ denotes the j-th first hash feature, $h^{v}_{i,w}$ denotes the i-th second hash feature, and $h^{v}_{j,w}$ denotes the j-th second hash feature.
The third hash feature inner product and the fourth hash feature inner product are feature inner products between the image and the text. In the embodiment of the application, in order to achieve similarity preservation between the image and text media, the third hash feature inner product can be determined from the i-th first hash feature $h^{t}_{i,w}$ of the image-text pair and the j-th second hash feature $h^{v}_{j,w}$ of each image-text pair in the training set; the fourth hash feature inner product can be determined from the i-th second hash feature $h^{v}_{i,w}$ of the image-text pair and the j-th first hash feature $h^{t}_{j,w}$ of each image-text pair in the training set. The third hash feature inner product and the fourth hash feature inner product can be expressed as:

$$\Theta_{i,j,w}=\tfrac{1}{2}\,(h^{t}_{i,w})^{\top}h^{v}_{j,w},\qquad \Phi_{i,j,w}=\tfrac{1}{2}\,(h^{v}_{i,w})^{\top}h^{t}_{j,w}$$

wherein $\Theta_{i,j,w}$ denotes the third hash feature inner product, $\Phi_{i,j,w}$ denotes the fourth hash feature inner product, $h^{t}_{i,w}$ denotes the i-th first hash feature, $h^{v}_{j,w}$ denotes the j-th second hash feature, $h^{v}_{i,w}$ denotes the i-th second hash feature, and $h^{t}_{j,w}$ denotes the j-th first hash feature.
In step S62, the intra-media loss terms include a text intra-media loss term and an image intra-media loss term, where the text intra-media loss term is obtained according to all first hash feature inner products and semantic similarity matrices, and the image intra-media loss term is obtained according to all second hash feature inner products and semantic similarity matrices.
In the embodiment of the application, the text intra-media loss term is obtained according to all the first hash feature inner products and the semantic similarity matrix, specifically as follows:

determining an asymmetric pairwise negative log likelihood loss within the text medium according to all the first hash feature inner products and the semantic similarity matrix;

determining the text intra-media loss term according to the asymmetric pairwise negative log likelihood loss within the text medium.
Specifically, the asymmetric pairwise negative log likelihood loss within the text medium can be expressed as:

$$\mathcal{J}^{t}_{i,j,w}=-\left(S_{ij}\,\Theta^{t}_{i,j,w}-\log\!\left(1+e^{\Theta^{t}_{i,j,w}}\right)\right)$$

wherein $\mathcal{J}^{t}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss within the text medium, $S_{ij}$ denotes the semantic similarity matrix, and $\Theta^{t}_{i,j,w}$ denotes the first hash feature inner product.

Then, the text intra-media loss term can be expressed as:

$$\mathcal{L}^{t}_{intra,w}=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{J}^{t}_{i,j,w}$$

wherein $\mathcal{L}^{t}_{intra,w}$ denotes the text intra-media loss term, $\mathcal{J}^{t}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss within the text medium, M denotes the number of training samples included in the training batch, and N denotes the number of training samples in the training sample set.
In this embodiment, the image intra-media loss term is obtained according to all the second hash feature inner products and the semantic similarity matrix, specifically as follows:

determining an asymmetric pairwise negative log likelihood loss within the image medium according to all the second hash feature inner products and the semantic similarity matrix;

determining the image intra-media loss term according to the asymmetric pairwise negative log likelihood loss within the image medium.
Specifically, the asymmetric pairwise negative log likelihood loss within the image medium can be expressed as:

$$\mathcal{J}^{v}_{i,j,w}=-\left(S_{ij}\,\Theta^{v}_{i,j,w}-\log\!\left(1+e^{\Theta^{v}_{i,j,w}}\right)\right)$$

wherein $\mathcal{J}^{v}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss within the image medium, $S_{ij}$ denotes the semantic similarity matrix, and $\Theta^{v}_{i,j,w}$ denotes the second hash feature inner product.

Then, the image intra-media loss term can be expressed as:

$$\mathcal{L}^{v}_{intra,w}=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{J}^{v}_{i,j,w}$$

wherein $\mathcal{L}^{v}_{intra,w}$ denotes the image intra-media loss term, $\mathcal{J}^{v}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss within the image medium, M denotes the number of training samples included in the training batch, and N denotes the number of training samples in the training sample set.
In step S63, the inter-media loss term is determined according to all the determined third hash feature inner products, all the determined fourth hash feature inner products and the semantic similarity matrix, specifically:

determining an asymmetric pairwise negative log likelihood loss between the text and the image according to all the third hash feature inner products and the semantic similarity matrix;

determining an asymmetric pairwise negative log likelihood loss between the image and the text according to all the fourth hash feature inner products and the semantic similarity matrix;

determining the inter-media loss term according to the asymmetric pairwise negative log likelihood loss between the text and the image and the asymmetric pairwise negative log likelihood loss between the image and the text.
Specifically, the asymmetric pairwise negative log likelihood loss between the text and the image can be expressed as:

$$\mathcal{J}^{t\to v}_{i,j,w}=-\left(S_{ij}\,\Theta_{i,j,w}-\log\!\left(1+e^{\Theta_{i,j,w}}\right)\right)$$

wherein $\mathcal{J}^{t\to v}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss between the text and the image, $\Theta_{i,j,w}$ denotes the third hash feature inner product, and $S_{ij}$ denotes the semantic similarity matrix.

The asymmetric pairwise negative log likelihood loss between the image and the text can be expressed as:

$$\mathcal{J}^{v\to t}_{i,j,w}=-\left(S_{ij}\,\Phi_{i,j,w}-\log\!\left(1+e^{\Phi_{i,j,w}}\right)\right)$$

wherein $\mathcal{J}^{v\to t}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss between the image and the text, $\Phi_{i,j,w}$ denotes the fourth hash feature inner product, and $S_{ij}$ denotes the semantic similarity matrix.

Then, the inter-media loss term can be expressed as:

$$\mathcal{L}_{inter,w}=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(\mathcal{J}^{t\to v}_{i,j,w}+\mathcal{J}^{v\to t}_{i,j,w}\right)$$

wherein $\mathcal{L}_{inter,w}$ denotes the inter-media loss term, $\mathcal{J}^{t\to v}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss between the text and the image, $\mathcal{J}^{v\to t}_{i,j,w}$ denotes the asymmetric pairwise negative log likelihood loss between the image and the text, M denotes the number of training samples included in the training batch, and N denotes the number of training samples in the training sample set.
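The intra-media and inter-media likelihood terms can be sketched with a single helper; the identity log(1 + e^x) = softplus(x) is used for numerical stability, and the 1/2 scaling of the hash feature inner product is an assumption consistent with common deep cross-modal hashing formulations.

```python
import torch
import torch.nn.functional as F

def pairwise_nll(h_query, h_db, S):
    """Asymmetric pairwise negative log-likelihood loss for one code length.

    h_query: (M, K) hash features of the current training batch
    h_db:    (N, K) hash features of the whole training set
    S:       (M, N) binary semantic similarity matrix
    """
    theta = 0.5 * h_query @ h_db.t()          # (M, N) hash feature inner products
    # mean over all pairs of -(S * theta - log(1 + exp(theta)))
    return (F.softplus(theta) - S * theta).mean()

# intra-media: pairwise_nll(h_text_w, H_text_w, S) and pairwise_nll(h_img_w, H_img_w, S)
# inter-media: pairwise_nll(h_text_w, H_img_w, S) and pairwise_nll(h_img_w, H_text_w, S)
```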
In step S64, for the purpose of unified hash code learning, for the i-th image-text pair in the training batch, the hash feature $b_{i,w}$ of the w-th unified code of the i-th image-text pair is determined from each first hash feature $h^{t}_{i,w}$ and its corresponding second hash feature $h^{v}_{i,w}$, wherein the hash feature $b_{i,w}$ can be expressed as:

$$b_{i,w}=\operatorname{sign}\!\left(h^{t}_{i,w}+h^{v}_{i,w}\right)$$

wherein $b_{i,w}$ denotes the hash feature, $h^{t}_{i,w}$ and $h^{v}_{i,w}$ respectively denote the first hash feature and the second hash feature of the i-th image-text pair in the training batch, and sign denotes the element-wise sign function.
Then, according to all the determined hash features $b_{i,w}$, all first hash features $h^{t}_{i,w}$ and all second hash features $h^{v}_{i,w}$, a hash quantization term can be determined, wherein the hash quantization term can be expressed as:

$$\mathcal{L}_{quan,w}=\frac{1}{MK_{w}}\sum_{i=1}^{M}\left(\left\|b_{i,w}-h^{t}_{i,w}\right\|_{2}^{2}+\left\|b_{i,w}-h^{v}_{i,w}\right\|_{2}^{2}\right)$$

wherein $\mathcal{L}_{quan,w}$ denotes the hash quantization term, $b_{i,w}$ denotes the hash feature, $h^{t}_{i,w}$ and $h^{v}_{i,w}$ respectively denote the first hash feature and the second hash feature of the i-th image-text pair in the training batch, $K_{w}$ denotes the w-th hash code length, and M denotes the number of training samples included in the training batch.
In step S65, the hash loss term may be determined by summing the text intra-media loss term, the image intra-media loss term, the inter-media loss term and the hash quantization term, or may be obtained by weighting the text intra-media loss term, the image intra-media loss term, the inter-media loss term and the hash quantization term. The hash loss term may be expressed as:

$$\mathcal{L}_{hash}=\sum_{w=1}^{W}\left(\mathcal{L}^{t}_{intra,w}+\mathcal{L}^{v}_{intra,w}+\beta\,\mathcal{L}_{inter,w}+\gamma\,\mathcal{L}_{quan,w}\right)$$

wherein $\mathcal{L}_{hash}$ denotes the hash loss term, $\mathcal{L}^{t}_{intra,w}$ denotes the text intra-media loss term, $\mathcal{L}^{v}_{intra,w}$ denotes the image intra-media loss term, $\mathcal{L}_{inter,w}$ denotes the inter-media loss term, $\mathcal{L}_{quan,w}$ denotes the hash quantization term, W denotes the number of hash codes, and β, γ denote balance hyper-parameters.
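A sketch of the unified code and quantization penalty for the W code lengths is given below; b_{i,w} = sign(h^t + h^v) follows the description above, while the mean-squared form of the quantization error and the placement of the β and γ weights in the usage comment are assumptions.

```python
import torch

def hash_quantization_loss(h_text_list, h_img_list):
    """Unified-code quantization term summed over the W code lengths.

    h_text_list / h_img_list: lists of W tensors, each (M, K_w),
    holding the relaxed text / image hash features of the batch."""
    loss = 0.0
    for h_t, h_v in zip(h_text_list, h_img_list):
        b = torch.sign(h_t + h_v).detach()       # shared binary code b_{i,w}
        m, k_w = h_t.shape
        loss = loss + ((b - h_t).pow(2).sum() + (b - h_v).pow(2).sum()) / (m * k_w)
    return loss

# total hash loss (weighting is illustrative):
# L_hash = sum_w(L_intra_text_w + L_intra_img_w + beta * L_inter_w + gamma * L_quan_w)
```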
And S70, updating parameters of the initial hash model based on the cross-modal comparison loss item and the hash loss item to obtain a multi-bit hash code network model.
Specifically, the multi-bit hash code network model is used to hierarchically align the global and local features of different media, which effectively reduces the heterogeneity and semantic gap between different media, and to simultaneously generate a plurality of hash codes of different lengths for different media, thereby improving the efficiency and accuracy of cross-media retrieval.
In an implementation manner of this embodiment, before the updating, based on the cross-modal comparison loss term and the hash loss term, parameters of the initial hash model to obtain a multi-bit hash code network model, the method further includes:
S71, for each image-text pair in the training batch, determining text auxiliary hash features based on the text fusion features, and determining image auxiliary hash features based on the image fusion features;
S72, learning a text hash code according to the text auxiliary hash feature, learning an image hash code according to the image auxiliary hash feature, and determining an auxiliary hash code according to the text hash code and the image hash code;
S73, mapping at least two first hash features and at least two second hash features to a Hamming space where the auxiliary hash features are located to obtain at least two first hash codes and at least two second hash codes;
s74, determining an auxiliary hash loss item according to the text auxiliary hash characteristic and the image auxiliary hash characteristic of the image-text pair in the training batch, and determining a reconstruction loss item according to the auxiliary hash code, the at least two first hash codes and the at least two second hash codes;
s75, determining a correction loss term according to the auxiliary hash loss term and the reconstruction loss term, and correcting the hash loss term based on the correction loss term.
Specifically, in order to further enhance the discrimination of the hash codes, the multi-bit hash code network model may further include an auxiliary hash collaborative learning module, where the auxiliary hash collaborative learning module includes an auxiliary hash linear projection layer (AuxHash) configured to obtain the text auxiliary hash feature and the image auxiliary hash feature from the text fusion feature and the image fusion feature. The auxiliary hash linear projection layer further comprises an online hash learner, which is used to learn the text hash code and the image hash code from the text auxiliary hash feature and the image auxiliary hash feature respectively, so as to determine the auxiliary hash code.
In the embodiment of the application, the auxiliary hash collaborative learning module comprises two paths of auxiliary hash linear projection layers, wherein one path is used for determining the text hash code according to the text fusion characteristics, and the other path is used for determining the image hash code according to the image fusion characteristics.
In step S71, the text auxiliary hash feature can be obtained by mapping the text fusion feature through the auxiliary hash linear projection layer of one path; the image auxiliary hash feature can be obtained by mapping the image fusion feature through the auxiliary hash linear projection layer of the other path. The text auxiliary hash feature and the image auxiliary hash feature can be expressed as:

$$h^{t}_{aux}=\mathrm{AuxHash}\!\left(\hat{f}^{t};\theta_{aux}\right)\in\mathbb{R}^{K_{aux}},\qquad h^{v}_{aux}=\mathrm{AuxHash}\!\left(\hat{f}^{v};\theta_{aux}\right)\in\mathbb{R}^{K_{aux}}$$

wherein $h^{t}_{aux}$ denotes the text auxiliary hash feature, $h^{v}_{aux}$ denotes the image auxiliary hash feature, $\hat{f}^{t}$ denotes the text fusion feature, $\hat{f}^{v}$ denotes the image fusion feature, AuxHash denotes the auxiliary hash mapping function, $K_{aux}$ denotes the length of the auxiliary hash code, $\mathbb{R}^{K_{aux}}$ denotes the dimension of the auxiliary hash feature, and $\theta_{aux}$ denotes a trainable parameter.
In step S72, the text auxiliary hash feature $h^{t}_{aux}$ is input to the online hash learner of the auxiliary hash linear projection layer of one path, so that a text hash code of length $K_{aux}$ can be learned; the image auxiliary hash feature $h^{v}_{aux}$ is input to the online hash learner of the auxiliary hash linear projection layer of the other path, so that an image hash code of length $K_{aux}$ can be learned. From the text hash code and the image hash code, an auxiliary hash code can be determined, wherein the auxiliary hash code can be expressed as:

$$b_{i,aux}=\operatorname{sign}\!\left(h^{t}_{i,aux}+h^{v}_{i,aux}\right)$$

wherein $b_{i,aux}$ denotes the auxiliary hash code, $h^{t}_{i,aux}$ denotes the text auxiliary hash feature, $h^{v}_{i,aux}$ denotes the image auxiliary hash feature, and sign denotes the element-wise sign function.
In step S73, the first hash code is a text hash code and the second hash code is an image hash code. In the embodiment of the application, the auxiliary hash collaborative learning module further comprises an auxiliary Hamming space, which is used to obtain at least two text hash codes and at least two image hash codes, namely at least two first hash codes and at least two second hash codes, based on the hash features.

Specifically, the at least two first hash features $h^{t}_{i,w}$ are decoded by W text hash decoders and mapped to the auxiliary Hamming space to obtain the at least two first hash codes; the at least two second hash features $h^{v}_{i,w}$ are decoded by W image hash decoders and mapped to the auxiliary Hamming space to obtain the at least two second hash codes. The first hash code and the second hash code can be expressed as:

$$\hat{b}^{t}_{i,w}=\mathrm{Decoder}_{w}\!\left(h^{t}_{i,w};\theta_{dec,w}\right)\in\mathbb{R}^{K_{aux}},\qquad \hat{b}^{v}_{i,w}=\mathrm{Decoder}_{w}\!\left(h^{v}_{i,w};\theta_{dec,w}\right)\in\mathbb{R}^{K_{aux}}$$

wherein $\hat{b}^{t}_{i,w}$ denotes the first hash code, $\hat{b}^{v}_{i,w}$ denotes the second hash code, $h^{t}_{i,w}$ denotes the first hash feature, $h^{v}_{i,w}$ denotes the second hash feature, $\mathrm{Decoder}_{w}$ denotes the w-th decoding function, $\theta_{dec,w}$ is its trainable weight parameter, $K_{aux}$ denotes the length of the auxiliary hash code, and $\mathbb{R}^{K_{aux}}$ denotes the dimension of the hash code.
In step S74, the specific steps for determining the auxiliary hash loss term from the text auxiliary hash features and the image auxiliary hash features of the image-text pairs in the training batch are the same as the specific steps for constructing the hash loss term from the at least two first hash features, the at least two second hash features and the semantic similarity matrix. The auxiliary hash loss term can be determined by summing an auxiliary text intra-media loss term, an auxiliary image intra-media loss term, an auxiliary inter-media loss term and an auxiliary hash quantization term, wherein the auxiliary text intra-media loss term is constructed in the same way as the text intra-media loss term, the auxiliary image intra-media loss term is constructed in the same way as the image intra-media loss term, the auxiliary inter-media loss term is constructed in the same way as the inter-media loss term, and the auxiliary hash quantization term is constructed in the same way as the hash quantization term.
Then, the auxiliary hash loss term may be expressed as:

$$\mathcal{L}_{aux}=\mathcal{L}^{t}_{aux\text{-}intra}+\mathcal{L}^{v}_{aux\text{-}intra}+\beta\,\mathcal{L}_{aux\text{-}inter}+\gamma\,\mathcal{L}_{aux\text{-}quan}$$

wherein $\mathcal{L}_{aux}$ denotes the auxiliary hash loss term, $\mathcal{L}^{t}_{aux\text{-}intra}$ denotes the auxiliary text intra-media loss term, $\mathcal{L}^{v}_{aux\text{-}intra}$ denotes the auxiliary image intra-media loss term, $\mathcal{L}_{aux\text{-}inter}$ denotes the auxiliary inter-media loss term, $\mathcal{L}_{aux\text{-}quan}$ denotes the auxiliary hash quantization term, and β, γ denote balance hyper-parameters.
The reconstruction loss term includes a text reconstruction loss term and an image reconstruction loss term, wherein the text reconstruction loss term is a text-based reconstruction loss and the image reconstruction loss term is an image-based reconstruction loss, and both are used to determine the correction loss term. Specifically, the text reconstruction loss term is determined from the auxiliary hash code and the at least two first hash codes, and the image reconstruction loss term is determined from the auxiliary hash code and the at least two second hash codes. The text reconstruction loss term and the image reconstruction loss term can be expressed as:

$$\mathcal{L}^{t}_{rec}=\frac{1}{MW}\sum_{i=1}^{M}\sum_{w=1}^{W}\left\|b_{i,aux}-\hat{b}^{t}_{i,w}\right\|_{2}^{2},\qquad \mathcal{L}^{v}_{rec}=\frac{1}{MW}\sum_{i=1}^{M}\sum_{w=1}^{W}\left\|b_{i,aux}-\hat{b}^{v}_{i,w}\right\|_{2}^{2}$$

wherein $\mathcal{L}^{t}_{rec}$ denotes the text reconstruction loss term, $\mathcal{L}^{v}_{rec}$ denotes the image reconstruction loss term, $b_{i,aux}$ denotes the auxiliary hash code, $\hat{b}^{t}_{i,w}$ denotes the first hash code, $\hat{b}^{v}_{i,w}$ denotes the second hash code, M denotes the number of training samples included in the training batch, and W denotes the number of hash codes.
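The auxiliary branch can be sketched as follows: each length-specific hash feature is decoded into the K_aux-dimensional auxiliary Hamming space and regressed onto the auxiliary code b_aux; the linear decoders, the tanh relaxation and the mean-squared reconstruction error are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AuxReconstruction(nn.Module):
    """Decodes W length-specific hash features into the auxiliary Hamming
    space (length k_aux) and measures reconstruction against b_aux."""

    def __init__(self, code_lengths=(16, 32, 64), k_aux=128):
        super().__init__()
        self.decoders = nn.ModuleList(
            [nn.Linear(k, k_aux) for k in code_lengths])

    def forward(self, hash_feats, b_aux):
        # hash_feats: list of W tensors (M, K_w); b_aux: (M, k_aux) in {-1, +1}
        recon = 0.0
        for dec, h in zip(self.decoders, hash_feats):
            recon = recon + (b_aux - torch.tanh(dec(h))).pow(2).mean()
        return recon / len(self.decoders)

# b_aux = torch.sign(h_text_aux + h_img_aux)          # auxiliary hash code
# L_rec = aux_dec_text(text_hash_feats, b_aux) + aux_dec_img(image_hash_feats, b_aux)
```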
In step S75, the correction loss term is used to correct the hash loss term and combines the auxiliary hash loss term and the reconstruction loss term. In the embodiment of the application, the correction loss term can be determined by summing the auxiliary hash loss term and the reconstruction loss term, or can be obtained by weighting the auxiliary hash loss term and the reconstruction loss term. The correction loss term can be expressed as:

$$\mathcal{L}_{cor}=\mathcal{L}_{aux}+\delta\left(\mathcal{L}^{t}_{rec}+\mathcal{L}^{v}_{rec}\right)$$

wherein $\mathcal{L}_{cor}$ denotes the correction loss term, $\mathcal{L}_{aux}$ denotes the auxiliary hash loss term, $\mathcal{L}^{t}_{rec}$ denotes the text reconstruction loss term, $\mathcal{L}^{v}_{rec}$ denotes the image reconstruction loss term, and δ denotes a balance hyper-parameter.
In this embodiment, the loss function of the multi-bit hash code network model may be constructed based on the correction loss term, the hash loss term and the cross-modal contrast loss term. The loss function can be determined by summing the correction loss term, the hash loss term and the cross-modal contrast loss term, or can be obtained by weighting the correction loss term, the hash loss term and the cross-modal contrast loss term. The loss function of the multi-bit hash code network model can be expressed as:

$$\mathcal{L}=\mathcal{L}_{cor}+\lambda\,\mathcal{L}_{cmc}+\mu\,\mathcal{L}_{hash}$$

wherein $\mathcal{L}$ denotes the loss function of the multi-bit hash code network model, $\mathcal{L}_{cor}$ denotes the correction loss term, $\mathcal{L}_{cmc}$ denotes the cross-modal contrast loss term, $\mathcal{L}_{hash}$ denotes the hash loss term, and λ, μ denote balance hyper-parameters.
According to the above embodiment, the correction loss term is obtained through the auxiliary hash collaborative learning module to correct the hash loss term, which further improves the quality of the plurality of hash codes of different lengths generated by the multi-bit hash code network model, so that high-quality hash codes of different lengths can be generated for different media and the accuracy of cross-media retrieval can be improved.
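Putting the terms together, one parameter update of the initial hash model could look like the following sketch; the return signature of the model and the placement of the λ and μ weights are assumptions.

```python
import torch

def training_step(batch, model, optimizer, lam=1.0, mu=1.0):
    """One parameter update of the initial hash model.

    `model` is assumed to return the correction, cross-modal contrast and
    hash loss terms for a batch of image-text pairs."""
    loss_cor, loss_cmc, loss_hash = model(batch)
    loss = loss_cor + lam * loss_cmc + mu * loss_hash   # total objective (weighting assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```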
In summary, the present embodiment provides a training method for a multi-bit hash code network model, where the method specifically includes: acquiring a training sample set, and constructing a semantic similarity matrix by utilizing semantic tags of training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs; inputting each image-text pair in the training batch into an initial hash model, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples through the initial hash model; aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, embedding text semantic features of a learning text local feature sequence and image semantic features of an image local feature sequence according to a sharing concept, and constructing a cross-modal contrast loss item based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features; determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features; determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features; constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch; and updating parameters of the initial hash model based on the cross-modal comparison loss item and the hash loss item to obtain a multi-bit hash code network model. According to the application, the global features and the local semantic tokens of various media information are obtained by constructing the multi-bit hash code network model, and the heterogeneity and the semantic gap between different media are effectively reduced by respectively carrying out hierarchical alignment on the global features and the local semantic tokens of the various media information. Meanwhile, a plurality of high-quality hash codes with different lengths are generated simultaneously through the global features and the local semantic tokens of all the media information after alignment, so that the efficiency and the accuracy of cross-media retrieval are improved.
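For the semantic similarity matrix mentioned in the summary, a minimal sketch is given below under the usual assumption of multi-hot semantic tags and the "share at least one tag" criterion; both assumptions are illustrative rather than taken from the application.

```python
import torch

def build_similarity_matrix(batch_labels, dataset_labels):
    """Binary semantic similarity matrix S (M x N).

    batch_labels:   (M, C) multi-hot semantic tags of the training batch
    dataset_labels: (N, C) multi-hot semantic tags of the training sample set
    S_ij = 1 if samples i and j share at least one semantic tag, else 0.
    """
    overlap = batch_labels.float() @ dataset_labels.float().t()
    return (overlap > 0).float()
```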
The embodiment also provides a cross-media retrieval method based on the multi-bit hash code, wherein the cross-media retrieval method based on the multi-bit hash code specifically comprises the following steps:
Acquiring a query sample, and determining at least a query hash code sequence of the query sample through the multi-bit hash code network model, wherein the query hash code sequence at least comprises two hash codes;
acquiring a target data set corresponding to the query sample, and determining a hash code sequence set corresponding to the target data set through the multi-bit hash code network model;
And determining target data corresponding to the query sample in the target data set according to the query hash code sequence and the hash code sequence set.
Specifically, the query sample is an image or a text, and the image or text is input into the multi-bit hash code network model to obtain an image hash code sequence or a text hash code sequence containing a plurality of hash codes of different lengths. The target data set comprises a plurality of target data, where the target data are image-text pairs; the target data in the target data set are input into the multi-bit hash code network model to obtain a plurality of target image hash code sequences or target text hash code sequences of different lengths, which form the hash code sequence set. The image hash code sequence or the text hash code sequence is then matched against the hash code sequence set to obtain target data semantically similar to the query sample. For example, if the query sample is a text and an image related to the text semantics needs to be retrieved, the acquired text hash code sequence is matched against the hash code sequence set to obtain images semantically similar to the text.
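A minimal retrieval sketch follows: the query hash code of a chosen length is compared against the database codes of the same length by Hamming distance; the variable names and the top-k ranking are illustrative.

```python
import torch

def hamming_retrieve(query_code, db_codes, top_k=10):
    """Ranks database items by Hamming distance to the query.

    query_code: (K,) tensor of +/-1 bits for the selected code length
    db_codes:   (N, K) tensor of +/-1 bits for the target data set
    """
    # for +/-1 codes, Hamming distance = (K - <b_q, b_i>) / 2
    dist = 0.5 * (query_code.numel() - db_codes @ query_code)
    return torch.topk(dist, k=top_k, largest=False).indices

# usage: pick, e.g., the 32-bit code from the query hash code sequence and the
# 32-bit codes from the hash code sequence set, then call hamming_retrieve.
```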
Based on the training method of the multi-bit hash code network model, the embodiment provides a training device of the multi-bit hash code network model, as shown in fig. 3, where the device specifically includes:
An obtaining module 100, configured to obtain a training sample set, and construct a semantic similarity matrix by using semantic labels of training samples in the training sample set, where the training sample set includes a plurality of training batches, and each training batch includes a plurality of image-text pairs;
The initial hash model 200 is used for inputting each image-text pair in the training batch into the initial hash model, and determining the text global feature, the text local feature sequence, the image global feature and the image local feature sequence of the training sample through the initial hash model; aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, embedding text semantic features of a learning text local feature sequence and image semantic features of an image local feature sequence according to a sharing concept, and constructing a cross-modal contrast loss item based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features; determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features; determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features; constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch;
And the parameter updating module 300 is configured to update parameters of the initial hash model based on the cross-modal comparison loss term and the hash loss term, so as to obtain a multi-bit hash code network model.
Based on the above-mentioned training method of the multi-bit hash code network model, the present embodiment provides a computer readable storage medium storing one or more programs executable by one or more processors to implement the steps in the training method of the multi-bit hash code network model as described in the above-mentioned embodiment.
Based on the training method of the multi-bit hash code network model, the application also provides a terminal device, as shown in fig. 4, which comprises at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, which may also include a communication interface (Communications Interface) 23 and a bus 24. Wherein the processor 20, the display 21, the memory 22 and the communication interface 23 may communicate with each other via a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the terminal device, etc. In addition, the memory 22 may include high-speed random access memory, and may also include nonvolatile memory. For example, a plurality of media capable of storing program codes such as a usb disk, a removable hard disk, a read-only memory (ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or a transitory storage medium may be used.
In addition, the specific processes by which the storage medium and the processor of the terminal device load and execute the plurality of instructions have been described in detail in the above method and are not repeated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. The training method of the multi-bit hash code network model is characterized by specifically comprising the following steps of:
Acquiring a training sample set, and constructing a semantic similarity matrix by utilizing semantic tags of training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs;
Inputting each image-text pair in the training batch into an initial hash model, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples through the initial hash model;
Aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, embedding text semantic features of a learning text local feature sequence and image semantic features of an image local feature sequence according to a sharing concept, and constructing a cross-modal contrast loss item based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features;
Determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features;
determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features;
Constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch;
and updating parameters of the initial hash model based on the cross-modal comparison loss item and the hash loss item to obtain a multi-bit hash code network model.
2. The training method of the multi-bit hash code network model according to claim 1, wherein the multi-bit hash code network model comprises a feature extraction module, a cross-media contrast type alignment module and a multi-hash collaborative learning module, the feature extraction module is connected with the cross-media contrast type alignment module, the cross-media contrast type alignment module is connected with the multi-hash collaborative learning module, and the feature extraction module is used for extracting text global features, text local feature sequences, image global features and image local feature sequences; the cross-media contrast type alignment module is used for determining text fusion features and image fusion features according to the text global features, the image global features, the text local feature sequences and the image local feature sequences; the multi-hash collaborative learning module is used for determining at least two first hash features according to the text fusion features, determining at least two first hash codes based on the at least two first hash features, determining at least two second hash features according to the image fusion features, and determining at least two second hash codes based on the at least two second hash features.
3. The training method of the multi-bit hash code network model according to claim 2, wherein the cross-media contrast type alignment module comprises a residual MLP unit, a local learning unit, a pooling layer and an adder, the residual MLP unit is connected with the adder, the local learning unit is connected with the adder through the pooling layer, the local learning unit comprises a cross attention layer and a Transformer layer which are sequentially connected, and the query vectors of the cross attention layer are the shared concept embeddings.
4. The training method of the multi-bit hash code network model according to claim 1, wherein the construction of the cross-modal contrast loss term based on the aligned text global feature, the aligned image global feature, the text semantic feature and the image semantic feature is specifically as follows:
For each image-text pair in a training batch, determining a first similarity between the aligned text global feature of the image-text pair and each aligned image global feature in the training batch, and a second similarity between the aligned image global feature of the image-text pair and each aligned text global feature in the training batch;
determining a global contrast loss term according to all the determined first similarities and all the determined second similarities;
determining a third similarity between each text semantic feature in the text semantic feature sequence of each image-text pair and each image semantic feature in the image semantic feature sequence of the image-text pair, and a fourth similarity between each image semantic feature in the image semantic feature sequence of the image-text pair and each text semantic feature in the text semantic feature sequence of the image-text pair;
determining a local contrast loss term according to all the determined third similarity and all the determined fourth similarity;
And determining a cross-modal contrast loss term according to the global contrast loss term and the local contrast loss term.
5. The method for training a multi-bit hash code network model according to claim 1, wherein constructing a hash loss term according to at least two first hash features, at least two second hash features, and a semantic similarity matrix specifically comprises:
for each image-text pair in the training batch, determining a first hash feature inner product between each first hash feature of the image-text pair and the first hash feature of each image-text pair in the training set, and a second hash feature inner product between each second hash feature of the image-text pair and the second hash feature of each image-text pair in the training set; determining a third hash feature inner product between each first hash feature of the image-text pair and the second hash feature of each image-text pair in the training set, and a fourth hash feature inner product between each second hash feature of the image-text pair and the first hash feature of each image-text pair in the training set;
Determining a loss item in the medium according to the determined inner products of all the first hash features, the determined inner products of all the second hash features and the semantic similarity matrix;
determining a loss item among media according to the determined third hash characteristic inner products, the determined fourth hash characteristic inner products and the semantic similarity matrix;
for each image-text pair in the training batch, determining a hash feature according to each first hash feature and its corresponding second hash feature, and determining a hash quantization term according to all hash features, all first hash features and all second hash features;
and determining a hash loss item according to the intra-media loss item, the inter-media loss item and the hash quantization item.
6. The method for training a multi-bit hash code network model according to claim 1, wherein before updating parameters of the initial hash model based on the cross-modal contrast loss term and the hash loss term to obtain the multi-bit hash code network model, the method further comprises:
For each image-text pair in the training batch, determining text-assisted hash features based on the text fusion features, and determining image-assisted hash features based on the image fusion features;
learning a text hash code according to the text auxiliary hash feature, learning an image hash code according to the image auxiliary hash feature, and determining an auxiliary hash code according to the text hash code and the image hash code;
Mapping at least two first hash features and at least two second hash features to a Hamming space where the text auxiliary hash features are located to obtain at least two first hash codes and at least two second hash codes;
Determining an auxiliary hash loss item according to the text auxiliary hash characteristic and the image auxiliary hash characteristic of the image-text pairs in the training batch, and determining a reconstruction loss item according to the auxiliary hash code, the at least two first hash codes and the at least two second hash codes;
and determining a correction loss term according to the auxiliary hash loss term and the reconstruction loss term, and correcting the hash loss term based on the correction loss term.
7. The cross-media retrieval method based on the multi-bit hash code is characterized by comprising the following steps of:
acquiring a query sample, and determining at least a query hash code sequence of the query sample through a multi-bit hash code network model, wherein the query hash code sequence at least comprises two hash codes;
acquiring a target data set corresponding to the query sample, and determining a hash code sequence set corresponding to the target data set through the multi-bit hash code network model;
And determining target data corresponding to the query sample in the target data set according to the query hash code sequence and the hash code sequence set.
8. The device for training the multi-bit hash code network model is characterized by comprising the following specific components:
The acquisition module is used for acquiring a training sample set and constructing a semantic similarity matrix by utilizing semantic tags of all training samples in the training sample set, wherein the training sample set comprises a plurality of training batches, and each training batch comprises a plurality of image-text pairs;
The initial hash model is used for inputting each image-text pair in the training batch into the initial hash model, and determining text global features, text local feature sequences, image global features and image local feature sequences of the training samples through the initial hash model; aligning the text global feature with the image global feature to obtain an aligned text global feature and an aligned image global feature, embedding text semantic features of a learning text local feature sequence and image semantic features of an image local feature sequence according to a sharing concept, and constructing a cross-modal contrast loss item based on the aligned text global feature, the aligned image global feature, the text semantic features and the image semantic features; determining text fusion features according to the aligned text global features and the text semantic features, and determining image fusion features according to the aligned image global features and the image semantic features; determining at least two first hash features according to the text fusion features, and determining at least two second hash features according to the image fusion features; constructing a hash loss term according to the semantic similarity matrix and at least two first hash features and at least two second hash features of each image-text pair in the training batch;
And the parameter updating module is used for updating the parameters of the initial hash model based on the cross-modal comparison loss item and the hash loss item so as to obtain a multi-bit hash code network model.
9. A computer readable storage medium storing one or more programs executable by one or more processors to implement steps in a method of training a multi-bit hash code network model as claimed in any one of claims 1 to 6 and/or to implement steps in a multi-bit hash code based cross-media retrieval method as claimed in claim 7.
10. A terminal device, comprising: a processor and a memory;
The memory has stored thereon a computer readable program executable by the processor;
The processor, when executing the computer readable program, implements the steps in the training method of the multi-bit hash code network model according to any of claims 1-6 and/or implements the steps in the multi-bit hash code based cross-media retrieval method according to claim 7.