CN114691897A - Deep adaptive multi-modal hash retrieval method and related device

Deep adaptive multi-modal hash retrieval method and related device

Info

Publication number
CN114691897A
CN114691897A
Authority
CN
China
Prior art keywords
target training
hash
sample
network
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210284064.3A
Other languages
Chinese (zh)
Inventor
张正
安峻枫
卢光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202210284064.3A priority Critical patent/CN114691897A/en
Publication of CN114691897A publication Critical patent/CN114691897A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep adaptive multi-modal hash retrieval method and related device. In hash learning over multi-modal data, a feature learning network is designed for each modality according to the physical characteristics of that modality's data. A learnable weight is determined for each modality's features according to the contribution each modality in the currently input training sample makes to the performance of the final common feature, and the modality features are fused according to these weights, completing adaptive-weight information fusion tailored to the characteristics of the training sample. The difference between the fused common feature and the hash code is minimized; scalable semantic features extracted from preset labels are incorporated into this process, the parameters of the hash function are updated automatically, and the feature space and the hash space are aligned. Supervising the parameter updates with label semantic information improves both the adaptive multi-modal feature fusion capability and the discriminative representation capability of hash learning.

Description

Deep adaptive multi-modal hash retrieval method and related device
Technical Field
The invention relates to the technical field of hash retrieval, and in particular to a deep adaptive multi-modal hash retrieval method and related device.
Background
With the rapid development of information technology, multimedia data are represented in increasingly diverse forms, including images, text, audio and the like. Multi-modal hash retrieval encodes data comprising multiple modalities into compact binary codes and searches over those codes. Hash learning must be performed before hash retrieval, and existing multi-modal hash learning methods cannot effectively and adaptively fuse the data information of multiple modalities, which hurts the efficiency of hash learning.
Thus, there is a need for improvements and enhancements in the art.
Disclosure of Invention
In view of the above defects in the prior art, the present invention provides a deep adaptive multi-modal hash retrieval method and related devices, aiming to solve the problem that multi-modal hash retrieval in the prior art cannot adaptively fuse the data information of multiple modalities.
To solve the above technical problems, the invention adopts the following technical solution:
In a first aspect of the present invention, a deep adaptive multi-modal hash retrieval method is provided, the method comprising:
selecting a plurality of target training samples from a plurality of training samples to form a target training batch, wherein each training sample comprises data of at least one modality; determining a first feature extraction network set corresponding to each target training sample according to the modalities of the data in the target training sample, wherein the first feature extraction network set comprises a first feature extraction network corresponding to each modality in the target training sample; and acquiring the initial features of the modalities in the target training sample through the first feature extraction networks;
inputting the initial features of each modality in the target training sample into a weight extraction network, obtaining the weight of each modality in the target training sample output by the weight extraction network, and fusing the initial features of the modalities in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample;
inputting the fusion feature of the target training sample into a first hash network, acquiring the sample hash code output by the first hash network, inputting the semantic label corresponding to the target training sample into a second hash network, and acquiring the semantic hash code output by the second hash network;
acquiring training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and updating parameters of the first hash network according to the training loss of the target training batch;
and re-executing the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until the parameters of the first hash network converge, and acquiring the hash code corresponding to a sample to be retrieved by adopting the first hash network after parameter convergence.
Wherein the weight extraction network comprises a feature extraction layer and a weight output layer, and the inputting the initial features of each modality in the target training sample into the weight extraction network and obtaining the weight of each modality in the target training sample output by the weight extraction network comprises:
inputting the initial features of each modality in the target training sample into the feature extraction layer, and acquiring the latent consistent feature corresponding to each modality in the target training sample;
inputting the latent consistent feature corresponding to each modality in the target training sample into the weight output layer to obtain the weight of each modality in the target training sample;
and the fusing the initial features of the modalities in the target training sample according to the weight corresponding to each modality comprises:
fusing the latent consistent features of the modalities in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample.
The method for deep adaptive multi-modal hash retrieval, wherein the obtaining of the training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample comprises:
obtaining a first loss of the target training batch according to a difference between the sample hash code and the fusion feature of each target training sample;
obtaining a second loss of the target training batch according to a difference between the sample hash code and the semantic hash code of each target training sample;
obtaining a third loss of the target training batch according to the sample hash code of each target training sample and the semantic similarity between each target training sample;
and acquiring the training loss of the target training batch according to the first loss, the second loss and the third loss.
The method for deep adaptive multi-modal hash retrieval, wherein before the obtaining of the third loss of the target training batch according to the sample hash codes of the target training samples and the semantic similarity between the target training samples, the method further comprises:
and obtaining semantic similarity between the target training samples according to the semantic label corresponding to each target training sample.
The method for deep adaptive multi-modal hash retrieval, wherein the updating the parameters of the first hash network according to the training loss of the target training batch comprises:
and updating the parameters of the first hash network, the second hash network and the weight extraction network according to the training loss of the target training batch.
The method for deep adaptive multi-modal hash retrieval, wherein the selecting a plurality of target training samples from the plurality of training samples to form a target training batch comprises:
acquiring the fusion feature loss of each training sample according to the fusion features respectively corresponding to the training samples under the current network parameters and the initial features respectively corresponding to the training samples;
and sorting according to the fusion feature losses respectively corresponding to the training samples, and selecting the target training samples according to the sorting result.
The deep adaptive multi-modal hash retrieval method, wherein the obtaining of the hash code corresponding to the sample to be retrieved by using the first hash network after parameter convergence comprises:
acquiring the fusion feature corresponding to the sample to be retrieved, inputting the fusion feature of the sample to be retrieved into the first hash network after parameter convergence, and acquiring the hash code corresponding to the sample to be retrieved output by the first hash network.
In a second aspect of the present invention, there is provided a deep adaptive multi-modal hash retrieval apparatus, comprising:
an initial feature extraction module, configured to select a plurality of target training samples from a plurality of training samples to form a target training batch, wherein each training sample comprises data of at least one modality; determine a first feature extraction network set corresponding to each target training sample according to the modalities of the data in the target training sample, the first feature extraction network set comprising a first feature extraction network corresponding to each modality in the target training sample; and obtain the initial features of the modalities in the target training sample through the first feature extraction networks;
a fusion feature extraction module, configured to acquire the weight of each modality in the target training sample according to the initial feature of each modality, and fuse the initial features of the modalities in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample;
the hash module is used for inputting the fusion characteristics of the target training sample into a first hash network to obtain a sample hash code output by the first hash network, and inputting the semantic label corresponding to the target training sample into a second hash network to obtain a semantic hash code output by the second hash network;
a parameter updating module, configured to obtain a training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and update a parameter of the first hash network according to the training loss of the target training batch;
an iteration module, configured to re-execute the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until the parameters of the first hash network converge;
and the retrieval module is used for acquiring the hash code corresponding to the sample to be retrieved by adopting the first hash network after the parameter convergence.
In a third aspect of the present invention, a terminal is provided. The terminal comprises a processor and a computer-readable storage medium communicatively connected to the processor; the computer-readable storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the computer-readable storage medium to execute the steps of any one of the foregoing deep adaptive multi-modal hash retrieval methods.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores one or more programs executable by one or more processors to implement the steps of any one of the deep adaptive multi-modal hash retrieval methods described above.
Compared with the prior art, the invention provides a deep adaptive multi-modal hash retrieval method and related device. In the hash learning process, the initial features of different modalities are extracted by neural networks suited to each modality; a neural network then outputs a weight for each modality according to its initial features, and the features are fused according to these weights to obtain the fusion feature of the multi-modal data, realizing adaptive weight updating and complementary fusion of multi-modal content. After the fusion feature is obtained, it is input into the first hash network to obtain the feature hash code, and the semantic label is likewise converted into a binary code (the semantic hash code); the training loss is then obtained from the feature hash code and the semantic hash code. The method performs hash learning with an end-to-end deep neural network and introduces semantic supervision into the hash learning process, which improves the efficiency of hash learning and makes the resulting hash codes discriminative and effective.
Drawings
FIG. 1 is a flowchart of an embodiment of the deep adaptive multi-modal hash retrieval method provided by the present invention;
FIG. 2 is a schematic diagram of the learning framework in an embodiment of the deep adaptive multi-modal hash retrieval method provided by the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of the deep adaptive multi-modal hash retrieval apparatus provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a terminal provided by the present invention.
Detailed Description
To make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The deep adaptive multi-modal hash retrieval method provided by the invention can be applied to any terminal with computing capability; the terminal executes the method to determine the multi-modal hash retrieval parameters, and may be, but is not limited to, a computer, a mobile terminal, a smart home appliance, a wearable device, and the like.
Example one
As shown in fig. 1, in an embodiment, the deep adaptive multi-modal hash retrieval method includes the steps of:
s100, selecting a plurality of target training samples from a plurality of training samples to form a target training batch, wherein each training sample comprises data of at least one mode, determining a first feature extraction network set corresponding to the target training samples according to the modes of the data in the target training samples, the first feature extraction network set comprises first feature extraction networks respectively corresponding to the modes in the target training samples, and acquiring initial features of the modes in the target training samples through the first feature extraction networks.
In this embodiment, one round of training uses one training batch; that is, the network parameters are updated once according to the results computed over all training samples in the batch. Each training sample is multi-modal data, i.e. it includes data of multiple modalities. Specifically, each information source may be referred to as a modality, and multi-modal data comprise data from multiple information sources, such as video, audio, text and images. When processing multi-modal data, feature extraction is performed first. Data of different modalities may be fed to different feature extraction networks to obtain the initial features of each modality: for example, a CNN may extract features from image data, a C3D network from video data, and a BERT network from text data.
The modalities included in different training samples may differ; for each target training sample, the first feature extraction networks corresponding to it are determined according to the modalities of its data. Each of these first feature extraction networks then extracts features from the data of the corresponding modality in the target training sample, yielding the initial feature of each modality.
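As an illustrative sketch (not the patent's actual networks), the per-modality extraction step can be mimicked with one stand-in "network" per modality. The modality names, raw dimensions, and random projections below are all assumptions; the point is only the dispatch by the modalities actually present in the sample:

```python
import numpy as np

rng = np.random.default_rng(0)
RAW_DIM = {"image": 512, "video": 1024, "text": 768}  # assumed raw input dims
d = 128  # common initial-feature dimension (assumed)

# one fixed random projection per modality as a stand-in extraction network
extractors = {m: rng.standard_normal((dim, d)) for m, dim in RAW_DIM.items()}

def extract_initial_features(sample):
    """Select the first feature extraction network set from the modalities
    actually present in the sample, then extract per-modality features."""
    return {m: x @ extractors[m] for m, x in sample.items()}

# a sample that happens to contain only image and text data
sample = {"image": rng.standard_normal(512), "text": rng.standard_normal(768)}
feats = extract_initial_features(sample)
```

In a real system each stand-in projection would be replaced by the trained CNN, C3D, or BERT network named above.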
S200, inputting the initial features of each modality in the target training sample into a weight extraction network, obtaining the weight of each modality in the target training sample output by the weight extraction network, and fusing the initial features of the modalities in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample.
The initial features of each modality in the target training sample are further processed in the weight extraction network. Specifically, the weight extraction network may comprise a feature extraction layer and a weight output layer, as shown in fig. 2. The feature extraction layer performs further feature extraction on the initial features of each modality, which may be regarded as projecting them into a latent consistent space to obtain latent consistent features; the latent consistent feature of each modality reflects that modality's feature expression in the latent consistent space. After the latent consistent features are input into the weight output layer, the weight output layer outputs the weight corresponding to each modality. The parameters of the weight output layer are determined by the contribution the initial features of each modality make, through the fused feature, to the performance of the finally generated hash code; that is, during training the parameters of the weight output layer are updated according to the training loss so that the training loss becomes as small as possible. Inputting the initial features of each modality in the target training sample into the weight extraction network and obtaining the weight of each modality output by the weight extraction network includes:
inputting the initial features of each modality in the target training sample into the feature extraction layer, and acquiring the latent consistent feature corresponding to each modality in the target training sample;
and inputting the latent consistent feature corresponding to each modality in the target training sample into the weight output layer to obtain the weight of each modality in the target training sample.
The fusing of the initial features of the modalities in the target training sample according to the weight corresponding to each modality includes:
fusing the latent consistent features of the modalities in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample.
As can be seen from the foregoing description, the weight corresponding to each modality is adaptively adjusted according to the characteristics of that modality's data, so that complementary feature fusion can be achieved according to the actual situation of each modality's data in the sample.
After the weights output by the weight output layer are obtained, the latent consistent features of the modalities in the target training sample are fused according to the weight corresponding to each modality, yielding the fusion feature corresponding to the target training sample.
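A minimal sketch of the projection, weighting, and fusion steps above. The linear-plus-tanh feature extraction layer, the single scoring vector with a softmax as the weight output layer, and the dimensions are all assumptions, not the patent's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128  # latent consistent feature dimension (assumed)

# hypothetical parameters: per-modality projection into the latent
# consistent space, and one scoring vector for the weight output layer
W_proj = {m: rng.standard_normal((d, d)) * 0.01 for m in ("image", "text")}
w_out = rng.standard_normal(d) * 0.01

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse(initial_feats):
    # 1) project each modality's initial feature into the latent space
    latent = {m: np.tanh(f @ W_proj[m]) for m, f in initial_feats.items()}
    # 2) score each modality and normalize the scores into weights
    mods = sorted(latent)
    weights = softmax(np.array([latent[m] @ w_out for m in mods]))
    # 3) weighted sum of latent features = the fusion feature
    fused = sum(w * latent[m] for w, m in zip(weights, mods))
    return fused, dict(zip(mods, weights))

feats = {"image": rng.standard_normal(d), "text": rng.standard_normal(d)}
fused, weights = fuse(feats)
```

The softmax guarantees the per-modality weights are positive and sum to one, so a modality whose latent feature scores higher contributes more to the fusion feature.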
S300, inputting the fusion feature of the target training sample into a first hash network, acquiring the sample hash code output by the first hash network, inputting the semantic label corresponding to the target training sample into a second hash network, and acquiring the semantic hash code output by the second hash network.
To ensure that the generated hash codes are discriminative and valid, semantic supervision is introduced in this embodiment. Specifically, each training sample corresponds to a semantic label that reflects its semantic categories; the semantic label may be a vector of 0s and 1s, where each value indicates whether the training sample belongs to the corresponding semantic category.
After the fusion feature of a target training sample is obtained, it is input into the first hash network, where it is processed by the hash function to produce the sample hash code output by the first hash network. The semantic label corresponding to the target training sample is converted flexibly: it is input into the second hash network, which outputs the semantic hash code; that is, the semantic label is converted into a binary code, whose length may be arbitrary.
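The two hash networks can be sketched as follows. The linear-layer-plus-tanh form, the code length k, and all parameters are assumptions; sign(.) produces the binary code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, c, k = 128, 10, 32  # feature dim, label categories, code length (assumed)

W1 = rng.standard_normal((d, k)) * 0.1  # first hash network: features -> code
W2 = rng.standard_normal((c, k)) * 0.1  # second hash network: labels -> code

def hash_code(x, W):
    # tanh as a smooth surrogate during training; sign gives the binary code
    return np.sign(np.tanh(x @ W))

fused = rng.standard_normal(d)          # fusion feature of one sample
label = np.zeros(c)
label[[1, 4]] = 1.0                     # 0/1 multi-label semantic vector

b_sample = hash_code(fused, W1)         # sample hash code
b_semantic = hash_code(label, W2)       # semantic hash code
```

Because the second network maps the label vector to a code of the same length k, the two codes can be compared directly when computing the losses below.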
The obtaining of the training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample includes:
obtaining a first loss of the target training batch according to a difference between the sample hash code and the fusion feature of each target training sample;
obtaining a second loss of the target training batch according to a difference between the sample hash code and the semantic hash code of each target training sample;
obtaining a third loss of the target training batch according to the sample hash code of each target training sample and the semantic similarity between each target training sample;
and acquiring the training loss of the target training batch according to the first loss, the second loss and the third loss.
To make the sample hash codes valid and discriminative: the difference between a sample's hash code and its fusion feature computed through the above steps should be as small as possible, so that more of the original information is retained; the difference between a sample's hash code and its semantic hash code should be as small as possible, so that the hash code accurately reflects the semantics; and the sample hash codes of samples with more similar semantic labels should be more similar, while those of samples with less similar labels should be less similar, so that the hash codes are discriminative.
That is, before the third loss of the target training batch is obtained according to the sample hash codes of the target training samples and the semantic similarity between the target training samples, the method further includes:
and obtaining semantic similarity between the target training samples according to the semantic label corresponding to each target training sample.
Specifically, obtaining the third loss of the target training batch according to the sample hash codes of the target training samples and the semantic similarity between the target training samples may comprise: for each pair of target training samples in the batch, measuring the distance between the similarity of their sample hash codes and their semantic similarity to obtain a sample-pair loss, and summing all sample-pair losses in the batch to obtain the third loss.
When the training loss of the target training batch is obtained from the first loss, the second loss and the third loss, the three losses may each be given a corresponding weight and then summed as a weighted sum to obtain the training loss of the target training batch.
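A toy sketch of assembling the three losses into the batch training loss. The concrete loss forms (a quantization-style term for code vs. feature, a squared difference for code vs. semantic code, inner-product similarity matching for the pairwise term) and the weights alpha, beta, gamma are plausible instantiations, not the patent's exact formulas:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 4, 16, 8  # batch size, feature dim, code length (toy sizes)

F = rng.standard_normal((n, d))               # fusion features of the batch
W1 = rng.standard_normal((d, k)) * 0.1        # first hash network (stand-in)
B = np.tanh(F @ W1)                           # relaxed sample hash codes
Bs = np.sign(rng.standard_normal((n, k)))     # semantic hash codes (stand-in)
L = (rng.random((n, 3)) > 0.5).astype(float)  # 0/1 semantic label vectors

# pairwise semantic similarity from labels (cosine of label vectors)
Ln = L / (np.linalg.norm(L, axis=1, keepdims=True) + 1e-12)
S = Ln @ Ln.T

loss1 = np.mean((np.sign(B) - B) ** 2)    # code vs. projected fusion feature
loss2 = np.mean((B - Bs) ** 2)            # sample code vs. semantic code
loss3 = np.mean((B @ B.T / k - S) ** 2)   # code similarity vs. label similarity
alpha, beta, gamma = 1.0, 1.0, 1.0        # assumed loss weights
train_loss = alpha * loss1 + beta * loss2 + gamma * loss3
```

Each term is averaged over the batch, so the three components stay on comparable scales before weighting.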
Referring to fig. 1 again, the method provided in this embodiment further includes the steps of:
s400, obtaining the training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and updating the parameters of the first hash network according to the training loss of the target training batch.
The deep adaptive multi-modal hash retrieval method of this embodiment adopts an end-to-end training mode, and the network parameters are updated according to the training loss of the target training batch. In one possible implementation, only the parameters of the first hash network are updated while the parameters of the other networks are fixed. To accelerate training and improve the training effect, the parameters of the weight extraction network and the second hash network may also be learnable; that is, the updating of the parameters of the first hash network according to the training loss of the target training batch includes:
updating the parameters of the first hash network, the second hash network and the weight extraction network according to the training loss of the target training batch.
Further, the parameters of the first feature extraction networks may also be learnable, i.e. updated during training together with the parameters of the other networks.
Referring to fig. 1 again, the method provided in this embodiment further includes the steps of:
s500, the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch is executed again until the parameters of the first Hash network are converged, and the Hash code corresponding to the sample to be retrieved is obtained by adopting the first Hash network after the parameters are converged.
And after the network parameters are updated once, reselecting the training batch for the next training, namely selecting a new target training batch, updating the network parameters according to the steps, iterating in the above way, and finishing the training after the network parameters are converged.
In this embodiment, to improve training efficiency, the target training samples are not selected randomly from the plurality of training samples but according to an information entropy associated with each training sample. Specifically, the selecting of a plurality of target training samples from the plurality of training samples to form a target training batch includes:
acquiring the fusion feature loss of each training sample according to the fusion features respectively corresponding to the training samples under the current network parameters and the initial features respectively corresponding to the training samples;
and sorting according to the fusion feature losses respectively corresponding to the training samples, and selecting the target training samples according to the sorting result.
The fusion feature loss corresponding to the target training sample reflects the difference between the fusion feature obtained after the initial features corresponding to the modalities of the target training sample are fused and the initial feature, the fusion feature loss can be calculated by solving the difference, distance, regularization loss and the like between the initial features and the fusion feature, the parameters of the neural network are updated after each training is finished, for the training samples which are not selected as the target training batch in the current training, the corresponding initial features and the corresponding fusion features are obtained according to the processing methods in the steps S100-S200 respectively, and then the fusion feature loss corresponding to each training sample is obtained according to the initial features and the fusion features corresponding to each training sample. And then sequencing the fusion losses corresponding to the training samples, and selecting a plurality of target training samples in the new training as the target training batches in the new training according to the sequencing result.
The fusion feature loss corresponding to a training sample reflects how difficult that sample is to learn: the larger the fusion feature loss, the greater the learning difficulty and the larger the information entropy, and vice versa. The fusion feature losses of the training samples that have not been selected as target training samples under the current network parameters are sorted, and the first n training samples with the smallest current fusion feature loss can be used as the target training samples in the next round of training. This achieves sample learning in an easy-to-difficult order and effectively improves learning efficiency.
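As an illustration only (the function and variable names below are hypothetical, and the embodiment does not prescribe a concrete implementation), the easy-to-difficult selection described above might be sketched as:

```python
def select_target_batch(fusion_losses, batch_size):
    """Pick the `batch_size` training samples with the smallest fusion
    feature loss, i.e. the easiest samples under the current parameters."""
    # Sort sample ids by ascending fusion feature loss (easy first).
    ranked = sorted(fusion_losses, key=fusion_losses.get)
    return ranked[:batch_size]
```

For instance, `select_target_batch({"s1": 0.8, "s2": 0.1, "s3": 0.4}, 2)` would yield the two easiest samples, `["s2", "s3"]`.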
The obtaining of the hash code corresponding to the sample to be retrieved by using the first hash network after the parameter convergence includes:
acquiring the fusion feature corresponding to the sample to be retrieved, inputting the fusion feature of the sample to be retrieved into the first hash network after parameter convergence, and acquiring the hash code corresponding to the sample to be retrieved output by the first hash network. After the network parameters converge, training is finished, and the hash code corresponding to the sample to be retrieved is obtained based on the trained network parameters to realize hash retrieval. Notably, after training is completed, the second hash network is not used when obtaining the hash code of the sample to be retrieved. The specific process is as follows: according to the modalities of the data in the sample to be retrieved, extract initial features with the corresponding first feature extraction networks; obtain the weight of each modality through the weight extraction network; perform feature fusion based on the weights to obtain the fusion feature corresponding to the sample to be retrieved; and input the fusion feature into the first hash network with converged parameters to obtain the hash code of the sample to be retrieved.
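The retrieval-time fusion and binarization steps might be sketched as follows, assuming weighted-sum fusion and sign binarization; these operator choices, like all names here, are assumptions for illustration, since the embodiment leaves the concrete operators to the networks:

```python
def fuse_features(modal_features, modal_weights):
    """Weighted-sum fusion of per-modality feature vectors
    (assumes all modalities share the same feature dimension)."""
    dim = len(next(iter(modal_features.values())))
    fused = [0.0] * dim
    for modality, vec in modal_features.items():
        w = modal_weights[modality]
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused

def to_hash_code(vec):
    """Binarize a real-valued network output into a {-1, +1} hash code."""
    return [1 if v >= 0 else -1 for v in vec]
```

With equal weights, `fuse_features({"image": [1.0, -2.0], "text": [3.0, 0.0]}, {"image": 0.5, "text": 0.5})` gives `[2.0, -1.0]`, which `to_hash_code` maps to `[1, -1]`.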
To sum up, this embodiment provides a deep adaptive multi-modal hash retrieval method. In the hash learning process, initial features of different modalities are extracted by neural networks suited to those modalities; a neural network then outputs the weight of each modality from its initial features, and the features are fused to obtain the fusion feature of the multi-modal data, realizing adaptive weight updating and complementary fusion of multi-modal content. After the fusion feature is obtained, it is input into the first hash network to obtain a feature hash code, the semantic tag is converted into a binary code (semantic hash code), and the training loss is obtained from the feature hash code and the semantic hash code. The method performs hash learning with an end-to-end deep learning neural network and introduces semantic supervision into the hash learning process, improving the efficiency of hash learning so that the finally obtained hash codes are discriminative and valid.
It should be understood that, although the steps in the flowcharts shown in the figures of the present specification are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not restricted to a strict order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
Example two
Based on the foregoing embodiment, the present invention further provides a depth adaptive multi-modal hash retrieval apparatus, as shown in fig. 3, where the depth adaptive multi-modal hash retrieval apparatus includes:
an initial feature extraction module, configured to select multiple target training samples from multiple training samples to form a target training batch, where each training sample includes data of at least one modality, determine a first feature extraction network set corresponding to the target training sample according to the modality of the data in the target training sample, where the first feature extraction network set includes first feature extraction networks respectively corresponding to the modalities in the target training sample, and obtain initial features of the modalities in the target training sample through the first feature extraction networks, as specifically described in embodiment one;
a fusion feature extraction module, configured to obtain a weight of each modality in the target training sample according to an initial feature of each modality in the target training sample, and fuse the initial feature of each modality in the target training sample according to a weight corresponding to each modality to obtain a fusion feature corresponding to the target training sample, which is specifically described in embodiment one;
a hash module, configured to input the fusion feature of the target training sample to a first hash network, obtain a sample hash code output by the first hash network, input a semantic tag corresponding to the target training sample to a second hash network, and obtain a semantic hash code output by the second hash network, where the hash module is specifically described in embodiment one;
a parameter updating module, configured to obtain a training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and update a parameter of the first hash network according to the training loss of the target training batch, as described in embodiment one;
an iteration module, configured to re-execute the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until a parameter of the first hash network converges, as described in embodiment one;
the retrieval module is configured to obtain a hash code corresponding to a sample to be retrieved by using the first hash network after parameter convergence, which is specifically described in embodiment one.
EXAMPLE III
Based on the above embodiments, the present invention further provides a terminal, as shown in fig. 4, where the terminal includes a processor 10 and a memory 20. Fig. 4 shows only some of the components of the terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 20 may also be an external storage device of the terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing application software installed in the terminal and various data. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a deep adaptive multi-modal hash retrieval program 30, and the deep adaptive multi-modal hash retrieval program 30 can be executed by the processor 10 to implement the deep adaptive multi-modal hash retrieval method of the present application.
The processor 10 may, in some embodiments, be a central processing unit (CPU), a microprocessor, or another chip for running program code stored in the memory 20 or processing data, for example executing the deep adaptive multi-modal hash retrieval method.
In one embodiment, the following steps are implemented when the processor 10 executes the depth adaptive multi-modal hash retrieval program 30 in the memory 20:
selecting a plurality of target training samples from a plurality of training samples to form a target training batch, wherein each training sample comprises data of at least one mode, determining a first feature extraction network set corresponding to the target training samples according to the modes of the data in the target training samples, the first feature extraction network set comprises first feature extraction networks respectively corresponding to the modes in the target training samples, and acquiring initial features of the modes in the target training samples through the first feature extraction networks;
inputting the initial features of each mode in the target training sample into a weight extraction network, obtaining the weight of each mode in the target training sample output by the weight extraction network, and fusing the initial features of each mode in the target training sample according to the weight corresponding to each mode to obtain the fusion features corresponding to the target training sample;
inputting the fusion characteristics of the target training sample into a first hash network, acquiring a sample hash code output by the first hash network, inputting a semantic label corresponding to the target training sample into a second hash network, and acquiring a semantic hash code output by the second hash network;
acquiring training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and updating parameters of the first hash network according to the training loss of the target training batch;
and re-executing the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until the parameters of the first hash network converge, and acquiring the hash code corresponding to the sample to be retrieved by using the first hash network after parameter convergence.
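Taken together, the steps above form an iterative loop. The following sketch, in which every callable and name is a hypothetical stand-in for the networks and losses of the embodiment, shows one way that loop could be organized:

```python
def train_until_converged(samples, extract, weigh, fuse, hash1, hash2,
                          batch_loss, update, select_batch,
                          max_rounds=100, tol=1e-4):
    """Repeat batch selection, forward pass, loss computation, and parameter
    update until the batch loss stops improving (a simple convergence proxy)."""
    prev = float("inf")
    for _ in range(max_rounds):
        batch = select_batch(samples)               # pick the target training batch
        codes, sem_codes = [], []
        for sample in batch:
            # Per-modality initial features via modality-specific extractors.
            feats = {m: extract(m, d) for m, d in sample["data"].items()}
            # Adaptive modality weights, then weighted fusion.
            fused = fuse(feats, weigh(feats))
            codes.append(hash1(fused))              # sample hash code
            sem_codes.append(hash2(sample["label"]))  # semantic hash code
        loss = batch_loss(codes, sem_codes)
        update(loss)                                # update network parameters
        if abs(prev - loss) < tol:
            break
        prev = loss
    return prev
```

The stopping criterion here is only a proxy: the embodiment states that training repeats until the parameters of the first hash network converge, without fixing a concrete test.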
Wherein the weight extraction network includes a feature extraction layer and a weight output layer, and the inputting the initial features of each modality in the target training sample into the weight extraction network and obtaining the weight of each modality in the target training sample output by the weight extraction network includes:
inputting the initial features of each modality in the target training sample into the feature extraction layer, and acquiring the potential consistent features corresponding to each modality in the target training sample;
inputting the potential consistent features corresponding to each modality in the target training sample into the weight output layer to obtain the weight of each modality in the target training sample;
the fusing the initial features of each modality in the target training sample according to the weight corresponding to each modality includes:
and fusing the potential consistent features of each modality in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample.
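One plausible form for the weight output layer is a softmax over per-modality scores, which guarantees positive weights that sum to 1; this choice, like the names below, is an assumption for illustration rather than the embodiment's specified design:

```python
import math

def modality_weights(scores):
    """Softmax over per-modality scalar scores from the weight output
    layer, so the resulting weights are positive and sum to 1."""
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}
```

With equal scores each modality receives the same weight, e.g. `modality_weights({"image": 1.0, "text": 1.0})` gives 0.5 to each modality.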
Wherein the obtaining of the training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample comprises:
obtaining a first loss of the target training batch according to a difference between the sample hash code and the fusion feature of each target training sample;
obtaining a second loss of the target training batch according to a difference between the sample hash code and the semantic hash code of each target training sample;
obtaining a third loss of the target training batch according to the sample hash code of each target training sample and the semantic similarity between the target training samples;
and acquiring the training loss of the target training batch according to the first loss, the second loss and the third loss.
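Assuming squared-error terms for the first and second losses and a normalized code inner product matched against semantic similarity for the third (the embodiment does not fix these forms; all names and the weights `alpha`, `beta`, `gamma` are illustrative), the combined batch loss might look like:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def batch_training_loss(sample_codes, semantic_codes, fused_feats,
                        similarity, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three batch losses described above."""
    n, k = len(sample_codes), len(sample_codes[0])
    # First loss: gap between each sample hash code and its fusion feature.
    loss1 = sum(sq_dist(sample_codes[i], fused_feats[i]) for i in range(n))
    # Second loss: gap between each sample hash code and its semantic hash code.
    loss2 = sum(sq_dist(sample_codes[i], semantic_codes[i]) for i in range(n))
    # Third loss: normalized code inner products should match semantic similarity.
    loss3 = 0.0
    for i in range(n):
        for j in range(n):
            inner = sum(a * b for a, b in zip(sample_codes[i], sample_codes[j])) / k
            loss3 += (inner - similarity[i][j]) ** 2
    return alpha * loss1 + beta * loss2 + gamma * loss3
```

When the codes, semantic codes, fused features, and similarity matrix are all mutually consistent, every term vanishes and the loss is zero.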
Before obtaining the third loss of the target training batch according to the sample hash code of each target training sample and the semantic similarity between the target training samples, the method further includes:
and obtaining semantic similarity between the target training samples according to the semantic label corresponding to each target training sample.
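A common multi-label definition of this semantic similarity, used here only as an illustrative assumption, marks two samples as similar when they share at least one semantic label:

```python
def semantic_similarity(labels):
    """S[i][j] = 1 if samples i and j share at least one semantic label,
    otherwise 0 (one common multi-label similarity definition)."""
    n = len(labels)
    return [[1 if set(labels[i]) & set(labels[j]) else 0
             for j in range(n)] for i in range(n)]
```

For example, `semantic_similarity([["cat"], ["cat", "dog"], ["car"]])` marks the first two samples as similar and the third as dissimilar to both.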
Wherein the updating the parameters of the first hash network according to the training loss of the target training batch comprises:
and updating the parameters of the first hash network, the second hash network, and the weight extraction network according to the training loss of the target training batch.
Wherein the selecting a plurality of target training samples from the plurality of training samples to form a target training batch comprises:
acquiring fusion feature loss of each training sample according to the fusion features respectively corresponding to the training samples under the current network parameters and the initial features respectively corresponding to the training samples;
and sorting the fusion feature losses corresponding to the training samples, and selecting the target training samples according to the sorting result.
The obtaining of the hash code corresponding to the sample to be retrieved by using the first hash network after the parameter convergence includes:
acquiring the fusion characteristics corresponding to the sample to be retrieved, inputting the fusion characteristics of the sample to be retrieved into the first hash network after parameter convergence, and acquiring a hash code corresponding to the sample to be retrieved output by the first hash network.
Example four
The present invention also provides a computer readable storage medium having stored thereon one or more programs, the one or more programs being executable by one or more processors to perform the steps of the depth adaptive multi-modal hash retrieval method as described above.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A depth-adaptive multi-modal hash retrieval method, the method comprising:
selecting a plurality of target training samples from a plurality of training samples to form a target training batch, wherein each training sample comprises data of at least one mode, determining a first feature extraction network set corresponding to the target training sample according to the mode of the data in the target training sample, the first feature extraction network set comprises first feature extraction networks respectively corresponding to all the modes in the target training sample, and acquiring initial features of all the modes in the target training sample through all the first feature extraction networks;
inputting the initial features of each mode in the target training sample into a weight extraction network, obtaining the weight of each mode in the target training sample output by the weight extraction network, and fusing the initial features of each mode in the target training sample according to the weight corresponding to each mode to obtain the fusion features corresponding to the target training sample;
inputting the fusion characteristics of the target training sample into a first hash network, acquiring a sample hash code output by the first hash network, inputting a semantic label corresponding to the target training sample into a second hash network, and acquiring a semantic hash code output by the second hash network;
acquiring training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and updating parameters of the first hash network according to the training loss of the target training batch;
and re-executing the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until the parameters of the first hash network converge, and acquiring the hash code corresponding to the sample to be retrieved by using the first hash network after parameter convergence.
2. The method according to claim 1, wherein the weight extraction network comprises a feature extraction layer and a weight output layer, and the inputting the initial features of each modality in the target training sample into the weight extraction network and obtaining the weight of each modality in the target training sample output by the weight extraction network comprises:
inputting the initial features of each mode in the target training sample to the feature extraction layer, and acquiring potential consistent features corresponding to each mode in the target training sample;
inputting potential consistent features corresponding to each mode in the target training sample to the weight output layer to obtain the weight of each mode in the target training sample;
the fusing the initial features of each modality in the target training sample according to the weight corresponding to each modality includes:
and fusing the potential consistent features of each mode in the target training sample according to the weight corresponding to each mode to obtain the fusion characteristic corresponding to the target training sample.
3. The method according to claim 1, wherein the obtaining the training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample comprises:
obtaining a first loss of the target training batch according to a difference between the sample hash code and the fusion feature of each target training sample;
obtaining a second loss of the target training batch according to a difference between the sample hash code and the semantic hash code of each target training sample;
obtaining a third loss of the target training batch according to the sample hash code of each target training sample and the semantic similarity between each target training sample;
and acquiring the training loss of the target training batch according to the first loss, the second loss and the third loss.
4. The method of claim 1, wherein before obtaining the third loss of the target training batch according to the sample hash code of each of the target training samples and the semantic similarity between the target training samples, the method further comprises:
and obtaining semantic similarity between the target training samples according to the semantic label corresponding to each target training sample.
5. The method according to claim 1, wherein the updating the parameters of the first hash network according to the training loss of the target training batch comprises:
and updating the parameters of the first hash network, the second hash network, and the weight extraction network according to the training loss of the target training batch.
6. The method according to claim 1, wherein the selecting a plurality of target training samples from the plurality of training samples to form a target training batch comprises:
acquiring the fusion feature loss of each training sample according to the fusion features respectively corresponding to the training samples under the current network parameters and the initial features respectively corresponding to the training samples;
and sorting the fusion feature losses corresponding to the training samples, and selecting the target training samples according to the sorting result.
7. The method according to claim 1, wherein the obtaining the hash code corresponding to the sample to be retrieved by using the first hash network after parameter convergence comprises:
acquiring the fusion characteristics corresponding to the sample to be retrieved, inputting the fusion characteristics of the sample to be retrieved into the first hash network after parameter convergence, and acquiring a hash code corresponding to the sample to be retrieved output by the first hash network.
8. A depth adaptive multi-modal hash retrieval apparatus, comprising:
the initial feature extraction module is used for selecting a plurality of target training samples from a plurality of training samples to form a target training batch, each training sample comprises data of at least one mode, a first feature extraction network set corresponding to the target training samples is determined according to the mode of the data in the target training samples, the first feature extraction network set comprises first feature extraction networks respectively corresponding to the modes in the target training samples, and the initial features of the modes in the target training samples are obtained through the first feature extraction networks;
the fusion feature extraction module is used for acquiring the weight of each modality in the target training sample according to the initial feature of each modality in the target training sample, and fusing the initial feature of each modality in the target training sample according to the weight corresponding to each modality to obtain the fusion feature corresponding to the target training sample;
the hash module is used for inputting the fusion characteristics of the target training sample into a first hash network to obtain a sample hash code output by the first hash network, and inputting a semantic label corresponding to the target training sample into a second hash network to obtain a semantic hash code output by the second hash network;
a parameter updating module, configured to obtain a training loss of the target training batch according to the sample hash code and the semantic hash code of each target training sample, and update a parameter of the first hash network according to the training loss of the target training batch;
an iteration module, configured to re-execute the step of selecting a plurality of target training samples from the plurality of training samples to form a target training batch until the parameters of the first hash network converge;
and the retrieval module is used for acquiring the hash code corresponding to the sample to be retrieved by adopting the first hash network after the parameter convergence.
9. A terminal, characterized in that the terminal comprises: a processor, a computer readable storage medium communicatively coupled to the processor, the computer readable storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the computer readable storage medium to perform the steps of implementing the method for deep adaptive multi-modal hash retrieval as recited in any of claims 1-7 above.
10. A computer readable storage medium, storing one or more programs, the one or more programs being executable by one or more processors for performing the steps of the method for depth adaptive multi-modal hash retrieval as recited in any one of claims 1-7.
CN202210284064.3A 2022-03-22 2022-03-22 Depth self-adaptive multi-mode Hash retrieval method and related equipment Pending CN114691897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210284064.3A CN114691897A (en) 2022-03-22 2022-03-22 Depth self-adaptive multi-mode Hash retrieval method and related equipment


Publications (1)

Publication Number Publication Date
CN114691897A true CN114691897A (en) 2022-07-01

Family

ID=82139984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210284064.3A Pending CN114691897A (en) 2022-03-22 2022-03-22 Depth self-adaptive multi-mode Hash retrieval method and related equipment

Country Status (1)

Country Link
CN (1) CN114691897A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186745A (en) * 2023-04-27 2023-05-30 暗链科技(深圳)有限公司 Hash encryption method, nonvolatile readable storage medium, and electronic device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination