CN112364198A - Cross-modal Hash retrieval method, terminal device and storage medium - Google Patents

Cross-modal Hash retrieval method, terminal device and storage medium Download PDF

Info

Publication number
CN112364198A
Authority
CN
China
Prior art keywords: real, hash code, text, representing, modal
Prior art date
Legal status: Granted
Application number
CN202011289807.3A
Other languages
Chinese (zh)
Other versions
CN112364198B (en)
Inventor
Cao Wenming (曹文明)
Feng Wenshuo (冯文铄)
Cao Guitao (曹桂涛)
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202011289807.3A
Publication of CN112364198A
Application granted
Publication of CN112364198B
Legal status: Active
Anticipated expiration


Classifications

    • G06F16/583: Still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/325: Unstructured textual data; indexing structures; hash tables
    • G06F16/35: Unstructured textual data; clustering; classification
    • G06F16/383: Unstructured textual data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/51: Still image data; indexing; data structures therefor; storage structures
    • G06F16/55: Still image data; clustering; classification
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V10/40: Image or video recognition; extraction of image or video features
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention is applicable to the technical field of cross-modal hash retrieval, and provides a cross-modal hash retrieval method, a terminal device, and a storage medium.

Description

Cross-modal Hash retrieval method, terminal device and storage medium
Technical Field
The invention belongs to the technical field of cross-modal hash retrieval, and particularly relates to a cross-modal hash retrieval method, terminal equipment and a storage medium.
Background
The cross-modal hash retrieval method is currently one of the mainstream approaches for fast and effective retrieval over multi-modal data. Deep cross-modal hash retrieval methods, which use deep neural networks to combine feature learning with hash learning, have also shown better performance than traditional cross-modal hash retrieval methods. However, the mismatch between the dense parameter updates of deep neural networks and the sparse index characteristic of hash codes aggravates the asymmetry in information richness between the original space and the Hamming space for synonymous heterogeneous multi-modal data, and greatly hinders improvements in the stability and performance of retrieval algorithms.
Disclosure of Invention
In view of this, embodiments of the present invention provide a cross-modal hash retrieval method, a terminal device, and a storage medium, which can realize organic connection and efficient multiplexing of the feature learning process and the hash learning process.
A first aspect of an embodiment of the present invention provides a cross-modal hash retrieval method, including:
generating a real image hash code of the real image data and a real text hash code of the real text data through a deep neural network;
acquiring similarity measurement according to a common label between the real image data and the real text data;
performing inter-modal loss optimization and intra-modal loss optimization according to the real image hash code, the real text hash code and the similarity measure to update parameters of the deep neural network and generate a public hash code;
inputting a new real image hash code generated by the deep neural network after the parameters are updated into a first generator, inputting the real text hash code into a first discriminator, and updating the network parameters of the first generator according to a first discrimination result output by the first discriminator;
inputting the new real text hash code generated by the deep neural network after the parameters are updated into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
A second aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the embodiments of the present invention when executing the computer program.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the embodiments of the present invention.
In the cross-modal hash retrieval method provided by the first aspect of the embodiments of the present invention, a real image hash code of real image data and a real text hash code of real text data are generated by a deep neural network; a similarity measure is acquired according to a common label between the real image data and the real text data; inter-modal loss optimization and intra-modal loss optimization are performed according to the real image hash code, the real text hash code, and the similarity measure, so as to update the parameters of the deep neural network and generate a public hash code; a new real image hash code generated by the deep neural network after the parameter update is input into the first generator, the real text hash code is input into the first discriminator, and the network parameters of the first generator are updated according to the first discrimination result output by the first discriminator; likewise, the new real text hash code generated by the updated network is input into the second generator, the real image hash code is input into the second discriminator, and the network parameters of the second generator are updated according to the second discrimination result output by the second discriminator. A bidirectional adversarial network model consisting of two symmetrical generative adversarial networks is thereby introduced into the deep cross-modal hash retrieval process to mitigate the information-richness asymmetry of synonymous heterogeneous data in cross-modal retrieval tasks; the inter-modal loss optimization, intra-modal loss optimization, and bidirectional adversarial network optimization processes are integrated into an end-to-end algorithm model framework; and an online hash code hot-update mechanism updates the real image hash code and the real text hash code in real time, so that organic connection and efficient multiplexing of the feature learning process and the hash learning process can be realized.
It is understood that the beneficial effects of the second aspect and the third aspect can be referred to in the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic flowchart of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 2 is a first schematic data flow diagram of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 3 is a second schematic data flow diagram of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The cross-modal hash retrieval method provided by the embodiments of the present application can be applied to mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), servers, and other terminal devices, to realize retrieval functions related to images and texts, such as image-text mutual search, automatic image recognition, automatic diversified classified albums, and recognition of sensitive frames in videos. The embodiments of the present application do not set any limit on the specific type of the terminal device.
As shown in fig. 1, the cross-modal hash retrieval method provided in the embodiment of the present application includes the following steps S101 to S105:
step S101, generating a real image hash code of real image data and a real text hash code of real text data through a deep neural network.
In application, preset sample data can be input into the deep neural network, where the sample data comprises n samples and each sample comprises a real image and a real text which are synonymously heterogeneous with each other. The real image data includes n real images; each real image may specifically be an image with dimensions of 224 × 224 × 3, and may be an RGB image, a depth image, a grayscale image, or the like. The real text data includes n real texts, and each real image is synonymous with its corresponding real text; each real text may specifically be an unprocessed or processed word, sentence, paragraph, chapter, symbol, or character drawing, for example a 1368-dimensional bag-of-words model vector generated through word-frequency statistics, or a word vector processed through word embedding. The embodiment of the present application does not set any limit on the specific types of the real image and the real text.
Step S102, obtaining a similarity measure according to the common label between the real image data and the real text data.
In application, the common label includes the object categories to which the n samples belong, and the object categories may be represented as object category vectors. The common labels can be expressed in the form of a label matrix; the similarity matrix is then calculated from the label matrix, and the similarity measure is calculated from the similarity matrix. The similarity matrix is the portion of the label matrix's product with its own transpose that remains positive, i.e., two samples are similar if they share at least one label. The similarity measure serves as supervisory information in the deep neural network training process, and includes the inter-modal similarity measure between the real image data and the real text data, the intra-modal similarity measure of the real image data, and the intra-modal similarity measure of the real text data.
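As an illustration of this construction, a minimal PyTorch sketch follows (the helper name and tensor shapes are assumptions for illustration, not part of the patent):

```python
import torch

def similarity_from_labels(L):
    """Similarity matrix from a label matrix L of shape (l, n), where
    column j is the multi-hot object-class vector shared by the j-th
    real image / real text pair.

    S[i, j] = 1 where samples i and j share at least one object class,
    i.e. the positive part of the transpose product, else 0.
    """
    overlap = L.t() @ L            # (n, n) counts of shared labels
    return (overlap > 0).float()   # keep only the positive part

# Toy usage: 3 object classes, 4 samples.
L = torch.tensor([[1., 1., 0., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
S = similarity_from_labels(L)      # S[0, 1] == 1, S[0, 2] == 0
```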
In one embodiment, step S101 includes:
extracting the image characteristics of each real image in the real image data through a pre-training model;
acquiring a real image hash code of each real image according to the image characteristics of each real image through a hyperbolic tangent function;
extracting text features of each real text in the real text data through a full-connection network;
acquiring a real text hash code of each real text according to the text characteristics of each real text by a hyperbolic tangent function;
step S102, comprising:
constructing a label matrix according to a common label between the real image data and the real text data;
and calculating a similarity matrix according to the label matrix, wherein the similarity matrix is used as a similarity measure in the process of training the deep neural network.
In application, for the real image data, the pre-training model may specifically be a VGG19 model, an AlexNet model, a CNN-F (Convolutional Neural Network) model, a residual network (ResNet) model, a GoogLeNet model, or the like, and is used to extract the convolution features of each real image; the hash code of each real image is then obtained through the hyperbolic tangent (tanh) function, yielding the hash code vector P ∈ R^(k×n) of the n real images, where k is the number of binary bits of the hash code. For the real text data, a fully connected network can be used to extract features, and the hash code of each real text is obtained through the hyperbolic tangent function, yielding the hash code vector T ∈ R^(k×n) of the n real texts.
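For concreteness, the two hash-code generation branches described above might be sketched as follows (a hedged PyTorch sketch: the choice of VGG19 as the pre-training model follows the text, while k = 64 hash bits and the hidden-layer widths are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

k = 64  # number of binary hash bits (an assumed setting)

# Image branch: a pre-trained VGG19 whose last layer is replaced so the
# convolution features are mapped to k hash bits, squashed into [-1, 1]
# by the hyperbolic tangent function.
vgg = models.vgg19(weights="IMAGENET1K_V1")
vgg.classifier[6] = nn.Linear(4096, k)
image_net = nn.Sequential(vgg, nn.Tanh())

# Text branch: a fully connected network over 1368-dimensional
# bag-of-words vectors, likewise ending in tanh.
text_net = nn.Sequential(
    nn.Linear(1368, 4096), nn.ReLU(),
    nn.Linear(4096, k), nn.Tanh(),
)

images = torch.randn(8, 3, 224, 224)  # dummy 224x224x3 real images
texts = torch.rand(8, 1368)           # dummy bag-of-words vectors
P = image_net(images).t()             # (k, n) real image hash codes
T = text_net(texts).t()               # (k, n) real text hash codes
```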
S103, inter-modal loss optimization and intra-modal loss optimization are carried out according to the real image hash code, the real text hash code and the similarity measurement, so that the parameters of the deep neural network are updated, and a public hash code is generated.
In application, inter-modal loss optimization is the basis of the cross-modal retrieval algorithm optimization. Specifically, inter-modal loss optimization is performed according to the real image hash code, the real text hash code, and the inter-modal similarity measure between the real image data and the real text data, so as to update the parameters of the deep neural network and generate a public hash code.
In one embodiment, the objective function of the inter-modal loss optimization is expressed as follows:

$$\min_{H,\theta_x,\theta_y} C_1=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Theta_{ij}-\log\bigl(1+e^{\Theta_{ij}}\bigr)\Bigr)+\Bigl(\lVert H-P\rVert_F^2+\lVert H-T\rVert_F^2\Bigr)+\Bigl(\lVert P\mathbf{1}\rVert_F^2+\lVert T\mathbf{1}\rVert_F^2\Bigr),\qquad \Theta_{ij}=\tfrac{1}{2}P_{*i}^{\top}T_{*j}$$

where H represents a default hash code, θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C1 represents the loss term of the inter-modal loss, n represents the number of samples, i represents the index of the real image, j represents the index of the real text, Sij represents the similarity matrix, Θij represents the inner product corresponding to the similarity matrix entry, P represents the new real image hash code vector of the n samples, T represents the new real text hash code vector of the n samples, ‖·‖F represents the F-norm, and 1 represents the all-ones vector. The first term of the inter-modal loss optimization objective function uses a negative log-likelihood function as the inter-modal similarity measure.
In application, the likelihood function is defined as:

$$p\bigl(S_{ij}\mid P_{*i},T_{*j}\bigr)=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\ 1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}\qquad\text{with}\qquad \sigma(\Theta_{ij})=\frac{1}{1+e^{-\Theta_{ij}}}$$
the real image hash code and the real text hash code can be obtained by the second item of the objective function of the inter-modal loss optimization, the balance of the hash codes is guaranteed by the third item, and the proportion of +1 to-1 in the hash bits generated by all the training samples is the same. During a specific training procedure, the following gradients were calculated, respectively:
the fixed real text input and default hash code H calculate the following gradient:
Figure BDA0002783467610000074
the fixed real image input and the default hash code H calculate the following gradient:
Figure BDA0002783467610000075
and updating parameters of the deep neural network through random gradient descent and error back propagation, and generating a public hash code.
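The loss and update can be sketched directly in PyTorch (a hedged sketch: the relative weights of the three terms are left at 1, since the text does not fix them, and autograd replaces the hand-derived gradients above):

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(P, T, H, S):
    """Sketch of C1. P, T: (k, n) continuous image/text hash codes;
    H: (k, n) default hash code; S: (n, n) similarity matrix."""
    theta = 0.5 * P.t() @ T                       # Theta_ij = P_i . T_j / 2
    nll = -(S * theta - F.softplus(theta)).sum()  # negative log-likelihood
    quant = (H - P).pow(2).sum() + (H - T).pow(2).sum()
    ones = torch.ones(P.size(1), 1)               # all-ones vector
    balance = (P @ ones).pow(2).sum() + (T @ ones).pow(2).sum()
    return nll + quant + balance

# One alternating update: fix one modality (detach it), backpropagate
# through the other, then refresh the public hash code.
# loss = inter_modal_loss(P, T.detach(), H, S)
# loss.backward(); optimizer.step()
# H_public = torch.sign(P + T)
```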
In application, because many similar samples exist within the same modality, optimizing the intra-modal loss can also effectively improve the retrieval effect. Specifically, intra-modal loss optimization is performed according to the real image hash code, the real text hash code, the intra-modal similarity measure of the real image data, and the intra-modal similarity measure of the real text data, so as to update the parameters of the deep neural network and generate a public hash code.
In one embodiment, the objective function of the intra-modal loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_2=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Phi_{ij}-\log\bigl(1+e^{\Phi_{ij}}\bigr)\Bigr)-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Psi_{ij}-\log\bigl(1+e^{\Psi_{ij}}\bigr)\Bigr),\qquad \Phi_{ij}=\tfrac{1}{2}P_{*i}^{\top}P_{*j},\ \ \Psi_{ij}=\tfrac{1}{2}T_{*i}^{\top}T_{*j}$$

where C2 represents the loss term of the intra-modal loss; the first term of the intra-modal loss optimization objective function uses a negative log-likelihood function as the intra-modal similarity measure of the real image data, and the second term uses a negative log-likelihood function as the intra-modal similarity measure of the real text data.
In application, in the intra-modal loss optimization process, a negative log-likelihood function is still used as the similarity measure within each of the real image data and real text data modalities, and the hash code of each modality is updated. As in the inter-modal loss optimization, the input of one modality and the default hash code are fixed, the gradient with respect to the other modality is then calculated, the parameters of the deep neural network are updated with a training method similar to the above, and a public hash code is generated. A sketch of the corresponding loss follows.
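Under the same assumptions as the inter-modal sketch, the intra-modal counterpart replaces the cross-modal inner products with within-modality ones:

```python
import torch.nn.functional as F

def intra_modal_loss(P, T, S):
    """Sketch of C2: the negative log-likelihood similarity measure
    applied inside the image modality and inside the text modality."""
    theta_img = 0.5 * P.t() @ P   # image-image inner products
    theta_txt = 0.5 * T.t() @ T   # text-text inner products
    nll_img = -(S * theta_img - F.softplus(theta_img)).sum()
    nll_txt = -(S * theta_txt - F.softplus(theta_txt)).sum()
    return nll_img + nll_txt
```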
In one embodiment, the expression of the public hash code is as follows:
H’=sign(P+T);
where H' represents the public hash code and sign() represents the sign function.
Step S104, inputting a new real image hash code generated by the deep neural network after updating parameters into a first generator, inputting the real text hash code into a first discriminator, and updating network parameters of the first generator according to a first discrimination result output by the first discriminator;
step S105, inputting the new real text hash code generated by the deep neural network after updating the parameters into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
In application, each discriminator uses the static hash code generated by one modality as real data: the first discriminator uses the real text hash code as real data, and the second discriminator uses the real image hash code as real data. Each generator uses the new hash code dynamically generated by the other modality as noise input: the first generator uses the new real image hash code as noise input, and the second generator uses the new real text hash code as noise input. Because of the chain rule, updating the network parameters of a generator also updates the parameters of the hash code generation network (namely, the deep neural network) of the corresponding modality. This enables each modality to generate hash codes closer to those of the other modality (whose hash codes serve as the discriminator's reference for real data), thereby ultimately improving the cross-modal retrieval performance.
In application, bidirectional adversarial optimization is performed by a bidirectional countermeasure module according to the hash codes of the two modalities generated previously. The bidirectional countermeasure module is composed of two symmetrical generative adversarial networks, each consisting of the two basic units of a generator and a discriminator: the first generator and the first discriminator form one generative adversarial network, and the second generator and the second discriminator form the other. The objective function of a generative adversarial network is expressed as follows:

$$\min_G\max_D V(D,G)=\mathbb{E}_{r\sim p_{\mathrm{data}}(r)}\bigl[\log D(r)\bigr]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$

where r represents a real image, z represents the input noise of the generator G, G(·) represents the image generated by the generator, D(·) represents the probability that the discriminator judges its input to be real (if the discriminator receives a real image r, D(r) should approach 1), and E[·] represents the mathematical expectation. The goal of the generator is to make the discriminator judge its generated images as real, i.e., to make D(G(z)) as large as possible; the goal of the discriminator is to make D(r) as large as possible and D(G(z)) as small as possible. The generator and the discriminator are thus adversaries in a two-player game. The first term of the objective function is the logarithmic expectation of the discriminator on real data, and the second term is the logarithmic expectation of the discriminator on generated data. The purpose of training the generative adversarial network is to update the parameters of the generator and the discriminator so as to improve model performance, such that the images output by the generator come as close as possible to real images and pass the discriminator's judgment.
In one embodiment, the gradient expression of the first discriminator is as follows:

$$\nabla_{\theta_{d_1}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_1\bigl(\hat{T}_{*i}\bigr)+\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)\Bigr]$$

The gradient expression of the first generator is as follows:

$$\nabla_{\theta_{g_1}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr),\qquad H_x=\operatorname{sign}(P)$$

where θd1 denotes the parameters of the first discriminator, T̂ represents the real text hash code vector of the n samples, D1(·) represents the probability that the first discriminator judges its input to be real, G(Hx) represents the virtual image hash code generated by the first generator, Hx represents the new real image hash code, θg1 denotes the parameters of the first generator, sign(·) represents the sign function, and P represents the new real image hash code vector of the n samples.
the gradient expression of the second discriminator is as follows:
Figure BDA0002783467610000105
the gradient expression of the second generator is as follows:
Figure BDA0002783467610000106
Figure BDA0002783467610000107
wherein the content of the first and second substances,
Figure BDA0002783467610000108
a gradient representing the second discriminator,
Figure BDA0002783467610000109
representing said real image hash code, D2() Representing the probability that the second discriminator discriminates the image as genuine, G (H)y) Representing a virtual text hash code, H, generated by said second generatoryRepresenting the new real-text hash code,
Figure BDA00027834676100001010
representing the gradient of the second generator and,
Figure BDA00027834676100001011
a true image hash code vector representing the n samples, and T represents a new true text hash code vector of the n samples.
In application, the optimization process of the bidirectional countermeasure module is a relatively independent process. In consideration of training efficiency, the Adam optimizer is adopted to cope with the sparse gradients and significant noise sensitivity of the bidirectional countermeasure network. In the specific training process, the existing text hash code T̂ is first used as the input of the first discriminator, and the dynamically generated image hash code Hx is used as the noise input of the first generator; in the same way, the existing image hash code P̂ is used as the input of the second discriminator, and the dynamically generated text hash code Hy is used as the noise input of the second generator. The optimization process focuses on updating the parameters of the generators, because they are associated with the hash code generation network parameters of the corresponding modality. The hash codes themselves are not updated in this process; the retrieval effect is improved indirectly by improving the performance of the generators.
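One optimization step of the first adversarial pair might then look as follows (a sketch with assumed MLP architectures and learning rates; the second pair is symmetric, with the roles of image and text swapped):

```python
import torch
import torch.nn as nn

k = 64  # hash bits, as above
G1 = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, k), nn.Tanh())
D1 = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g1 = torch.optim.Adam(G1.parameters(), lr=1e-4)  # Adam, per the text
opt_d1 = torch.optim.Adam(D1.parameters(), lr=1e-4)
bce = nn.BCELoss()

def adversarial_step(T_hat, H_x):
    """T_hat: (n, k) existing real text hash codes (real data for D1);
    H_x: (n, k) dynamically generated image hash codes (noise for G1)."""
    real = torch.ones(T_hat.size(0), 1)
    fake = torch.zeros(H_x.size(0), 1)
    # Discriminator step: push D1(T_hat) -> 1 and D1(G1(H_x)) -> 0.
    d_loss = bce(D1(T_hat), real) + bce(D1(G1(H_x).detach()), fake)
    opt_d1.zero_grad(); d_loss.backward(); opt_d1.step()
    # Generator step: push D1(G1(H_x)) -> 1. If H_x carries gradients,
    # the chain rule also propagates gradients back into the image
    # hash-generation network, as described in the text.
    g_loss = bce(D1(G1(H_x)), real)
    opt_g1.zero_grad(); g_loss.backward(); opt_g1.step()
```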
In one embodiment, after step S101, the method further includes:
obtaining a classification result of the real image data according to the image characteristics of each real image through an S-shaped function;
and obtaining the classification result of the real text data according to the text features of each real text through an S-shaped function.
In application, for the real image data, each real image is classified through an S-shaped (sigmoid) function to obtain the classification result vector U ∈ R^(l×n) of the n real images, where l is the number of object classes. Similarly, each real text is classified through an S-shaped function to obtain the classification result vector V ∈ R^(l×n) of the n real texts.
In one embodiment, the method further comprises:
and performing classification loss optimization according to the label matrix, the classification result of the real image data and the classification result of the real text data.
In application, in order to fully utilize label information, a classification loss optimization process is added in the cross-modal Hash retrieval method.
In one embodiment, the objective function of the classification loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_3=-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log U_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-U_{IJ}\bigr)\Bigr)-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log V_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-V_{IJ}\bigr)\Bigr)$$

where θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C3 represents the loss term of the classification loss, l represents the number of object classes, I represents the index of the object class to which a sample belongs, n represents the number of samples, J represents the index of the sample, LIJ represents the label matrix, UIJ represents the classification result vector of the real image data, and VIJ represents the classification result vector of the real text data. The first and second terms of the classification loss optimization objective function both use negative log-likelihood functions.
In application, the classification optimization process uses the same likelihood function as the inter-modal loss optimization, except that the similarity matrix is replaced by the label matrix. Notably, the classification optimization process does not introduce hash codes into the computation, because the classification result output of the network is parallel to the hash code output and is close in form to the label vector. A similar training approach is still used to compute gradients and update the network parameters, but the classification optimization process does not update the hash code. The retrieval effect can be indirectly improved by improving the classification performance of the deep neural network.
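Reading the negative log-likelihood over sigmoid outputs as a binary cross-entropy against the label matrix, a sketch of this loss is:

```python
import torch.nn.functional as F

def classification_loss(U, V, L):
    """Sketch of C3. U, V: (l, n) sigmoid classification outputs for the
    image and text modalities; L: (l, n) label matrix. No hash codes are
    involved, matching the parallel classification output in the text."""
    return F.binary_cross_entropy(U, L) + F.binary_cross_entropy(V, L)
```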
Fig. 2 and 3 are schematic diagrams illustrating data flow corresponding to a cross-modal hash retrieval method.
The cross-modal hash retrieval method provided by the embodiments of the present invention introduces a bidirectional adversarial network model, consisting of two symmetrical generative adversarial networks, into the deep cross-modal hash retrieval process to mitigate the information-richness asymmetry of synonymous heterogeneous data in cross-modal retrieval tasks; it integrates the inter-modal loss optimization, intra-modal loss optimization, and bidirectional adversarial network optimization processes into an end-to-end algorithm model framework; and it uses an online hash code hot-update mechanism to update the real image hash code and the real text hash code in real time, thereby realizing organic connection and efficient multiplexing of the feature learning process and the hash learning process.
Throughout the cross-modal hash retrieval method, a default hash code is first randomly generated, the loss functions are constructed and their gradients calculated, and the hash code is then continuously updated in real time using the image and text data over multiple iterations. Generating the hash code requires the two processes of feature learning and hash learning; the advantage of the online hash code hot-update mechanism is that the network parameters for feature extraction and hash generation can be continuously updated in real time throughout the training optimization process, linking feature learning and hash learning into an organic whole, as sketched below.
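A condensed sketch of that loop, reusing the helpers sketched earlier (num_epochs, the joint optimizer over both branches, and the data tensors images, texts, S are assumptions):

```python
H = torch.sign(torch.randn(k, images.size(0)))   # random default hash code
for epoch in range(num_epochs):                  # num_epochs is assumed
    P = image_net(images).t()                    # fresh image hash codes
    T = text_net(texts).t()                      # fresh text hash codes
    loss = inter_modal_loss(P, T, H, S) + intra_modal_loss(P, T, S)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Bidirectional module: recompute codes so the adversarial step has
    # its own graph (one symmetric call per generative adversarial pair).
    adversarial_step(text_net(texts).detach(), image_net(images))
    H = torch.sign((P + T).detach())             # hot-update the hash code
```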
Offline parameter locking is an efficient multiplexing mechanism used in the later-stage tuning of the cross-modal hash retrieval method. After a preliminary model is obtained through training, key parameters of part of the network's feature extraction layers (such as node weights) and the overall hyper-parameters of the cross-modal hash retrieval method can be locked on demand, preserving a basic optimal configuration, after which targeted optimization is carried out.
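In code, such locking can be sketched by freezing the chosen parameters and rebuilding the optimizer over the remainder (which layers to lock is a per-task choice; names are reused from the earlier sketches):

```python
# Lock the key feature-extraction weights of the image branch.
for param in vgg.features.parameters():
    param.requires_grad = False

# Keep optimizing only the parameters that remain unlocked.
trainable = [p for p in list(image_net.parameters())
             + list(text_net.parameters()) if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)
```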
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
As shown in fig. 4, an embodiment of the present application further provides a terminal device 100, including: at least one processor 10 (only one processor is shown in fig. 4), a memory 11, and a computer program 12 stored in the memory 11 and executable on the at least one processor 10, the steps in the various cross-modal hash retrieval method embodiments described above being implemented when the computer program 12 is executed by the processor 10.
In application, the terminal device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 4 is merely an example of a terminal device, and does not constitute a limitation of the terminal device, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, etc.
In application, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In some embodiments, the storage may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing an operating system, an application program, a Boot Loader (Boot Loader), data, and other programs, such as program codes of computer programs. The memory may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a network device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor implements the steps of the foregoing cross-modal hash retrieval method embodiments when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when being executed by a processor, the computer program implements the steps in the foregoing cross-modal hash search method embodiments.
The embodiment of the present application provides a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the foregoing cross-modal hash retrieval method embodiments when executed.
All or part of the flow in the method of the embodiments described above can be implemented by a computer program that instructs related hardware to complete, and the computer program can be stored in a computer readable storage medium, and when being executed by a processor, the computer program can implement the steps of the embodiments of the methods described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or apparatus capable of carrying computer program code to a terminal device, recording medium, computer Memory, Read-Only Memory (ROM), Random-Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative algorithmic steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative: the division of units is merely a logical functional division, and in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A cross-modal hash retrieval method is characterized by comprising the following steps:
generating a real image hash code of the real image data and a real text hash code of the real text data through a deep neural network;
acquiring similarity measurement according to a common label between the real image data and the real text data;
performing inter-modal loss optimization and intra-modal loss optimization according to the real image hash code, the real text hash code and the similarity measure to update parameters of the deep neural network and generate a public hash code;
inputting a new real image hash code generated by the deep neural network after the parameters are updated into a first generator, inputting the real text hash code into a first discriminator, and updating the network parameters of the first generator according to a first discrimination result output by the first discriminator;
inputting the new real text hash code generated by the deep neural network after the parameters are updated into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
2. The cross-modal hash retrieval method of claim 1, wherein the generating a true image hash code of the true image data and a true text hash code of the true text data by the deep neural network comprises:
extracting the image characteristics of each real image in the real image data through a pre-training model; wherein the real image data includes n real images;
acquiring a real image hash code of each real image according to the image characteristics of each real image through a hyperbolic tangent function;
extracting text features of each real text in the real text data through a full-connection network; the real text data comprises n real texts, and each real image is synonymously heterogeneous with each corresponding real text;
acquiring a real text hash code of each real text according to the text characteristics of each real text by a hyperbolic tangent function;
the obtaining a similarity measure according to a common label between the real image data and the real text data includes:
constructing a label matrix according to a common label between the real image data and the real text data; wherein the common label comprises object categories to which n samples belong, and each sample comprises a true image and a true text which are synonymously heterogeneous;
and calculating a similarity matrix used as a similarity measure in the process of training the deep neural network according to the label matrix.
3. The cross-modal hash retrieval method of claim 2, wherein the performing inter-modal loss optimization and intra-modal loss optimization based on the real image hash code, the real text hash code, and the similarity metric to update parameters of the deep neural network and generate a common hash code comprises:
performing inter-modal loss optimization according to the real image hash code, the real text hash code and the inter-modal similarity measure between the real image data and the real text data to update parameters of the deep neural network and generate a common hash code;
and performing intra-modal loss optimization according to the real image hash code, the real text hash code, the intra-modal similarity measure of the real image data and the intra-modal similarity measure of the real text data to update the parameters of the deep neural network and generate a public hash code.
4. The cross-modal hash retrieval method of claim 3, wherein the objective function of the inter-modal loss optimization is expressed as follows:

$$\min_{H,\theta_x,\theta_y} C_1=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Theta_{ij}-\log\bigl(1+e^{\Theta_{ij}}\bigr)\Bigr)+\Bigl(\lVert H-P\rVert_F^2+\lVert H-T\rVert_F^2\Bigr)+\Bigl(\lVert P\mathbf{1}\rVert_F^2+\lVert T\mathbf{1}\rVert_F^2\Bigr)$$

$$\Theta_{ij}=\tfrac{1}{2}P_{*i}^{\top}T_{*j}$$

where H represents a default hash code, θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C1 represents the loss term of the inter-modal loss, n represents the number of samples, i represents the index of the real image, j represents the index of the real text, Sij represents the similarity matrix, Θij represents the inner product corresponding to the similarity matrix entry, P represents the new real image hash code vector of the n samples, T represents the new real text hash code vector of the n samples, ‖·‖F represents the F-norm, and 1 represents the all-ones vector; the first term of the inter-modal loss optimization objective function uses a negative log-likelihood function as the inter-modal similarity measure;

the objective function of the intra-modal loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_2=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Phi_{ij}-\log\bigl(1+e^{\Phi_{ij}}\bigr)\Bigr)-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Psi_{ij}-\log\bigl(1+e^{\Psi_{ij}}\bigr)\Bigr),\qquad \Phi_{ij}=\tfrac{1}{2}P_{*i}^{\top}P_{*j},\ \ \Psi_{ij}=\tfrac{1}{2}T_{*i}^{\top}T_{*j}$$

where C2 represents the loss term of the intra-modal loss; the first term of the intra-modal loss optimization objective function uses a negative log-likelihood function as the intra-modal similarity measure of the real image data, and the second term uses a negative log-likelihood function as the intra-modal similarity measure of the real text data;

the expression of the common hash code is as follows:

H'=sign(P+T);

where H' represents the common hash code and sign() represents the sign function.
5. The cross-modal hash retrieval method of claim 2, wherein the gradient expression of the first discriminator is as follows:

$$\nabla_{\theta_{d_1}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_1\bigl(\hat{T}_{*i}\bigr)+\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)\Bigr]$$

the gradient expression of the first generator is as follows:

$$\nabla_{\theta_{g_1}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)$$

Hx=sign(P);

where θd1 denotes the parameters of the first discriminator, T̂ represents the real text hash code vector of the n samples, D1(·) represents the probability that the first discriminator judges its input to be real, G(Hx) represents the virtual image hash code generated by the first generator, Hx represents the new real image hash code, θg1 denotes the parameters of the first generator, sign(·) represents the sign function, and P represents the new real image hash code vector of the n samples;

the gradient expression of the second discriminator is as follows:

$$\nabla_{\theta_{d_2}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_2\bigl(\hat{P}_{*i}\bigr)+\log\Bigl(1-D_2\bigl(G(H_y)_{*i}\bigr)\Bigr)\Bigr]$$

the gradient expression of the second generator is as follows:

$$\nabla_{\theta_{g_2}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_2\bigl(G(H_y)_{*i}\bigr)\Bigr)$$

Hy=sign(T);

where θd2 denotes the parameters of the second discriminator, P̂ represents the real image hash code vector of the n samples, D2(·) represents the probability that the second discriminator judges its input to be real, G(Hy) represents the virtual text hash code generated by the second generator, Hy represents the new real text hash code, θg2 denotes the parameters of the second generator, and T represents the new real text hash code vector of the n samples.
6. The cross-modal hash retrieval method of any of claims 2 to 5, wherein each of the real images is an image having dimensions of 224 × 224 × 3;
each real text is a 1368-dimensional bag-of-words model vector generated through word-frequency statistics;
each of the common labels comprises a vector of object classes to which the n samples belong.
7. The cross-modal hash retrieval method of any of claims 2 to 5, wherein after extracting the image features of each real image in the real image data through the pre-trained model, the method further comprises:
obtaining a classification result of the real image data according to the image characteristics of each real image through an S-shaped function;
after extracting the text features of each text in the real text data through the full-connection network, the method further comprises the following steps:
and obtaining the classification result of the real text data according to the text features of each real text through an S-shaped function.
8. The cross-modal hash retrieval method of claim 7, wherein the method further comprises:
and performing classification loss optimization according to the label matrix, the classification result of the real image data and the classification result of the real text data.
9. The cross-modal hash retrieval method of claim 8, wherein the objective function of the classification loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_3=-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log U_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-U_{IJ}\bigr)\Bigr)-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log V_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-V_{IJ}\bigr)\Bigr)$$

where θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C3 represents the loss term of the classification loss, l represents the number of object classes, I represents the serial number of the object class to which a sample belongs, n represents the number of samples, J represents the serial number of the sample, LIJ represents the label matrix, UIJ represents the classification result vector of the real image data, and VIJ represents the classification result vector of the real text data; the first and second terms of the classification loss optimization objective function both use negative log-likelihood functions.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202011289807.3A 2020-11-17 2020-11-17 Cross-modal hash retrieval method, terminal equipment and storage medium Active CN112364198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011289807.3A CN112364198B (en) 2020-11-17 2020-11-17 Cross-modal hash retrieval method, terminal equipment and storage medium


Publications (2)

Publication Number / Publication Date
CN112364198A: 2021-02-12
CN112364198B: 2023-06-30

Family

ID=74532489

Family Applications (1)

Application Number: CN202011289807.3A (Active; granted as CN112364198B); priority date 2020-11-17; filing date 2020-11-17; title: Cross-modal hash retrieval method, terminal equipment and storage medium

Country Status (1)

CN: CN112364198B



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
US20200073968A1 (en) * 2018-09-04 2020-03-05 Inception Institute of Artificial Intelligence, Ltd. Sketch-based image retrieval techniques using generative domain migration hashing
CN110765281A (en) * 2019-11-04 2020-02-07 山东浪潮人工智能研究院有限公司 Multi-semantic depth supervision cross-modal Hash retrieval method
CN111639240A (en) * 2020-05-14 2020-09-08 山东大学 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Non-Patent Citations (1)

Title
WENMING CAO et al.: "A Review of Hashing Methods for Multimodal Retrieval", IEEE Access *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN112836068A (en) * 2021-03-24 2021-05-25 南京大学 Unsupervised cross-modal Hash retrieval method based on noisy label learning
CN112836068B (en) * 2021-03-24 2023-09-26 南京大学 Unsupervised cross-modal hash retrieval method based on noisy tag learning
CN113610080A (en) * 2021-08-04 2021-11-05 北京邮电大学 Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN113610080B (en) * 2021-08-04 2023-08-25 北京邮电大学 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN112364198B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2022104540A1 (en) Cross-modal hash retrieval method, terminal device, and storage medium
Liu et al. Towards unsupervised deep graph structure learning
CN110162593B (en) Search result processing and similarity model training method and device
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN105022754B (en) Object classification method and device based on social network
CN111382868A (en) Neural network structure search method and neural network structure search device
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
US11423307B2 (en) Taxonomy construction via graph-based cross-domain knowledge transfer
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN112214775A (en) Injection type attack method and device for graph data, medium and electronic equipment
CN113298152B (en) Model training method, device, terminal equipment and computer readable storage medium
CN112364198B (en) Cross-modal hash retrieval method, terminal equipment and storage medium
CN112699375A (en) Block chain intelligent contract security vulnerability detection method based on network embedded similarity
CN115204886A (en) Account identification method and device, electronic equipment and storage medium
AlGarni et al. An efficient convolutional neural network with transfer learning for malware classification
US11822590B2 (en) Method and system for detection of misinformation
CN114826681A (en) DGA domain name detection method, system, medium, equipment and terminal
CN111310743B (en) Face recognition method and device, electronic equipment and readable storage medium
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN111444335B (en) Method and device for extracting central word
CN113259369B (en) Data set authentication method and system based on machine learning member inference attack
US20220230014A1 (en) Methods and systems for transfer learning of deep learning model based on document similarity learning
CN110909777A (en) Multi-dimensional feature map embedding method, device, equipment and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant