CN112364198A - Cross-modal Hash retrieval method, terminal device and storage medium - Google Patents

Cross-modal Hash retrieval method, terminal device and storage medium Download PDF

Info

Publication number
CN112364198A
Authority
CN
China
Prior art keywords: real, hash code, text, representing, modal
Prior art date
Legal status: Granted
Application number
CN202011289807.3A
Other languages
Chinese (zh)
Other versions
CN112364198B (en)
Inventor
Cao Wenming (曹文明)
Feng Wenshuo (冯文铄)
Cao Guitao (曹桂涛)
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202011289807.3A
Publication of CN112364198A
Application granted
Publication of CN112364198B
Legal status: Active
Anticipated expiration


Classifications

    • G06F16/583: Still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/325: Unstructured textual data; indexing structures; hash tables
    • G06F16/35: Unstructured textual data; clustering; classification
    • G06F16/383: Unstructured textual data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/51: Still image data; indexing; data structures therefor; storage structures
    • G06F16/55: Still image data; clustering; classification
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V10/40: Image or video recognition; extraction of image or video features
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention is applicable to the technical field of cross-modal hash retrieval, and provides a cross-modal hash retrieval method, a terminal device, and a storage medium.

Description

Cross-modal Hash retrieval method, terminal device and storage medium
Technical Field
The invention belongs to the technical field of cross-modal hash retrieval, and particularly relates to a cross-modal hash retrieval method, terminal equipment and a storage medium.
Background
The cross-modal hash retrieval method is currently one of the mainstream approaches for fast and effective retrieval over multi-modal data. Deep cross-modal hash retrieval methods, which use deep neural networks to combine feature learning with hash learning, have also shown better performance than traditional cross-modal hash retrieval methods. However, the mismatch between the dense parameter updates of deep neural networks and the sparse index characteristic of hash codes aggravates the asymmetry in information richness between the original space and the Hamming space for synonymous heterogeneous multi-modal data, and greatly hinders improvements in the stability and performance of retrieval algorithms.
Disclosure of Invention
In view of this, embodiments of the present invention provide a cross-modal hash retrieval method, a terminal device, and a storage medium, which can realize organic connection and efficient multiplexing of the feature learning process and the hash learning process.
A first aspect of an embodiment of the present invention provides a cross-modal hash retrieval method, including:
generating a real image hash code of the real image data and a real text hash code of the real text data through a deep neural network;
acquiring similarity measurement according to a common label between the real image data and the real text data;
performing inter-modal loss optimization and intra-modal loss optimization according to the real image hash code, the real text hash code and the similarity measure to update parameters of the deep neural network and generate a public hash code;
inputting a new real image hash code generated by the deep neural network after the parameters are updated into a first generator, inputting the real text hash code into a first discriminator, and updating the network parameters of the first generator according to a first discrimination result output by the first discriminator;
inputting the new real text hash code generated by the deep neural network after the parameters are updated into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
A second aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect of the embodiments of the present invention when executing the computer program.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the embodiments of the present invention.
In the cross-modal hash retrieval method provided by the first aspect of the embodiments of the present invention, a real image hash code of real image data and a real text hash code of real text data are generated by a deep neural network; a similarity measure is acquired according to a common label between the real image data and the real text data; inter-modal loss optimization and intra-modal loss optimization are performed according to the real image hash code, the real text hash code, and the similarity measure, so as to update the parameters of the deep neural network and generate a public hash code; a new real image hash code generated by the deep neural network after the parameter update is input into the first generator, the real text hash code is input into the first discriminator, and the network parameters of the first generator are updated according to the first discrimination result output by the first discriminator; likewise, the new real text hash code generated by the updated network is input into the second generator, the real image hash code is input into the second discriminator, and the network parameters of the second generator are updated according to the second discrimination result output by the second discriminator. A bidirectional adversarial network model consisting of two symmetrical generative adversarial networks is thereby introduced into the deep cross-modal hash retrieval process to mitigate the information-richness asymmetry of synonymous heterogeneous data in cross-modal retrieval tasks; the inter-modal loss optimization, intra-modal loss optimization, and bidirectional adversarial network optimization processes are integrated into an end-to-end algorithm model framework; and an online hash code hot-update mechanism updates the real image hash code and the real text hash code in real time, so that organic connection and efficient multiplexing of the feature learning process and the hash learning process can be realized.
It is understood that the beneficial effects of the second aspect and the third aspect can be referred to in the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic flowchart of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 2 is a first schematic data flow diagram of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 3 is a second schematic data flow diagram of a cross-modal hash retrieval method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The cross-modal hash retrieval method provided by the embodiments of the present application can be applied to mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), servers, and other terminal devices, to realize retrieval functions related to images and texts, such as image-text mutual search, automatic image recognition, automatic diversified classified albums, and recognition of sensitive frames in videos. The embodiments of the present application do not set any limit on the specific type of the terminal device.
As shown in fig. 1, the cross-modal hash retrieval method provided in the embodiment of the present application includes the following steps S101 to S105:
step S101, generating a real image hash code of real image data and a real text hash code of real text data through a deep neural network.
In application, preset sample data can be input into the deep neural network, where the sample data comprises n samples and each sample comprises a real image and a real text which are synonymously heterogeneous with each other. The real image data includes n real images; each real image may specifically be an image with dimensions of 224 × 224 × 3, and may be an RGB image, a depth image, a grayscale image, or the like. The real text data includes n real texts, and each real image is synonymous with its corresponding real text; each real text may specifically be an unprocessed or processed word, sentence, paragraph, chapter, symbol, or character drawing, for example a 1368-dimensional bag-of-words model vector generated through word-frequency statistics, or a word vector processed through word embedding. The embodiment of the present application does not set any limit on the specific types of the real image and the real text.
Step S102, obtaining a similarity measure according to the common label between the real image data and the real text data.
In application, the common label includes the object categories to which the n samples belong, and the object categories may be represented as object category vectors. The common labels can be expressed in the form of a label matrix; the similarity matrix is then calculated from the label matrix, and the similarity measure is calculated from the similarity matrix. The similarity matrix is the portion of the label matrix's product with its own transpose that remains positive, i.e., two samples are similar if they share at least one label. The similarity measure serves as supervisory information in the deep neural network training process, and includes the inter-modal similarity measure between the real image data and the real text data, the intra-modal similarity measure of the real image data, and the intra-modal similarity measure of the real text data.
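As an illustration of this construction, a minimal PyTorch sketch follows (the helper name and tensor shapes are assumptions for illustration, not part of the patent):

```python
import torch

def similarity_from_labels(L):
    """Similarity matrix from a label matrix L of shape (l, n), where
    column j is the multi-hot object-class vector shared by the j-th
    real image / real text pair.

    S[i, j] = 1 where samples i and j share at least one object class,
    i.e. the positive part of the transpose product, else 0.
    """
    overlap = L.t() @ L            # (n, n) counts of shared labels
    return (overlap > 0).float()   # keep only the positive part

# Toy usage: 3 object classes, 4 samples.
L = torch.tensor([[1., 1., 0., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
S = similarity_from_labels(L)      # S[0, 1] == 1, S[0, 2] == 0
```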
In one embodiment, step S101 includes:
extracting the image characteristics of each real image in the real image data through a pre-training model;
acquiring a real image hash code of each real image according to the image characteristics of each real image through a hyperbolic tangent function;
extracting text features of each real text in the real text data through a full-connection network;
acquiring a real text hash code of each real text according to the text characteristics of each real text by a hyperbolic tangent function;
step S102, comprising:
constructing a label matrix according to a common label between the real image data and the real text data;
and calculating a similarity matrix according to the label matrix, wherein the similarity matrix is used as a similarity measure in the process of training the deep neural network.
In application, for the real image data, the pre-training model may specifically be a VGG19 model, an AlexNet model, a CNN-F (Convolutional Neural Network) model, a residual network (ResNet) model, a GoogLeNet model, or the like, and is used to extract the convolution features of each real image; the hash code of each real image is then obtained through the hyperbolic tangent (tanh) function, yielding the hash code vector P ∈ R^(k×n) of the n real images, where k is the number of binary bits of the hash code. For the real text data, a fully connected network can be used to extract features, and the hash code of each real text is obtained through the hyperbolic tangent function, yielding the hash code vector T ∈ R^(k×n) of the n real texts.
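For concreteness, the two hash-code generation branches described above might be sketched as follows (a hedged PyTorch sketch: the choice of VGG19 as the pre-training model follows the text, while k = 64 hash bits and the hidden-layer widths are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

k = 64  # number of binary hash bits (an assumed setting)

# Image branch: a pre-trained VGG19 whose last layer is replaced so the
# convolution features are mapped to k hash bits, squashed into [-1, 1]
# by the hyperbolic tangent function.
vgg = models.vgg19(weights="IMAGENET1K_V1")
vgg.classifier[6] = nn.Linear(4096, k)
image_net = nn.Sequential(vgg, nn.Tanh())

# Text branch: a fully connected network over 1368-dimensional
# bag-of-words vectors, likewise ending in tanh.
text_net = nn.Sequential(
    nn.Linear(1368, 4096), nn.ReLU(),
    nn.Linear(4096, k), nn.Tanh(),
)

images = torch.randn(8, 3, 224, 224)  # dummy 224x224x3 real images
texts = torch.rand(8, 1368)           # dummy bag-of-words vectors
P = image_net(images).t()             # (k, n) real image hash codes
T = text_net(texts).t()               # (k, n) real text hash codes
```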
S103, inter-modal loss optimization and intra-modal loss optimization are carried out according to the real image hash code, the real text hash code and the similarity measurement, so that the parameters of the deep neural network are updated, and a public hash code is generated.
In application, inter-modal loss optimization is the basis of the cross-modal retrieval algorithm optimization. Specifically, inter-modal loss optimization is performed according to the real image hash code, the real text hash code, and the inter-modal similarity measure between the real image data and the real text data, so as to update the parameters of the deep neural network and generate a public hash code.
In one embodiment, the objective function of the inter-modal loss optimization is expressed as follows:

$$\min_{H,\theta_x,\theta_y} C_1=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Theta_{ij}-\log\bigl(1+e^{\Theta_{ij}}\bigr)\Bigr)+\Bigl(\lVert H-P\rVert_F^2+\lVert H-T\rVert_F^2\Bigr)+\Bigl(\lVert P\mathbf{1}\rVert_F^2+\lVert T\mathbf{1}\rVert_F^2\Bigr),\qquad \Theta_{ij}=\tfrac{1}{2}P_{*i}^{\top}T_{*j}$$

where H represents a default hash code, θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C1 represents the loss term of the inter-modal loss, n represents the number of samples, i represents the index of the real image, j represents the index of the real text, Sij represents the similarity matrix, Θij represents the inner product corresponding to the similarity matrix entry, P represents the new real image hash code vector of the n samples, T represents the new real text hash code vector of the n samples, ‖·‖F represents the F-norm, and 1 represents the all-ones vector. The first term of the inter-modal loss optimization objective function uses a negative log-likelihood function as the inter-modal similarity measure.
In application, the likelihood function is defined as:

$$p\bigl(S_{ij}\mid P_{*i},T_{*j}\bigr)=\begin{cases}\sigma(\Theta_{ij}), & S_{ij}=1\\ 1-\sigma(\Theta_{ij}), & S_{ij}=0\end{cases}\qquad\text{with}\qquad \sigma(\Theta_{ij})=\frac{1}{1+e^{-\Theta_{ij}}}$$
the real image hash code and the real text hash code can be obtained by the second item of the objective function of the inter-modal loss optimization, the balance of the hash codes is guaranteed by the third item, and the proportion of +1 to-1 in the hash bits generated by all the training samples is the same. During a specific training procedure, the following gradients were calculated, respectively:
the fixed real text input and default hash code H calculate the following gradient:
Figure BDA0002783467610000074
the fixed real image input and the default hash code H calculate the following gradient:
Figure BDA0002783467610000075
and updating parameters of the deep neural network through random gradient descent and error back propagation, and generating a public hash code.
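The loss and update can be sketched directly in PyTorch (a hedged sketch: the relative weights of the three terms are left at 1, since the text does not fix them, and autograd replaces the hand-derived gradients above):

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(P, T, H, S):
    """Sketch of C1. P, T: (k, n) continuous image/text hash codes;
    H: (k, n) default hash code; S: (n, n) similarity matrix."""
    theta = 0.5 * P.t() @ T                       # Theta_ij = P_i . T_j / 2
    nll = -(S * theta - F.softplus(theta)).sum()  # negative log-likelihood
    quant = (H - P).pow(2).sum() + (H - T).pow(2).sum()
    ones = torch.ones(P.size(1), 1)               # all-ones vector
    balance = (P @ ones).pow(2).sum() + (T @ ones).pow(2).sum()
    return nll + quant + balance

# One alternating update: fix one modality (detach it), backpropagate
# through the other, then refresh the public hash code.
# loss = inter_modal_loss(P, T.detach(), H, S)
# loss.backward(); optimizer.step()
# H_public = torch.sign(P + T)
```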
In application, because many similar samples exist within the same modality, optimizing the intra-modal loss can also effectively improve the retrieval effect. Specifically, intra-modal loss optimization is performed according to the real image hash code, the real text hash code, the intra-modal similarity measure of the real image data, and the intra-modal similarity measure of the real text data, so as to update the parameters of the deep neural network and generate a public hash code.
In one embodiment, the objective function of the intra-modal loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_2=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Phi_{ij}-\log\bigl(1+e^{\Phi_{ij}}\bigr)\Bigr)-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Psi_{ij}-\log\bigl(1+e^{\Psi_{ij}}\bigr)\Bigr),\qquad \Phi_{ij}=\tfrac{1}{2}P_{*i}^{\top}P_{*j},\ \ \Psi_{ij}=\tfrac{1}{2}T_{*i}^{\top}T_{*j}$$

where C2 represents the loss term of the intra-modal loss; the first term of the intra-modal loss optimization objective function uses a negative log-likelihood function as the intra-modal similarity measure of the real image data, and the second term uses a negative log-likelihood function as the intra-modal similarity measure of the real text data.
In application, in the intra-modal loss optimization process, a negative log-likelihood function is still used as the similarity measure within each of the real image data and real text data modalities, and the hash code of each modality is updated. As in the inter-modal loss optimization, the input of one modality and the default hash code are fixed, the gradient with respect to the other modality is then calculated, the parameters of the deep neural network are updated with a training method similar to the above, and a public hash code is generated. A sketch of the corresponding loss follows.
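Under the same assumptions as the inter-modal sketch, the intra-modal counterpart replaces the cross-modal inner products with within-modality ones:

```python
import torch.nn.functional as F

def intra_modal_loss(P, T, S):
    """Sketch of C2: the negative log-likelihood similarity measure
    applied inside the image modality and inside the text modality."""
    theta_img = 0.5 * P.t() @ P   # image-image inner products
    theta_txt = 0.5 * T.t() @ T   # text-text inner products
    nll_img = -(S * theta_img - F.softplus(theta_img)).sum()
    nll_txt = -(S * theta_txt - F.softplus(theta_txt)).sum()
    return nll_img + nll_txt
```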
In one embodiment, the expression of the public hash code is as follows:
H’=sign(P+T);
where H' represents the public hash code and sign() represents the sign function.
Step S104, inputting a new real image hash code generated by the deep neural network after updating parameters into a first generator, inputting the real text hash code into a first discriminator, and updating network parameters of the first generator according to a first discrimination result output by the first discriminator;
step S105, inputting the new real text hash code generated by the deep neural network after updating the parameters into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
In application, each discriminator uses the static hash code generated by one modality as real data: the first discriminator uses the real text hash code as real data, and the second discriminator uses the real image hash code as real data. Each generator uses the new hash code dynamically generated by the other modality as noise input: the first generator uses the new real image hash code as noise input, and the second generator uses the new real text hash code as noise input. Because of the chain rule, updating the network parameters of a generator also updates the parameters of the hash code generation network (namely, the deep neural network) of the corresponding modality. This enables each modality to generate hash codes closer to those of the other modality (whose hash codes serve as the discriminator's reference for real data), thereby ultimately improving the cross-modal retrieval performance.
In application, bidirectional adversarial optimization is performed by a bidirectional countermeasure module according to the hash codes of the two modalities generated previously. The bidirectional countermeasure module is composed of two symmetrical generative adversarial networks, each consisting of the two basic units of a generator and a discriminator: the first generator and the first discriminator form one generative adversarial network, and the second generator and the second discriminator form the other. The objective function of a generative adversarial network is expressed as follows:

$$\min_G\max_D V(D,G)=\mathbb{E}_{r\sim p_{\mathrm{data}}(r)}\bigl[\log D(r)\bigr]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$

where r represents a real image, z represents the input noise of the generator G, G(·) represents the image generated by the generator, D(·) represents the probability that the discriminator judges its input to be real (if the discriminator receives a real image r, D(r) should approach 1), and E[·] represents the mathematical expectation. The goal of the generator is to make the discriminator judge its generated images as real, i.e., to make D(G(z)) as large as possible; the goal of the discriminator is to make D(r) as large as possible and D(G(z)) as small as possible. The generator and the discriminator are thus adversaries in a two-player game. The first term of the objective function is the logarithmic expectation of the discriminator on real data, and the second term is the logarithmic expectation of the discriminator on generated data. The purpose of training the generative adversarial network is to update the parameters of the generator and the discriminator so as to improve model performance, such that the images output by the generator come as close as possible to real images and pass the discriminator's judgment.
In one embodiment, the gradient expression of the first discriminator is as follows:

$$\nabla_{\theta_{d_1}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_1\bigl(\hat{T}_{*i}\bigr)+\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)\Bigr]$$

The gradient expression of the first generator is as follows:

$$\nabla_{\theta_{g_1}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr),\qquad H_x=\operatorname{sign}(P)$$

where θd1 denotes the parameters of the first discriminator, T̂ represents the real text hash code vector of the n samples, D1(·) represents the probability that the first discriminator judges its input to be real, G(Hx) represents the virtual image hash code generated by the first generator, Hx represents the new real image hash code, θg1 denotes the parameters of the first generator, sign(·) represents the sign function, and P represents the new real image hash code vector of the n samples.
the gradient expression of the second discriminator is as follows:
Figure BDA0002783467610000105
the gradient expression of the second generator is as follows:
Figure BDA0002783467610000106
Figure BDA0002783467610000107
wherein the content of the first and second substances,
Figure BDA0002783467610000108
a gradient representing the second discriminator,
Figure BDA0002783467610000109
representing said real image hash code, D2() Representing the probability that the second discriminator discriminates the image as genuine, G (H)y) Representing a virtual text hash code, H, generated by said second generatoryRepresenting the new real-text hash code,
Figure BDA00027834676100001010
representing the gradient of the second generator and,
Figure BDA00027834676100001011
a true image hash code vector representing the n samples, and T represents a new true text hash code vector of the n samples.
In application, the optimization process of the bidirectional countermeasure module is a relatively independent process. In consideration of training efficiency, the Adam optimizer is adopted to cope with the sparse gradients and significant noise sensitivity of the bidirectional countermeasure network. In the specific training process, the existing text hash code T̂ is first used as the input of the first discriminator, and the dynamically generated image hash code Hx is used as the noise input of the first generator; in the same way, the existing image hash code P̂ is used as the input of the second discriminator, and the dynamically generated text hash code Hy is used as the noise input of the second generator. The optimization process focuses on updating the parameters of the generators, because they are associated with the hash code generation network parameters of the corresponding modality. The hash codes themselves are not updated in this process; the retrieval effect is improved indirectly by improving the performance of the generators.
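One optimization step of the first adversarial pair might then look as follows (a sketch with assumed MLP architectures and learning rates; the second pair is symmetric, with the roles of image and text swapped):

```python
import torch
import torch.nn as nn

k = 64  # hash bits, as above
G1 = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, k), nn.Tanh())
D1 = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g1 = torch.optim.Adam(G1.parameters(), lr=1e-4)  # Adam, per the text
opt_d1 = torch.optim.Adam(D1.parameters(), lr=1e-4)
bce = nn.BCELoss()

def adversarial_step(T_hat, H_x):
    """T_hat: (n, k) existing real text hash codes (real data for D1);
    H_x: (n, k) dynamically generated image hash codes (noise for G1)."""
    real = torch.ones(T_hat.size(0), 1)
    fake = torch.zeros(H_x.size(0), 1)
    # Discriminator step: push D1(T_hat) -> 1 and D1(G1(H_x)) -> 0.
    d_loss = bce(D1(T_hat), real) + bce(D1(G1(H_x).detach()), fake)
    opt_d1.zero_grad(); d_loss.backward(); opt_d1.step()
    # Generator step: push D1(G1(H_x)) -> 1. If H_x carries gradients,
    # the chain rule also propagates gradients back into the image
    # hash-generation network, as described in the text.
    g_loss = bce(D1(G1(H_x)), real)
    opt_g1.zero_grad(); g_loss.backward(); opt_g1.step()
```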
In one embodiment, after step S101, the method further includes:
obtaining a classification result of the real image data according to the image characteristics of each real image through an S-shaped function;
and obtaining the classification result of the real text data according to the text features of each real text through an S-shaped function.
In application, for the real image data, each real image is classified through an S-shaped (sigmoid) function to obtain the classification result vector U ∈ R^(l×n) of the n real images, where l is the number of object classes. Similarly, each real text is classified through an S-shaped function to obtain the classification result vector V ∈ R^(l×n) of the n real texts.
In one embodiment, the method further comprises:
and performing classification loss optimization according to the label matrix, the classification result of the real image data and the classification result of the real text data.
In application, in order to fully utilize label information, a classification loss optimization process is added in the cross-modal Hash retrieval method.
In one embodiment, the objective function of the classification loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_3=-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log U_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-U_{IJ}\bigr)\Bigr)-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log V_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-V_{IJ}\bigr)\Bigr)$$

where θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C3 represents the loss term of the classification loss, l represents the number of object classes, I represents the index of the object class to which a sample belongs, n represents the number of samples, J represents the index of the sample, LIJ represents the label matrix, UIJ represents the classification result vector of the real image data, and VIJ represents the classification result vector of the real text data. The first and second terms of the classification loss optimization objective function both use negative log-likelihood functions.
In application, the classification optimization process uses the same likelihood function as the inter-modal loss optimization, except that the similarity matrix is replaced by the label matrix. Notably, the classification optimization process does not introduce hash codes into the computation, because the classification result output of the network is parallel to the hash code output and is close in form to the label vector. A similar training approach is still used to compute gradients and update the network parameters, but the classification optimization process does not update the hash code. The retrieval effect can be indirectly improved by improving the classification performance of the deep neural network.
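Reading the negative log-likelihood over sigmoid outputs as a binary cross-entropy against the label matrix, a sketch of this loss is:

```python
import torch.nn.functional as F

def classification_loss(U, V, L):
    """Sketch of C3. U, V: (l, n) sigmoid classification outputs for the
    image and text modalities; L: (l, n) label matrix. No hash codes are
    involved, matching the parallel classification output in the text."""
    return F.binary_cross_entropy(U, L) + F.binary_cross_entropy(V, L)
```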
Fig. 2 and 3 are schematic diagrams illustrating data flow corresponding to a cross-modal hash retrieval method.
The cross-modal hash retrieval method provided by the embodiments of the present invention introduces a bidirectional adversarial network model, consisting of two symmetrical generative adversarial networks, into the deep cross-modal hash retrieval process to mitigate the information-richness asymmetry of synonymous heterogeneous data in cross-modal retrieval tasks; it integrates the inter-modal loss optimization, intra-modal loss optimization, and bidirectional adversarial network optimization processes into an end-to-end algorithm model framework; and it uses an online hash code hot-update mechanism to update the real image hash code and the real text hash code in real time, thereby realizing organic connection and efficient multiplexing of the feature learning process and the hash learning process.
Throughout the cross-modal hash retrieval method, a default hash code is first randomly generated, the loss functions are constructed and their gradients calculated, and the hash code is then continuously updated in real time using the image and text data over multiple iterations. Generating the hash code requires the two processes of feature learning and hash learning; the advantage of the online hash code hot-update mechanism is that the network parameters for feature extraction and hash generation can be continuously updated in real time throughout the training optimization process, linking feature learning and hash learning into an organic whole, as sketched below.
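A condensed sketch of that loop, reusing the helpers sketched earlier (num_epochs, the joint optimizer over both branches, and the data tensors images, texts, S are assumptions):

```python
H = torch.sign(torch.randn(k, images.size(0)))   # random default hash code
for epoch in range(num_epochs):                  # num_epochs is assumed
    P = image_net(images).t()                    # fresh image hash codes
    T = text_net(texts).t()                      # fresh text hash codes
    loss = inter_modal_loss(P, T, H, S) + intra_modal_loss(P, T, S)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Bidirectional module: recompute codes so the adversarial step has
    # its own graph (one symmetric call per generative adversarial pair).
    adversarial_step(text_net(texts).detach(), image_net(images))
    H = torch.sign((P + T).detach())             # hot-update the hash code
```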
Offline parameter locking is an efficient multiplexing mechanism used in the later-stage tuning of the cross-modal hash retrieval method. After a preliminary model is obtained through training, key parameters of part of the network's feature extraction layers (such as node weights) and the overall hyper-parameters of the cross-modal hash retrieval method can be locked on demand, preserving a basic optimal configuration, after which targeted optimization is carried out.
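In code, such locking can be sketched by freezing the chosen parameters and rebuilding the optimizer over the remainder (which layers to lock is a per-task choice; names are reused from the earlier sketches):

```python
# Lock the key feature-extraction weights of the image branch.
for param in vgg.features.parameters():
    param.requires_grad = False

# Keep optimizing only the parameters that remain unlocked.
trainable = [p for p in list(image_net.parameters())
             + list(text_net.parameters()) if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)
```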
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
As shown in fig. 4, an embodiment of the present application further provides a terminal device 100, including: at least one processor 10 (only one processor is shown in fig. 4), a memory 11, and a computer program 12 stored in the memory 11 and executable on the at least one processor 10, the steps in the various cross-modal hash retrieval method embodiments described above being implemented when the computer program 12 is executed by the processor 10.
In application, the terminal device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 4 is merely an example of a terminal device, and does not constitute a limitation of the terminal device, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, etc.
In application, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In some embodiments, the storage may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing an operating system, an application program, a Boot Loader (Boot Loader), data, and other programs, such as program codes of computer programs. The memory may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a network device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor implements the steps of the foregoing cross-modal hash retrieval method embodiments when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when being executed by a processor, the computer program implements the steps in the foregoing cross-modal hash search method embodiments.
The embodiment of the present application provides a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the foregoing cross-modal hash retrieval method embodiments when executed.
All or part of the flow in the method of the embodiments described above can be implemented by a computer program that instructs related hardware to complete, and the computer program can be stored in a computer readable storage medium, and when being executed by a processor, the computer program can implement the steps of the embodiments of the methods described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or apparatus capable of carrying computer program code to a terminal device, recording medium, computer Memory, Read-Only Memory (ROM), Random-Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative algorithmic steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative: the division of units is merely a logical functional division, and in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A cross-modal hash retrieval method is characterized by comprising the following steps:
generating a real image hash code of the real image data and a real text hash code of the real text data through a deep neural network;
acquiring similarity measurement according to a common label between the real image data and the real text data;
performing inter-modal loss optimization and intra-modal loss optimization according to the real image hash code, the real text hash code and the similarity measure to update parameters of the deep neural network and generate a public hash code;
inputting a new real image hash code generated by the deep neural network after the parameters are updated into a first generator, inputting the real text hash code into a first discriminator, and updating the network parameters of the first generator according to a first discrimination result output by the first discriminator;
inputting the new real text hash code generated by the deep neural network after the parameters are updated into a second generator, inputting the real image hash code into a second discriminator, and updating the network parameters of the second generator according to a second discrimination result output by the second discriminator.
2. The cross-modal hash retrieval method of claim 1, wherein the generating a true image hash code of the true image data and a true text hash code of the true text data by the deep neural network comprises:
extracting the image characteristics of each real image in the real image data through a pre-training model; wherein the real image data includes n real images;
acquiring a real image hash code of each real image according to the image characteristics of each real image through a hyperbolic tangent function;
extracting text features of each real text in the real text data through a full-connection network; the real text data comprises n real texts, and each real image is synonymously heterogeneous with each corresponding real text;
acquiring a real text hash code of each real text according to the text characteristics of each real text by a hyperbolic tangent function;
the obtaining a similarity measure according to a common label between the real image data and the real text data includes:
constructing a label matrix according to a common label between the real image data and the real text data; wherein the common label comprises object categories to which n samples belong, and each sample comprises a true image and a true text which are synonymously heterogeneous;
and calculating a similarity matrix used as a similarity measure in the process of training the deep neural network according to the label matrix.
3. The cross-modal hash retrieval method of claim 2, wherein the performing inter-modal loss optimization and intra-modal loss optimization based on the real image hash code, the real text hash code, and the similarity metric to update parameters of the deep neural network and generate a common hash code comprises:
performing inter-modal loss optimization according to the real image hash code, the real text hash code and the inter-modal similarity measure between the real image data and the real text data to update parameters of the deep neural network and generate a common hash code;
and performing intra-modal loss optimization according to the real image hash code, the real text hash code, the intra-modal similarity measure of the real image data and the intra-modal similarity measure of the real text data to update the parameters of the deep neural network and generate a public hash code.
4. The cross-modal hash retrieval method of claim 3, wherein the objective function of the inter-modal loss optimization is expressed as follows:

$$\min_{H,\theta_x,\theta_y} C_1=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Theta_{ij}-\log\bigl(1+e^{\Theta_{ij}}\bigr)\Bigr)+\Bigl(\lVert H-P\rVert_F^2+\lVert H-T\rVert_F^2\Bigr)+\Bigl(\lVert P\mathbf{1}\rVert_F^2+\lVert T\mathbf{1}\rVert_F^2\Bigr)$$

$$\Theta_{ij}=\tfrac{1}{2}P_{*i}^{\top}T_{*j}$$

where H represents a default hash code, θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C1 represents the loss term of the inter-modal loss, n represents the number of samples, i represents the index of the real image, j represents the index of the real text, Sij represents the similarity matrix, Θij represents the inner product corresponding to the similarity matrix entry, P represents the new real image hash code vector of the n samples, T represents the new real text hash code vector of the n samples, ‖·‖F represents the F-norm, and 1 represents the all-ones vector; the first term of the inter-modal loss optimization objective function uses a negative log-likelihood function as the inter-modal similarity measure;

the objective function of the intra-modal loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_2=-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Phi_{ij}-\log\bigl(1+e^{\Phi_{ij}}\bigr)\Bigr)-\sum_{i=1}^{n}\sum_{j=1}^{n}\Bigl(S_{ij}\Psi_{ij}-\log\bigl(1+e^{\Psi_{ij}}\bigr)\Bigr),\qquad \Phi_{ij}=\tfrac{1}{2}P_{*i}^{\top}P_{*j},\ \ \Psi_{ij}=\tfrac{1}{2}T_{*i}^{\top}T_{*j}$$

where C2 represents the loss term of the intra-modal loss; the first term of the intra-modal loss optimization objective function uses a negative log-likelihood function as the intra-modal similarity measure of the real image data, and the second term uses a negative log-likelihood function as the intra-modal similarity measure of the real text data;

the expression of the common hash code is as follows:

H'=sign(P+T);

where H' represents the common hash code and sign() represents the sign function.
5. The cross-modal hash retrieval method of claim 2, wherein the gradient expression of the first discriminator is as follows:

$$\nabla_{\theta_{d_1}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_1\bigl(\hat{T}_{*i}\bigr)+\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)\Bigr]$$

the gradient expression of the first generator is as follows:

$$\nabla_{\theta_{g_1}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_1\bigl(G(H_x)_{*i}\bigr)\Bigr)$$

Hx=sign(P);

where θd1 denotes the parameters of the first discriminator, T̂ represents the real text hash code vector of the n samples, D1(·) represents the probability that the first discriminator judges its input to be real, G(Hx) represents the virtual image hash code generated by the first generator, Hx represents the new real image hash code, θg1 denotes the parameters of the first generator, sign(·) represents the sign function, and P represents the new real image hash code vector of the n samples;

the gradient expression of the second discriminator is as follows:

$$\nabla_{\theta_{d_2}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\log D_2\bigl(\hat{P}_{*i}\bigr)+\log\Bigl(1-D_2\bigl(G(H_y)_{*i}\bigr)\Bigr)\Bigr]$$

the gradient expression of the second generator is as follows:

$$\nabla_{\theta_{g_2}}\frac{1}{n}\sum_{i=1}^{n}\log\Bigl(1-D_2\bigl(G(H_y)_{*i}\bigr)\Bigr)$$

Hy=sign(T);

where θd2 denotes the parameters of the second discriminator, P̂ represents the real image hash code vector of the n samples, D2(·) represents the probability that the second discriminator judges its input to be real, G(Hy) represents the virtual text hash code generated by the second generator, Hy represents the new real text hash code, θg2 denotes the parameters of the second generator, and T represents the new real text hash code vector of the n samples.
6. The cross-modal hash retrieval method of any of claims 2 to 5, wherein each of the real images is an image having dimensions of 224 × 224 × 3;
each real text is a 1368-dimensional bag-of-words model vector generated through word-frequency statistics;
each of the common labels comprises a vector of object classes to which the n samples belong.
7. The cross-modal hash retrieval method of any of claims 2 to 5, wherein after extracting the image features of each real image in the real image data through the pre-trained model, the method further comprises:
obtaining a classification result of the real image data according to the image characteristics of each real image through an S-shaped function;
after extracting the text features of each text in the real text data through the full-connection network, the method further comprises the following steps:
and obtaining the classification result of the real text data according to the text features of each real text through an S-shaped function.
8. The cross-modal hash retrieval method of claim 7, wherein the method further comprises:
and performing classification loss optimization according to the label matrix, the classification result of the real image data and the classification result of the real text data.
9. The cross-modal hash retrieval method of claim 8, wherein the objective function of the classification loss optimization is expressed as follows:

$$\min_{\theta_x,\theta_y} C_3=-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log U_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-U_{IJ}\bigr)\Bigr)-\sum_{I=1}^{l}\sum_{J=1}^{n}\Bigl(L_{IJ}\log V_{IJ}+\bigl(1-L_{IJ}\bigr)\log\bigl(1-V_{IJ}\bigr)\Bigr)$$

where θx represents the parameters related to the real image in the deep neural network, x represents the real image, θy represents the parameters related to the real text in the deep neural network, y represents the real text, C3 represents the loss term of the classification loss, l represents the number of object classes, I represents the serial number of the object class to which a sample belongs, n represents the number of samples, J represents the serial number of the sample, LIJ represents the label matrix, UIJ represents the classification result vector of the real image data, and VIJ represents the classification result vector of the real text data; the first and second terms of the classification loss optimization objective function both use negative log-likelihood functions.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202011289807.3A 2020-11-17 2020-11-17 Cross-modal hash retrieval method, terminal equipment and storage medium Active CN112364198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011289807.3A CN112364198B (en) 2020-11-17 2020-11-17 Cross-modal hash retrieval method, terminal equipment and storage medium


Publications (2)

Publication Number / Publication Date
CN112364198A: 2021-02-12
CN112364198B: 2023-06-30

Family

ID=74532489

Family Applications (1)

Application Number: CN202011289807.3A (Active; granted as CN112364198B); priority date 2020-11-17; filing date 2020-11-17; title: Cross-modal hash retrieval method, terminal equipment and storage medium

Country Status (1)

CN: CN112364198B



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
US20200073968A1 (en) * 2018-09-04 2020-03-05 Inception Institute of Artificial Intelligence, Ltd. Sketch-based image retrieval techniques using generative domain migration hashing
CN110765281A (en) * 2019-11-04 2020-02-07 山东浪潮人工智能研究院有限公司 Multi-semantic depth supervision cross-modal Hash retrieval method
CN111639240A (en) * 2020-05-14 2020-09-08 山东大学 Cross-modal Hash retrieval method and system based on attention awareness mechanism

Non-Patent Citations (1)

Title
WENMING CAO et al.: "A Review of Hashing Methods for Multimodal Retrieval", IEEE Access *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN112836068A (en) * 2021-03-24 2021-05-25 南京大学 Unsupervised cross-modal Hash retrieval method based on noisy label learning
CN112836068B (en) * 2021-03-24 2023-09-26 南京大学 Unsupervised cross-modal hash retrieval method based on noisy tag learning
CN113610080A (en) * 2021-08-04 2021-11-05 北京邮电大学 Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN113610080B (en) * 2021-08-04 2023-08-25 北京邮电大学 Cross-modal perception-based sensitive image identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN112364198B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2022104540A1 (en) Cross-modal hash retrieval method, terminal device, and storage medium
Liu et al. Towards unsupervised deep graph structure learning
CN110162593B (en) Search result processing and similarity model training method and device
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN105022754B (en) Object classification method and device based on social network
CN111382868A (en) Neural network structure search method and neural network structure search device
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
US11423307B2 (en) Taxonomy construction via graph-based cross-domain knowledge transfer
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN112214775A (en) Injection type attack method and device for graph data, medium and electronic equipment
CN113298152B (en) Model training method, device, terminal equipment and computer readable storage medium
CN112364198B (en) Cross-modal hash retrieval method, terminal equipment and storage medium
CN112699375A (en) Block chain intelligent contract security vulnerability detection method based on network embedded similarity
CN115204886A (en) Account identification method and device, electronic equipment and storage medium
AlGarni et al. An efficient convolutional neural network with transfer learning for malware classification
US11822590B2 (en) Method and system for detection of misinformation
CN114826681A (en) DGA domain name detection method, system, medium, equipment and terminal
CN111310743B (en) Face recognition method and device, electronic equipment and readable storage medium
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN111444335B (en) Method and device for extracting central word
CN113259369B (en) Data set authentication method and system based on machine learning member inference attack
US20220230014A1 (en) Methods and systems for transfer learning of deep learning model based on document similarity learning
CN110909777A (en) Multi-dimensional feature map embedding method, device, equipment and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant