CN112199375A - Cross-modal data processing method and device, storage medium and electronic device

Publication number: CN112199375A (granted as CN112199375B)
Application number: CN202011063096.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 董西伟, 严军荣, 张小龙
Assignee (original and current): Sunwave Communications Co Ltd
Prior art keywords: modality, data, target, network model, attention
Related application: PCT/CN2021/091215 (WO2022068196A1)
Legal status: granted; active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2228: Indexing structures
    • G06F 16/2255: Hash tables
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

An embodiment of the invention provides a cross-modal data processing method and device, a storage medium, and an electronic device. The method comprises the following steps: acquiring query data of a first modality; determining a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters; and determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters. Object feature data are used as a bridge to effectively associate the first modality with the second modality, so that the semantic gap between different modalities can be alleviated. This solves the technical problems in the related art that cross-modal data processing is difficult to realize effectively and that cross-modal data processing methods perform poorly, and achieves the technical effects of improving the efficiency of cross-modal data processing and optimizing cross-modal data processing performance.

Description

Cross-modal data processing method and device, storage medium and electronic device
Technical Field
The embodiment of the invention relates to the field of communications, and in particular to a cross-modal data processing method and device, a storage medium, and an electronic device.
Background
In practice, objects may be described with features from different modalities; for example, on social platforms such as WeChat, people often record an event using pictures and corresponding text. Cross-modal retrieval aims to use an instance in one modality to retrieve instances in another modality that are semantically similar to it, e.g., to use an image to retrieve documents related to it. With the development of multimedia technology, the amount of multimodal data is growing rapidly, and how to accomplish information retrieval between different modalities on large-scale multimodal datasets is a very challenging problem. For this problem, hashing methods have attracted wide attention in the cross-modal retrieval field because of their low storage cost and high retrieval speed.
The inconsistent data distributions and representations of different modalities make it very difficult to measure similarity between modalities directly. This difficulty, also referred to as the "modality gap", is a major obstacle to the performance of cross-modal hash retrieval; because of it, the retrieval performance of existing cross-modal hashing methods cannot meet users' requirements. Moreover, most existing cross-modal hash retrieval methods based on shallow structures use hand-crafted features, which do not generalize across different cross-modal retrieval tasks. The discriminative power of the hash codes learned by these methods is therefore limited, and the retrieval performance of shallow cross-modal hashing methods falls short of the optimum.
Therefore, in the current related art, the efficiency of cross-modal data processing is low, and its performance is far from meeting user requirements.
For the technical problems that cross-modal data processing is difficult to realize efficiently in the related art and that the methods performing it perform poorly, no effective solution has been proposed so far.
Disclosure of Invention
Embodiments of the present invention provide a cross-modal data processing method, device, storage medium, and electronic device, so as to at least solve the technical problems that cross-modal data processing is difficult to implement efficiently in the related art and that the performance of methods for cross-modal data processing is poor.
According to an embodiment of the present invention, there is provided a cross-modality data processing method, including:
acquiring query data of a first modality; determining a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, each piece of retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, and each target parameter is used for indicating the similarity between the query data of the first modality and one piece of retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a group of sample pairs, and comprises a first modality attention network model and a second modality attention network model, both obtained by training based on an initial attention model, as well as a modality consistency model used for maintaining data consistency between the first modality and the second modality; each sample pair in the group of sample pairs comprises sample data and object feature data, the object feature data being obtained by image object detection; and determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
Optionally, before the query data of the first modality are acquired, the method further comprises: acquiring a cross-modal data set, wherein the cross-modal data set comprises a training data set and a test data set; training an initial neural network model using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modality consistency model, and the target neural network model is used for learning a hash representation for data of different modalities; inputting the test data set into the target neural network model to obtain a similarity indicating how similar first-modality data and second-modality data are; and determining predetermined parameters in the initial neural network model based on the similarity so as to update the target neural network model.
Optionally, acquiring the cross-modal data set comprises: extracting a feature data set of the first modality using a convolutional neural network; extracting a feature data set of the second modality using a long short-term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; and determining the remaining feature data in the feature data set of the first modality and the feature data set of the second modality as the test data set.
Optionally, training an initial neural network model using the training data set to obtain a target neural network model, including: inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model; determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model; determining the target object attention neural network model and the target modality consistency model as the target neural network model.
Optionally, inputting feature data of a first modality in the training data set and the object feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and inputting feature data of a second modality in the training data set and the object feature data into the second modality attention network model for training to obtain a trained second modality attention network model, including: under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.
Optionally, constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model includes: performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space; inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.
Optionally, determining one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters includes: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to the plurality of pieces of retrieval data of the second modality in the retrieval data set of the second modality; calculating the Hamming distance between the third hash code and each hash code in the group of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes whose Hamming distance is smaller than or equal to a preset threshold as the one or more pieces of retrieval data of the second modality corresponding to the query data of the first modality.
According to another embodiment of the present invention, there is provided a cross-modal data processing apparatus, including: an acquisition module, configured to acquire query data of a first modality; and a processing module, configured to determine a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, each piece of retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, and each target parameter is used for indicating the similarity between the query data of the first modality and one piece of retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a group of sample pairs, and comprises a first modality attention network model and a second modality attention network model, both obtained by training based on an initial attention model, as well as a modality consistency model used for maintaining data consistency between the first modality and the second modality; each sample pair in the group of sample pairs comprises sample data and object feature data, the object feature data being obtained by image object detection;
and a determining module, configured to determine one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters.
According to yet another embodiment of the invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps in any of the above method embodiments.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in any of the above method embodiments when executing the computer program.
According to the invention, query data of a first modality are acquired; a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality is determined to obtain a plurality of target parameters; and one or more pieces of retrieval data of the second modality are determined as target data corresponding to the query data of the first modality according to the target parameters. By using object feature data as a bridge, the first modality and the second modality are effectively associated, so that the semantic gap between different modalities can be alleviated. This solves the technical problems in the related art that cross-modal data processing is difficult to realize effectively and that cross-modal data processing methods perform poorly, improves the efficiency of cross-modal data processing, and achieves the technical effect of optimized cross-modal data processing performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of the hardware structure of a mobile terminal for an alternative cross-modal data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;
FIG. 3 is a schematic flow diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;
FIG. 5 is a block diagram of an alternative cross-modality data processing apparatus, according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the present invention running on a mobile terminal, fig. 1 is a block diagram of the hardware structure of the mobile terminal for a cross-modal data processing method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of an application software and a module, such as a computer program corresponding to the cross-modal data processing method in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a cross-modal data processing method operating on a mobile terminal, a computer terminal, or a similar computing device is provided, fig. 2 is a schematic flowchart of an alternative cross-modal data processing method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
according to an embodiment of the present invention, there is provided a cross-modality data processing method, including:
s202, acquiring query data of a first mode;
s204, respectively determining a target parameter between the query data of the first modality and the retrieval data of each second modality in the retrieval data set of the second modality to obtain a plurality of target parameters, wherein the retrieval data set of the second modality comprises a plurality of retrieval data of the second modality, the retrieval data of the second modality is data obtained by inputting the original data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training the initial neural network model by using a group of samples, the target neural network model comprises a first modality attention network model and a second modality attention network model which are obtained by training based on the initial attention model, and a modality consistency model used for maintaining the data consistency between the first modality and the second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, wherein the object characteristic data are obtained in an image object detection mode;
and S206, determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters.
Optionally, in the present embodiment, the first modality may include, but is not limited to, image, text, voice, video, motion capture, and the like. The second modality may include, but is not limited to, images, text, voice, video, motion capture, etc., and the first modality and the second modality are different modalities, for example, the first modality is images and the second modality is text, or the first modality is captured images and the second modality is images generated by simulation after motion capture.
Optionally, in this embodiment, the query data in the first modality may include, but is not limited to, a vector obtained by performing feature extraction on the data acquired in the first modality, and may also include, but is not limited to, a hash code generated by the vector obtained by performing feature extraction on the data acquired in the first modality.
Optionally, in this embodiment, the search data in the second modality may include, but is not limited to, a vector obtained by performing feature extraction on the data acquired in the second modality, and may further include, but is not limited to, a hash code generated by the vector obtained by performing feature extraction on the data acquired in the second modality, where the search data set in the second modality is a set composed of a plurality of predetermined search data in the second modality.
Optionally, in this embodiment, the target parameter may include, but is not limited to, the Hamming distance between the hash code corresponding to the query data of the first modality and the hash code corresponding to the retrieval data of the second modality, and the similarity may be expressed by comparing the magnitudes of these Hamming distances. The Hamming distance is negatively correlated with the similarity: the smaller the Hamming distance, the more similar the query data of the first modality and the retrieval data of the second modality are.
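As a minimal illustration of this negative correlation, the following Python sketch (toy data, not from the patent) computes the Hamming distance between a query hash code and two candidate retrieval codes:

```python
import numpy as np

def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Number of positions at which two ±1 hash codes differ."""
    return int(np.sum(code_a != code_b))

# Toy example: a first-modality query code vs. two second-modality codes.
query = np.array([1, -1, 1, 1, -1, 1, -1, -1])
db = np.array([[1, -1, 1, -1, -1, 1, -1, -1],   # distance 1 -> most similar
               [-1, 1, -1, 1, 1, -1, 1, 1]])    # distance 8 -> least similar
dists = np.array([hamming_distance(query, row) for row in db])
print(dists)  # [1 8]; the smaller distance indicates the higher similarity
```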
Optionally, in this embodiment, the target neural network model may include, but is not limited to, one or more attention mechanism configuration-based neural network models, one or more convolutional neural network models, one or more modal consistency models, and may include, but is not limited to, a combination of one or more of the above.
Optionally, in this embodiment, the object feature data may include, but is not limited to, object feature data extracted from an image acquired by an image acquisition device through an image detection algorithm.
According to this embodiment, query data of a first modality are obtained; a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality is determined to obtain a plurality of target parameters; and one or more pieces of retrieval data of the second modality are determined as target data corresponding to the query data of the first modality according to the target parameters. Object feature data are used as a bridge to effectively associate the first modality with the second modality, which alleviates the semantic gap between different modalities, solves the technical problems that cross-modal data processing is difficult to realize effectively in the related art and that cross-modal data processing methods perform poorly, improves the efficiency of cross-modal data processing, and optimizes cross-modal data processing performance.
In an optional embodiment, the method further comprises: before the query data of the first modality are acquired, acquiring a cross-modal data set, wherein the cross-modal data set comprises a training data set and a test data set; training an initial neural network model using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modality consistency model, and the target neural network model is used for learning a hash representation for data of different modalities; inputting the test data set into the target neural network model to obtain a similarity indicating how similar first-modality data and second-modality data are; and determining predetermined parameters in the initial neural network model based on the similarity so as to update the target neural network model.
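To make this training-and-evaluation loop concrete, the following sketch assumes a PyTorch model object exposing an image branch, a text branch, and a combined loss; these names are illustrative and not part of the original disclosure:

```python
import torch

def fit_and_evaluate(model, train_loader, test_loader, epochs=50, lr=1e-4):
    """Hypothetical train/evaluate cycle for the model described above.

    `model.loss`, `model.image_branch`, and `model.text_branch` are assumed
    interfaces, not names from the patent.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for img_feat, txt_feat, obj_feat in train_loader:
            optimizer.zero_grad()
            # Combined objective: both attention branches plus consistency.
            loss = model.loss(img_feat, txt_feat, obj_feat)
            loss.backward()          # end-to-end back-propagation
            optimizer.step()
    # Evaluation: similarity between the two modalities' hash codes.
    model.eval()
    sims = []
    with torch.no_grad():
        for img_feat, txt_feat, _ in test_loader:
            b_img = torch.sign(model.image_branch(img_feat))  # {-1,+1}^k
            b_txt = torch.sign(model.text_branch(txt_feat))   # {-1,+1}^k
            # For ±1 codes, the inner product rises as Hamming distance falls.
            sims.append((b_img * b_txt).sum(dim=1))
    return torch.cat(sims)
```

The inner product of two ±1 codes decreases monotonically as their Hamming distance grows, which is why it can stand in for the similarity check here.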
Optionally, in this embodiment, the hash representation may include, but is not limited to, establishing a valid semantic association between the image modality data and the text modality data, learning unified hash representations $B^{(v)} \in \{-1,+1\}^{n \times k}$ and $B^{(t)} \in \{-1,+1\}^{n \times k}$ for the image-modality data and the text-modality data, and further performing cross-modal data processing using the unified hash representations.
Optionally, in this embodiment, taking the first modality as an image modality and the second modality as a text modality as an example, the media types contained in the cross-modal data set are images and text. For images, pixel features are used as the original input features in the network, and a convolutional neural network structure based on VGGNet-19 is used as the feature extractor; the method also supports other convolutional neural network structures for image feature extraction. For text, Word Embedding vectors are used as the original input features, and a Long Short-Term Memory (LSTM) neural network is used as the feature extractor.
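The two feature extractors described above can be sketched in PyTorch as follows; the patent provides no code, so the embedding and hidden sizes here, and the choice of pretrained weights, are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """VGG-19 truncated after the last pooling layer, as described above."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features  # ends with the final max-pooling layer
    def forward(self, images):                    # images: (B, 3, 224, 224)
        fmap = self.features(images)              # (B, 512, 7, 7)
        # Flatten the 7x7 grid into R = 49 image-partition features.
        return fmap.flatten(2).transpose(1, 2)    # (B, 49, 512)

class TextFeatureExtractor(nn.Module):
    """Word embeddings fed through an LSTM, as described above."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
    def forward(self, token_ids):                 # token_ids: (B, seq_len)
        out, _ = self.lstm(self.embed(token_ids))
        return out                                # (B, seq_len, hidden_dim)
```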
In an alternative embodiment, acquiring a cross-modality data set includes: extracting a feature data set of a first modality using a convolutional neural network; extracting a feature data set of a second modality by using a long-short term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; determining the feature data set of the first modality and the feature data set of the second modality except the training data set as the test data set.
Optionally, in this embodiment, taking the first modality as an image modality and the second modality as a text modality as an example, $V = \{v_i\}_{i=1}^{n}$ may be used to represent the set of pixel feature vectors of n objects in the image modality, where $v_i$ represents the pixel feature vector of the ith object in the image modality. Let $T = \{t_i\}_{i=1}^{n}$ represent the feature vectors of the n objects in the text modality, where $t_i$ represents the feature vector of the ith object in the text modality. The class label vectors of the n objects (corresponding to the aforementioned object feature data) are denoted $Y = \{y_i\}_{i=1}^{n}$, where $y_i \in \{0,1\}^{c}$ and c represents the number of object classes. For the vector $y_i$: if the ith object belongs to the kth class, the kth element of $y_i$ is set to 1; otherwise it is set to 0. After applying the object attention model, for the ith object, let $f(v_i;\theta_v)$ represent its output feature in the image modality, where $\theta_v$ denotes the unknown parameters of the image modality; let $g(t_i;\theta_t)$ represent its output feature in the text modality, where $\theta_t$ denotes the unknown parameters of the text modality. The feature vector sets V and T constitute the training data set.

Optionally, in this embodiment, let $v_q$ denote the feature vector of a query sample of the image modality and $t_q$ the feature vector of a query sample of the text modality; the set of feature vectors of the image-modality samples in the test data set is denoted $V_{te} = \{v_i^{te}\}_{i=1}^{n_{te}}$, and the set of feature vectors of the text-modality samples in the test data set is $T_{te} = \{t_i^{te}\}_{i=1}^{n_{te}}$, where $n_{te}$ represents the number of samples in the test data set.
In an alternative embodiment, training an initial neural network model using the training data set to obtain a target neural network model comprises: inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model; determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model; determining the target object attention neural network model and the target modality consistency model as the target neural network model.
Optionally, in this embodiment, for example, the first modality is an image modality, and the second modality is a text modality, the initial neural network model includes, but is not limited to, a subject attention network model and a modality consistency model. The object attention network model comprises: image attention networks and text attention networks. The object attention network model takes the characteristics of the image object obtained by target detection as a link, and uses the image attention network and the text attention network to connect the image mode and the text mode from high-level semantics, so that the retrieval accuracy of Hash expression is improved. In the process of learning the Hash expression of the cross-modal data, the modal consistency model enables the learned Hash expression to keep the inter-modal and intra-modal consistency of the original cross-modal data, so that the neighbor relation of the Hash expression is constrained in the original neighbor topological framework, the original semantic relation of different modal data is maintained, and the retrieval precision is improved.
It should be noted that, taking the first modality as an image modality and the second modality as a text modality as an example, the object attention network model includes an image attention network and a text attention network, and has three inputs: image modality data, text modality data, and image object data obtained by target detection. The image modality data are trained using the image attention network on the basis of the transferred knowledge; the text modality data are trained using the text attention network. Taking the features of the image objects obtained by target detection as a link, the object attention network model semantically associates the image modality and the text modality at a high level through the image attention network and the text attention network.
In an optional embodiment, inputting feature data of a first modality in the training data set and the subject feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and inputting feature data of a second modality in the training data set and the subject feature data into the second modality attention network model for training to obtain a trained second modality attention network model, including: under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.
Optionally, in this embodiment, the image attention network migrates knowledge learned from the ImageNet classification task to VGGNet-19 and takes the output of the last pooling layer of VGGNet-19, $I = \{I_j\}_{j=1}^{R}$, as the input feature of the image attention model, where R is the number of image partitions. The image attention model (corresponding to the aforementioned first modality attention network model) inputs the feature $O_m$ of each object and the image partition features $I_j$ into a single-layer neural network and generates the attention distribution of the image over the different image partitions with a softmax function (corresponding to the aforementioned first preset function), namely:

$$p_j^{(v)} = \operatorname{softmax}\big(W_v\,[I_j;\, O_m]\big)$$

where ";" denotes the concatenation of vectors and $p_j^{(v)}$ represents the attention probability of each image partition given the object feature $O_m$. Based on the attention distribution $p^{(v)}$ and $I_j$, the corresponding new feature vector is $\tilde{v} = \sum_{j=1}^{R} p_j^{(v)} I_j$ (corresponding to the first target feature vector described earlier).
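A minimal PyTorch sketch of this partition-attention step follows; the layer sizes are illustrative assumptions, and the text branch described next is analogous, with the LSTM states $H_i$ taking the place of the partition features $I_j$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartitionAttention(nn.Module):
    """Single-layer network plus softmax over R image partitions."""
    def __init__(self, part_dim, obj_dim):
        super().__init__()
        self.score = nn.Linear(part_dim + obj_dim, 1)  # single-layer network
    def forward(self, partitions, obj_feat):
        # partitions: (B, R, part_dim); obj_feat: (B, obj_dim)
        obj = obj_feat.unsqueeze(1).expand(-1, partitions.size(1), -1)
        concat = torch.cat([partitions, obj], dim=-1)         # "[I_j; O_m]"
        p = F.softmax(self.score(concat).squeeze(-1), dim=1)  # (B, R)
        # Attention-weighted sum of partition features -> new feature vector.
        return (p.unsqueeze(-1) * partitions).sum(dim=1)      # (B, part_dim)
```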
Optionally, in this embodiment, the text attention network (corresponding to the aforementioned second modality attention network model) extracts text features $H = \{H_i\}$ using an LSTM (Long Short-Term Memory) network, and by using the formulas

$$p_i^{(t)} = \operatorname{softmax}\big(W_t\,[H_i;\, O_m]\big), \qquad \tilde{t} = \sum_i p_i^{(t)} H_i$$

(corresponding to the second predetermined function described above), the new feature vector $\tilde{t}$ corresponding to $H_i$ (corresponding to the second target feature vector described previously) can be obtained.
In addition, let $F = [f(v_1;\theta_v), \ldots, f(v_n;\theta_v)]$ represent the output features of the n objects after the image modality is processed by two fully-connected layers, where $\theta_v$ denotes the unknown parameters of the image modality (corresponding to the aforementioned first preset parameters); let $G = [g(t_1;\theta_t), \ldots, g(t_n;\theta_t)]$ represent the output features of the n objects after the text modality is processed by two fully-connected layers, where $\theta_t$ denotes the unknown parameters of the text modality (corresponding to the aforementioned second preset parameters).
Suppose that the features $f(v_i;\theta_v)$ and $g(t_i;\theta_t)$ of the ith object in the image modality and the text modality generate the hash codes $b_i^{(v)}$ and $b_i^{(t)}$ in Hamming space, respectively, where $b_i^{(v)}, b_i^{(t)} \in \{-1,+1\}^{k}$ are k-bit hash codes consisting of -1 and +1. Cross-modal hash learning can then be performed by optimizing a loss function defined over these hash codes and the matrices F and G, whose ith columns are $f(v_i;\theta_v)$ and $g(t_i;\theta_t)$, respectively.
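The sketch below illustrates the code-generation step that feeds such a loss, under the standard deep-hashing assumption (not spelled out above) that the k-bit ±1 codes are the signs of the outputs of the two fully-connected layers:

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Two fully-connected layers followed by sign(), yielding k-bit ±1 codes.

    A sketch under common deep-hashing assumptions, not the patent's exact
    formulation; `in_dim` and `k` are illustrative.
    """
    def __init__(self, in_dim, k=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                nn.Linear(in_dim, k))
    def forward(self, x):
        real_valued = self.fc(x)        # columns of F (or G) during training
        return torch.sign(real_valued)  # b_i in {-1,+1}^k at retrieval time
```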
In an optional embodiment, constraining the feature data of the first modality and the feature data of the second modality using the initial modality consistency model based on the semantic information of the feature data of the first modality and the semantic information of the feature data of the second modality to update the initial modality consistency model to a target modality consistency model includes: performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space; inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.
Optionally, in this embodiment, the modal consistency model is used to make the learned hash representation maintain the inter-modal and intra-modal consistency of the original cross-modal data, so as to constrain the neighbor relations of the hash representation within the original neighbor topology and thereby preserve the original semantic relations of the different modal data. Modal consistency maintenance can be achieved by optimizing the following loss function (corresponding to the target loss function described previously):

$$\min_{B} \ \operatorname{trace}\big(B^{\top} L B\big), \qquad L = D - W,$$

where $d_{ii}$ denotes the ith diagonal element of the diagonal matrix D, $w_{ij}$ is the element in the ith row and jth column of the matrix W, $B = [b_1, b_2, \ldots, b_n]^{\top} \in \{-1,+1\}^{n \times k}$, and $\operatorname{trace}(\cdot)$ represents the trace of the matrix; here $b_i^{(v)}$ is the first hash code and $b_i^{(t)}$ is the second hash code described above. The entries of W are built from the Mahalanobis distance and the Euclidean distance between a data point of the image modality and a data point of the text modality, with $\lambda$ and $\beta$ as distance-measurement balance factors. When the ith data point of the image modality and the jth data point of the text modality have the same semantic label, let $C_{ij} = 1$; otherwise, let $C_{ij} = 0$. Because the network architecture of the method is an end-to-end structure, the image attention network and the text attention network can be jointly trained through a back-propagation algorithm.
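A compact sketch of this consistency objective is given below; it assumes, as in the usual graph-Laplacian construction, that the diagonal entries of D are the row sums of W, and it leaves the construction of W from the Mahalanobis and Euclidean distances to the caller:

```python
import torch

def consistency_loss(B: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """trace(B^T L B) with L = D - W, as in the modal-consistency objective.

    B: (n, k) relaxed (real-valued) hash codes; W: (n, n) similarity weights,
    assumed precomputed from the distances and balance factors above.
    """
    D = torch.diag(W.sum(dim=1))  # d_ii: row sums of W on the diagonal
    L = D - W                     # graph Laplacian
    return torch.trace(B.t() @ L @ B)
```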
In an optional embodiment, determining one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters includes: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to the plurality of pieces of retrieval data of the second modality in the retrieval data set of the second modality; calculating the Hamming distance between the third hash code and each hash code in the group of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes whose Hamming distance is smaller than or equal to a preset threshold as the one or more pieces of retrieval data of the second modality corresponding to the query data of the first modality.
Optionally, in this embodiment, since the object attention network model has been trained in step (2), based on the trained model parameters, the query data of one modality in a given test data set are subjected to a single forward-propagation computation through the network, so that a unified hash representation of the query data can be obtained. The similarity of cross-modal data depends on the Hamming distance between the unified hash representations in Hamming space: the smaller the Hamming distance between the unified hash representations, the greater the similarity. In this embodiment, the similarity between image modality data and text modality data is reflected by calculating the Hamming distance between their unified hash representations.
Optionally, in this embodiment, one mode in the cross-mode test data set is used as a query data set, and the other mode is used as a retrieval data set, and cross-mode hash retrieval is performed to obtain a final retrieval result according to the similarity between the query data and the retrieval data.
The invention is further explained below with reference to specific examples:
aiming at the defects of the prior art, the invention provides a cross-modal Hash retrieval method based on an attention model, which can unify an object attention model and a modal consistency keeping model in a network architecture and realize the effective association of cross-modal data at a high-level semantic level. The method takes object features obtained by target detection as links, uses an image attention network and a text attention network to connect image modalities and text modalities from high-level semantics, restrains the neighbor relation of Hash representation in an original neighbor topological framework through a modality consistency model, maintains the original semantic relation of different modality data, enables the obtained Hash representation to be more suitable for cross-modality retrieval tasks, and improves the accuracy of cross-modality Hash retrieval.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a cross-modal hash retrieval method based on an attention model, which is used for learning a unified hash representation of data in different modalities in a unified network architecture, so as to implement cross-modal retrieval, fig. 3 is a flowchart of an optional cross-modal data processing method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:
s302, constructing a cross-modal data set, and dividing data in the cross-modal data set into a training data set and a testing data set;
s304, training a cross-modal Hash learning model based on an attention model by using training data in a cross-modal data set, and learning uniform Hash expression for data in different modalities;
s306, obtaining uniform Hash expression of test data in the cross-modal data set by using the trained cross-modal Hash learning model parameters based on the attention model, and further calculating the similarity of the cross-modal data;
and S308, performing cross-modal Hash retrieval by taking one mode in the cross-modal test data set as a query data set and the other mode as a retrieval data set, and obtaining a final retrieval result according to the similarity between the query data and the retrieval data.
Further, a cross-modal hash retrieval method based on an attention model, where the cross-modal data set in step (1) includes two modality types, specifically, an image modality type and a text modality type.
Further, a cross-modal hash retrieval method based on an attention model, wherein the cross-modal hash learning model based on the attention model in the step (2) includes an object attention network model and a modal consistency model fused in a unified network architecture. The object attention network model comprises: image attention networks and text attention networks. The object attention network model takes the characteristics of the image object obtained by target detection as a link, and uses the image attention network and the text attention network to connect the image mode and the text mode from high-level semantics, so that the retrieval accuracy of Hash expression is improved. In the process of learning the Hash expression of the cross-modal data, the modal consistency model enables the learned Hash expression to keep the inter-modal and intra-modal consistency of the original cross-modal data, so that the neighbor relation of the Hash expression is constrained in the original neighbor topological framework, the original semantic relation of different modal data is maintained, and the retrieval precision is improved.
Further, in the attention model-based cross-modal hash retrieval method, the similarity of the cross-modal data in the step (3) depends on the hamming distance between the hash representations, and the smaller the hamming distance is, the greater the similarity is.
Further, in the attention-model-based cross-modal hash retrieval method, the cross-modal hash retrieval in step (4) proceeds as follows: data of one modality are arbitrarily selected from the test data set of step (1) as query samples; similarity to all data of the other modality in the test set is computed according to the cross-modal similarity calculation method of step (3); the data are then sorted by similarity from large to small, and a retrieval result list is returned.
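A minimal NumPy sketch of this ranked retrieval step (toy interface; the patent itself prescribes only sorting by similarity, i.e., by ascending Hamming distance):

```python
import numpy as np

def cross_modal_retrieve(query_code: np.ndarray, db_codes: np.ndarray):
    """Rank second-modality items by Hamming distance to the query code.

    query_code: (k,) ±1 hash code of the query sample.
    db_codes:   (n, k) ±1 hash codes of the retrieval set.
    Returns indices sorted from most to least similar, plus the distances.
    """
    dists = (query_code != db_codes).sum(axis=1)  # Hamming distance per item
    order = np.argsort(dists)                     # ascending distance equals
    return order, dists[order]                    # descending similarity
```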
The invention has the following effects: compared with the existing method, the method can unify the object attention model and the modal consistency keeping model in one network architecture, and realize the effective association of cross-modal data in a high-level semantic level. In the process of uniform Hash expression learning of different modal data, the method utilizes an attention model to mine the semantic association of the multi-modal data, and utilizes a modal consistency model to keep the semantic association of the multi-modal data, thereby improving the accuracy of cross-modal retrieval.
The reason why the method achieves the above effect is as follows: the method unifies an object attention model and a modality consistency maintenance model in one network architecture. The attention model uses the object features obtained by target detection as a link, and uses an image attention network and a text attention network to semantically associate the image modality and the text modality at a high level. The modality consistency model constrains the neighbor relations of the hash representation within the original neighbor topology, so that the original semantic relations of different modality data are maintained. The end-to-end network architecture formed by these two submodels fully mines the semantic associations of data in different modalities, fully maintains their original semantic associations, promotes the learning of a unified hash representation for multimodal data, and improves the accuracy of cross-modal retrieval.
The attention model-based cross-modal hash retrieval method specifically includes, but is not limited to, the following:
(1) and constructing a cross-modal data set, and simultaneously dividing data in the cross-modal data set into a training data set and a testing data set.
In this embodiment, the media types contained in the cross-modal data set are images and text. For images, pixel features are used as the original input features in the network, and a convolutional neural network structure based on VGGNet-19 is used as the feature extractor; the method also supports other convolutional neural network structures for image feature extraction. For text, Word Embedding vectors are used as the original input features, and a Long Short-Term Memory (LSTM) neural network is used as the feature extractor.
Let $V = \{v_i\}_{i=1}^{n}$ represent the set of pixel feature vectors of n objects in the image modality, where $v_i$ represents the pixel feature vector of the ith object in the image modality. Let $T = \{t_i\}_{i=1}^{n}$ represent the feature vectors of the n objects in the text modality, where $t_i$ represents the feature vector of the ith object in the text modality. The class label vectors of the n objects are denoted $Y = \{y_i\}_{i=1}^{n}$, where $y_i \in \{0,1\}^{c}$ and c represents the number of object classes. For the vector $y_i$: if the ith object belongs to the kth class, the kth element of $y_i$ is set to 1; otherwise it is set to 0. After applying the object attention model, for the ith object, let $f(v_i;\theta_v)$ represent its output feature in the image modality, where $\theta_v$ denotes the unknown parameters of the image modality; let $g(t_i;\theta_t)$ represent its output feature in the text modality, where $\theta_t$ denotes the unknown parameters of the text modality. The feature vector sets V and T constitute the training data set. Let $v_q$ denote the feature vector of a query sample of the image modality and $t_q$ the feature vector of a query sample of the text modality; the set of feature vectors of the image-modality samples in the test data set is denoted $V_{te} = \{v_i^{te}\}_{i=1}^{n_{te}}$, and the set of feature vectors of the text-modality samples in the test data set is $T_{te} = \{t_i^{te}\}_{i=1}^{n_{te}}$, where $n_{te}$ represents the number of samples in the test data set.
The goal of learning is to establish an effective semantic association between image modality data and text modality data by fusing an object attention model and a modality consistency model, to learn unified hash representations $B^{(v)} \in \{-1,+1\}^{n \times k}$ and $B^{(t)} \in \{-1,+1\}^{n \times k}$ for the image-modality data and the text-modality data, and further to perform the cross-modal retrieval task using the unified hash representations.
(2) Training an attention model-based cross-modal hash learning model using training data in a cross-modal dataset for learning a uniform hash representation for data of different modalities.
The network structure constructed in this step is shown in fig. 4. The attention-model-based cross-modal hash learning model of the present invention includes an object attention network model and a modality consistency model fused in a unified network architecture. The object attention network model comprises the image attention network 402 and the text attention network 404, and has three inputs: image modality data 406, text modality data 408, and image object data 410 obtained by target detection. The image modality data are trained using the image attention network on the basis of the transferred knowledge; the text modality data are trained using the text attention network. Taking the features of the image objects obtained by target detection as a link, the object attention network model semantically associates the image modality and the text modality at a high level through the image attention network and the text attention network.
In this embodiment, the image attention network transfers knowledge learned on the ImageNet classification task to VGGNet-19 and takes the output of the last pooling layer of VGGNet-19, $I = \{I_1, I_2, \ldots, I_R\}$, as the input features of the image attention model, where $R$ is the number of image partitions. The image attention model takes the feature $O_m$ of each object together with the image partition features $I_j$ ($j = 1, \ldots, R$) as input and computes the attention distribution of the image over the different image partitions with a softmax function, namely:

$$z_{m,j} = \tanh\left(W_z [O_m; I_j] + b_z\right),$$
$$p_{m,j} = \operatorname{softmax}\left(w_p^\top z_{m,j} + b_p\right),$$

where ";" denotes the concatenation of vectors, $W_z$, $b_z$, $w_p$, and $b_p$ are learnable parameters, and $p_{m,j}$ represents the attention probability of each image partition given the object feature $O_m$. Based on the attention distribution, the new feature vector corresponding to $I_j$ is $\tilde{I}_j = p_{m,j} I_j$.
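A minimal PyTorch sketch of this attention step, following the reconstructed formulation above; the scoring network and layer sizes are assumptions, since the exact form is given in the source only as figures:

```python
import torch
import torch.nn as nn

class ImagePartitionAttention(nn.Module):
    """Attention over R image partitions I_j, conditioned on an object feature O_m."""
    def __init__(self, obj_dim=512, part_dim=512, hid_dim=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(obj_dim + part_dim, hid_dim),  # acts on the concatenation [O_m; I_j]
            nn.Tanh(),
            nn.Linear(hid_dim, 1),
        )

    def forward(self, obj, parts):      # obj: (B, obj_dim), parts: (B, R, part_dim)
        R = parts.size(1)
        obj_rep = obj.unsqueeze(1).expand(-1, R, -1)
        scores = self.score(torch.cat([obj_rep, parts], dim=-1)).squeeze(-1)  # (B, R)
        p = torch.softmax(scores, dim=1)            # attention over the partitions
        return p.unsqueeze(-1) * parts, p           # re-weighted features, distribution
```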
The text attention network extracts the text features $H = \{H_1, H_2, \ldots, H_S\}$ with the LSTM (Long Short-Term Memory) network and, using formulas analogous to the two above, obtains the new feature vector $\tilde{H}_i$ corresponding to $H_i$.
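The text side can reuse the same module over the per-step LSTM features, as in this usage sketch (it builds on the ImagePartitionAttention class from the previous sketch; the tensor sizes are illustrative):

```python
import torch

# Illustrative usage for the text modality, reusing ImagePartitionAttention
# from the sketch above; the S = 12 LSTM steps play the role of the R partitions.
lstm_feats = torch.randn(2, 12, 512)   # per-step LSTM features H_i (B=2, S=12)
obj_feat = torch.randn(2, 512)         # object feature O_m from target detection
attn = ImagePartitionAttention(obj_dim=512, part_dim=512)
new_feats, p = attn(obj_feat, lstm_feats)  # re-weighted text features and attention
```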
Let $F = [f(v_1; \theta_v), f(v_2; \theta_v), \ldots, f(v_n; \theta_v)]$ denote the output features of the $n$ objects after the image modality has been processed by two fully-connected layers, where $\theta_v$ denotes the unknown parameters of the image modality; let $G = [g(t_1; \theta_t), g(t_2; \theta_t), \ldots, g(t_n; \theta_t)]$ denote the output features of the $n$ objects after the text modality has been processed by two fully-connected layers, where $\theta_t$ denotes the unknown parameters of the text modality.
Suppose the features $f(v_i; \theta_v)$ and $g(t_i; \theta_t)$ of the $i$-th object in the image modality and the text modality generate the hash codes $b_i^v = \operatorname{sgn}(f(v_i; \theta_v))$ and $b_i^t = \operatorname{sgn}(g(t_i; \theta_t))$ in Hamming space, respectively. Then the following loss function can be optimized:

$$J_1 = \left\| B^v - F \right\|_F^2 + \left\| B^t - G \right\|_F^2,$$

where $b_i^v$ and $b_i^t$ are $k$-bit hash codes consisting of $-1$ and $+1$, $B^v$ and $B^t$ are the matrices whose $i$-th columns are $b_i^v$ and $b_i^t$, and the $i$-th columns of the matrices $F$ and $G$ are $f(v_i; \theta_v)$ and $g(t_i; \theta_t)$, respectively.
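A small NumPy sketch of the hash-code generation and of a quantization-style loss of the kind described above; the source gives the exact loss only as a figure, so the shared-binary-code form below is an assumed stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 16                        # n training objects, k-bit hash codes
F = rng.standard_normal((k, n))     # column i = f(v_i; theta_v), image branch output
G = rng.standard_normal((k, n))     # column i = g(t_i; theta_t), text branch output

B_v = np.sign(F)                    # hash codes b_i^v in {-1, +1}^k
B_t = np.sign(G)                    # hash codes b_i^t in {-1, +1}^k

# Quantization-style objective pulling both real-valued outputs toward a
# shared binary code B (assumed stand-in for the loss in the source):
B = np.sign(F + G)
J1 = np.linalg.norm(B - F) ** 2 + np.linalg.norm(B - G) ** 2
print(B_v.shape, B_t.shape, round(J1, 3))
```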
In this embodiment, the modality consistency model 412 is used to make the learned hash representation preserve the inter-modality and intra-modality consistency of the original cross-modal data: it constrains the neighbor relations of the hash representation within the original neighbor topology so as to preserve the original semantic relations of the data in the different modalities. Modality consistency preservation can be achieved by optimizing the following loss function:

$$J_2 = \operatorname{trace}(B^\top L B) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left\| b_i - b_j \right\|^2,$$

where $L = D - W$, $D_{ii} = \sum_j w_{ij}$ denotes the $i$-th diagonal element of the diagonal matrix $D$, $w_{ij}$ is the element in the $i$-th row and $j$-th column of the matrix $W$, $B = [b_1, b_2, \ldots, b_n]^\top \in \{-1,+1\}^{n \times k}$, and $\operatorname{trace}(\cdot)$ denotes the trace of a matrix. The weight $w_{ij}$ is built from the Mahalanobis distance and the Euclidean distance between a data point of the image modality and a data point of the text modality, with $\lambda$ and $\beta$ as distance-measure balance factors. When the $i$-th data point of the image modality and the $j$-th data point of the text modality have the same semantic label, $C_{ij} = 1$; otherwise, $C_{ij} = 0$. Because the network architecture of the method is an end-to-end structure, the image attention network and the text attention network can be trained jointly by the back-propagation algorithm.
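A NumPy sketch of the modality-consistency term; the Laplacian identity follows the symbols defined above, while the simple symmetrized choice of $W$ stands in for the Mahalanobis/Euclidean weighting, whose exact form is not recoverable from the source:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 16
B = np.sign(rng.standard_normal((n, k)))    # rows b_i in {-1, +1}^k

# C_ij = 1 iff the i-th image point and the j-th text point share a semantic label
labels_img = rng.integers(0, 3, size=n)
labels_txt = rng.integers(0, 3, size=n)
C = (labels_img[:, None] == labels_txt[None, :]).astype(float)

W = (C + C.T) / 2                           # symmetrized stand-in for the full weighting
D = np.diag(W.sum(axis=1))                  # D_ii = sum_j w_ij
L = D - W                                   # graph Laplacian

# trace(B^T L B) = 1/2 * sum_ij w_ij * ||b_i - b_j||^2 (neighbor preservation)
J2 = np.trace(B.T @ L @ B)
print(J2)
```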
(3) Obtaining the unified hash representation of the test data in the cross-modal data set by using the trained parameters of the attention-model-based cross-modal hash learning model, and then computing the similarity of the cross-modal data.
After the object attention network model has been trained in step (2), given data of one modality in the test data set, a single forward-propagation pass through the network based on the trained model parameters yields the unified hash representation of the data. The similarity of cross-modal data depends on the Hamming distance between their unified hash representations in Hamming space: the smaller the Hamming distance, the greater the similarity. In this embodiment, the similarity between image-modality data and text-modality data is reflected by computing the Hamming distance between their unified hash representations.
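A minimal sketch of this retrieval step: Hamming distances between unified hash codes, ranked by increasing distance (the codes here are random placeholders):

```python
import numpy as np

def hamming_distance(b1, b2):
    """Hamming distance between {-1,+1} hash codes: the number of differing bits."""
    return int(np.sum(b1 != b2))

rng = np.random.default_rng(2)
k, n_db = 16, 5
query = np.sign(rng.standard_normal(k))             # unified hash code of the query
database = np.sign(rng.standard_normal((n_db, k)))  # codes of the retrieval set

dists = np.array([hamming_distance(query, b) for b in database])
ranking = np.argsort(dists)   # smaller Hamming distance = greater similarity
print(ranking, dists[ranking])
```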
(4) Taking one modality in the cross-modal test data set as the query data set and the other modality as the retrieval data set, performing cross-modal hash retrieval, and obtaining the final retrieval result according to the similarity between the query data and the retrieval data.
The experimental results below show that, compared with existing methods, the attention-model-based cross-modal hash retrieval method achieves higher retrieval accuracy.
The following describes the advantageous effects of the present invention with reference to specific experiments.
This example performed experiments on the Pascal VOC 2007 dataset, which contains 9963 labeled images from 20 categories. The dataset was divided into a training set of 5011 image-label pairs and a test set of 4952 image-label pairs. The image modality used raw pixel features as input; the text modality used word-embedding vectors as input. The experiments comprised two cross-modal retrieval tasks: retrieving texts with image queries and retrieving images with text queries. The reported results are averages over 10 random runs. The following three methods were used for experimental comparison:
Existing method 1: the semantic deep cross-modal hashing method in the document "Semantic Deep Cross-modal Hashing" (by Q. Lin, W. Cao, Z. He, and Z. He), which improves the feature-learning part by constructing a semantic label branch so that the learned features preserve semantic information.
Existing method 2: the deep joint-semantics reconstructing hashing method in the document "Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval" (by S. Su, Z. Zhong, and C. Zhang); the joint-semantics affinity matrix constructed by this method skillfully fuses original neighbor information from the different modalities.
Existing method 3: the deep multiscale fusion hashing method in the document "Deep Multiscale Fusion Hashing for Cross-Modal Retrieval" (by X. Nie, B. Wang, J. Li, F. Hao, M. Jian, and Y. Yin), which first designs different network branches for the two modalities and then adopts a multiscale fusion model on each branch network to fuse semantics at multiple scales and better mine semantic relevance.
In the experiments, the accuracy of cross-modal retrieval is evaluated with the MAP (Mean Average Precision) value commonly used in the information retrieval field; the larger the MAP value, the better the cross-modal retrieval result.
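For reference, a small NumPy sketch of how MAP is computed from ranked retrieval results (the relevance lists are toy placeholders):

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked result list; `relevant` is a 0/1 array in rank order."""
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)
    ranks = np.nonzero(relevant)[0] + 1          # 1-based ranks of the relevant hits
    return float(np.mean(hits[relevant == 1] / ranks))

# MAP is the mean of the per-query average precisions:
rankings = [np.array([1, 0, 1, 1, 0]), np.array([0, 1, 0, 0, 1])]
map_value = np.mean([average_precision(r) for r in rankings])
print(round(float(map_value), 4))
```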
Table 1. Experimental results (MAP) of the present invention and the existing methods.
As can be seen from Table 1, compared with the existing methods, the present invention improves the retrieval accuracy on both tasks: retrieving texts with image queries and retrieving images with text queries. Existing method 1 uses semantic labels to keep the semantic associations of the original multi-modal data in the learned features, but its mining of high-level semantic associations among the multi-modal data is insufficient. Existing methods 2 and 3 focus on fusing semantic information of different modalities at different levels, but do not sufficiently consider the original semantic associations of the multi-modal data. The present invention unifies the object attention model and the modality consistency model in an end-to-end network architecture. Using the features of the image objects obtained by target detection as the link, the object attention network model semantically associates the image modality and the text modality at a high level through the image attention network and the text attention network. The modality consistency model constrains the neighbor relations of the hash representation within the original neighbor topology so as to preserve the original semantic relations of the data in the different modalities. Together, the object attention model and the modality consistency model fully mine and preserve the high-level semantic information among the multi-modal data, promote the learning of the unified hash representation of the multi-modal data, and improve the accuracy of cross-modal retrieval.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though the former is in many cases the preferred implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) that includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a cross-mode data processing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description of the apparatus is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 5 is a block diagram of an alternative cross-modality data processing apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus including:
an obtaining module 502, configured to obtain query data in a first modality;
a processing module 504, configured to determine a target parameter between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, to obtain a plurality of target parameters, where the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs, the target neural network model includes a first-modality attention network model and a second-modality attention network model trained on the basis of an initial attention model, and a modality consistency model used to maintain data consistency between the first modality and the second modality, and each sample pair in the set of sample pairs includes sample data and object feature data, the object feature data being obtained by means of image object detection;
a determining module 506, configured to determine one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
In an optional embodiment, the apparatus is further configured to: before the query data of the first modality is acquired, acquire a cross-modal dataset, wherein the cross-modal dataset comprises a training data set and a test data set; train an initial neural network model using the training data set to obtain the target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed on the basis of an attention mechanism and an initial modality consistency model, and the target neural network model is used for learning a unified hash representation for data of different modalities; input the test data set into the target neural network model to obtain the similarity of first-modality data and second-modality data, wherein the similarity is used for indicating how similar the first-modality data and the second-modality data are; and determine predetermined parameters in the initial neural network model based on the similarity so as to update the target neural network model.
In an alternative embodiment, the apparatus is configured to acquire the cross-modal dataset by: extracting a feature data set of the first modality using a convolutional neural network; extracting a feature data set of the second modality using a long short-term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; and determining the feature data of the first modality and of the second modality outside the training data set as the test data set.
In an alternative embodiment, the apparatus is configured to train the initial neural network model using the training data set to obtain the target neural network model by: inputting the feature data of the first modality in the training data set and the object feature data into the first-modality attention network model for training to obtain the trained first-modality attention network model, and inputting the feature data of the second modality in the training data set and the object feature data into the second-modality attention network model for training to obtain the trained second-modality attention network model; determining the trained first-modality attention network model and the trained second-modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on the semantic information of the feature data of the first modality and the semantic information of the feature data of the second modality using the initial modality consistency model, so as to update the initial modality consistency model to a target modality consistency model; and determining the target object attention neural network model and the target modality consistency model as the target neural network model.
In an optional embodiment, the apparatus is configured to input the feature data of the first modality in the training data set and the object feature data into the first-modality attention network model for training to obtain the trained first-modality attention network model, and to input the feature data of the second modality in the training data set and the object feature data into the second-modality attention network model for training to obtain the trained second-modality attention network model, as follows: in the case where the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution from the object feature data and the image data using a first preset function, and generating a second target attention distribution from the object feature data and the text data using a second preset function; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first-modality attention network model, so as to obtain the trained first-modality attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second-modality attention network model, so as to obtain the trained second-modality attention network model.
In an alternative embodiment, the apparatus is configured to use the initial modality consistency model to constrain the feature data of the first modality and the feature data of the second modality, based on the semantic information of both, so as to update the initial modality consistency model to the target modality consistency model, by: performing target processing on the first target feature vector to obtain a first hash code in Hamming space, and performing the target processing on the second target feature vector to obtain a second hash code in Hamming space; and inputting the first hash code and the second hash code into a target loss function to update the initial modality consistency model to the target modality consistency model.
In an alternative embodiment, the apparatus is configured to determine one or more pieces of retrieval data of the second modality as the target data corresponding to the query data of the first modality according to the plurality of target parameters by: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to the plurality of pieces of retrieval data of the second modality in the retrieval data set of the second modality; calculating the Hamming distance between the third hash code and each hash code of the group of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes whose Hamming distance is smaller than or equal to a preset threshold as the one or more pieces of retrieval data of the second modality corresponding to the query data of the first modality.
It should be noted that the above modules may be implemented by software or by hardware; for the latter, this may be achieved in, but is not limited to, the following manner: the modules are all located in the same processor, or the modules are located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
In the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring query data of a first modality;
s2, determining a target parameter between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs, the target neural network model comprises a first-modality attention network model and a second-modality attention network model trained on the basis of an initial attention model, and a modality consistency model used for maintaining data consistency between the first modality and the second modality, and each sample pair in the set of sample pairs comprises sample data and object feature data, the object feature data being obtained by means of image object detection;
s3, determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring query data of a first modality;
s2, determining a target parameter between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs, the target neural network model comprises a first-modality attention network model and a second-modality attention network model trained on the basis of an initial attention model, and a modality consistency model used for maintaining data consistency between the first modality and the second modality, and each sample pair in the set of sample pairs comprises sample data and object feature data, the object feature data being obtained by means of image object detection;
s3, determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A cross-modal data processing method, characterized by comprising:
acquiring query data of a first modality;
determining a target parameter between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs, the target neural network model comprises a first-modality attention network model and a second-modality attention network model trained on the basis of an initial attention model, and a modality consistency model used for maintaining data consistency between the first modality and the second modality, and each sample pair in the set of sample pairs comprises sample data and object feature data, the object feature data being obtained by means of image object detection; and
determining one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
2. The method of claim 1, wherein prior to acquiring query data of the first modality, the method further comprises:
acquiring a cross-modal dataset, wherein the cross-modal dataset comprises a training dataset and a testing dataset;
training an initial neural network model by using the training data set to obtain the target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed on the basis of an attention mechanism and an initial modality consistency model, and the target neural network model is used for learning a unified hash representation for data of different modalities;
inputting the test data set into the target neural network model to obtain similarity of first modality data and second modality data, wherein the similarity is used for indicating the similarity between the first modality data and the second modality data;
determining predetermined parameters in the initial neural network model based on the similarity to update the target neural network model.
3. The method of claim 2, wherein acquiring a cross-modality data set comprises:
extracting a feature data set of a first modality using a convolutional neural network;
extracting a feature data set of a second modality by using a long-short term memory neural network;
determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set;
determining the feature data of the first modality and the feature data of the second modality other than the training data set as the test data set.
4. The method of claim 2, wherein training an initial neural network model using the training data set to obtain a target neural network model comprises:
inputting the feature data of the first modality in the training data set and the object feature data into the first-modality attention network model for training to obtain the trained first-modality attention network model, and inputting the feature data of the second modality in the training data set and the object feature data into the second-modality attention network model for training to obtain the trained second-modality attention network model;
determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model;
constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model;
determining the target object attention neural network model and the target modality consistency model as the target neural network model.
5. The method according to claim 4, wherein inputting the feature data of the first modality in the training data set and the object feature data into the first-modality attention network model for training to obtain the trained first-modality attention network model, and inputting the feature data of the second modality in the training data set and the object feature data into the second-modality attention network model for training to obtain the trained second-modality attention network model, comprises:
under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data;
determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model;
and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.
6. The method according to claim 5, wherein constraining the feature data of the first modality and the feature data of the second modality using the initial modality consistency model based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality to update the initial modality consistency model to a target modality consistency model comprises:
performing target processing on the first target feature vector to obtain a first hash code in Hamming space, and performing the target processing on the second target feature vector to obtain a second hash code in Hamming space;
inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.
7. The method of claim 1, wherein determining one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters comprises:
determining a third hash code corresponding to the query data of the first modality;
querying a group of hash codes corresponding to a plurality of pieces of retrieval data in the second modality in the retrieval data set in the second modality;
calculating a hamming distance for each hash code of the third hash code and the set of hash codes;
and determining the retrieval data of the second modality corresponding to the hash codes whose Hamming distance is smaller than or equal to a preset threshold as the one or more pieces of retrieval data of the second modality corresponding to the query data of the first modality.
8. A cross-modal data processing apparatus, comprising:
the acquisition module is used for acquiring query data of a first modality;
a processing module, configured to determine a target parameter between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs, the target neural network model comprises a first-modality attention network model and a second-modality attention network model trained on the basis of an initial attention model, and a modality consistency model used for maintaining data consistency between the first modality and the second modality, and each sample pair in the set of sample pairs comprises sample data and object feature data, the object feature data being obtained by means of image object detection; and
a determining module, configured to determine one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
9. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method as claimed in any of claims 1 to 7 are implemented when the computer program is executed by the processor.
CN202011063096.8A 2020-09-30 2020-09-30 Cross-modal data processing method and device, storage medium and electronic device Active CN112199375B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011063096.8A CN112199375B (en) 2020-09-30 2020-09-30 Cross-modal data processing method and device, storage medium and electronic device
PCT/CN2021/091215 WO2022068196A1 (en) 2020-09-30 2021-04-29 Cross-modal data processing method and device, storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN112199375A true CN112199375A (en) 2021-01-08
CN112199375B CN112199375B (en) 2024-03-01

Country Status (2)

Country Link
CN (1) CN112199375B (en)
WO (1) WO2022068196A1 (en)

Also Published As

Publication number Publication date
CN112199375B (en) 2024-03-01
WO2022068196A1 (en) 2022-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant