CN112199375A

CN112199375A - Cross-modal data processing method and device, storage medium and electronic device

Info

Publication number: CN112199375A
Application number: CN202011063096.8A
Authority: CN
Inventors: 董西伟; 严军荣; 张小龙
Original assignee: Sunwave Communications Co Ltd
Current assignee: Sunwave Communications Co Ltd
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2021-01-08
Anticipated expiration: 2040-09-30
Also published as: CN112199375B; WO2022068196A1

Abstract

The embodiment of the invention provides a cross-modal data processing method, a cross-modal data processing device, a storage medium and an electronic device, wherein the method comprises the following steps: the method comprises the steps of obtaining query data of a first modality, respectively determining a target parameter between the query data of the first modality and each retrieval data of the second modality in a retrieval data set of the second modality to obtain a plurality of target parameters, determining one or more retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters, and effectively associating the first modality with the second modality by using object characteristic data as a bridge, so that semantic gaps between different modalities can be relieved, the technical problem that cross-modality data processing is difficult to effectively realize in the related technology and the performance of a cross-modality data processing method is poor is solved, and the technical effects of improving the efficiency of cross-modality data processing and optimizing the cross-modality data processing performance are achieved.

Description

Cross-modal data processing method and device, storage medium and electronic device

Technical Field

The embodiment of the invention relates to the field of communication, in particular to a cross-mode data processing method and device, a storage medium and an electronic device.

Background

In practice, objects may be described with features from different modalities, for example, in social platforms such as WeChat, people often record some event that occurs using pictures and corresponding text. Cross-modality retrieval is intended to use an instance in one modality to retrieve an instance in another modality that is semantically similar to it, e.g., to retrieve documents related to it with an image. With the development of multimedia technology, the amount of multimodal data is also rapidly increasing. On a large-scale multi-modal dataset, how to accomplish information retrieval between different modalities is a very challenging problem. For the problem, the low storage cost and the high retrieval speed of the hash method cause the hash method to be widely concerned in the cross-modal retrieval field.

The inconsistency of the data distribution and data representation of different modalities makes it very difficult to directly perform similarity measurements between different modalities. This difficulty, which may also be referred to as the "modal gap," is a major obstacle affecting the performance of cross-modal hash retrieval. Due to the 'modal gap', the retrieval performance of the existing cross-modal hash method can not meet the requirements of people. Moreover, for the existing cross-modal hash retrieval methods based on the shallow structure, most of the existing cross-modal hash retrieval methods use manual features, and the features have no universality for different cross-modal retrieval tasks, so that the identification capability of hash codes learned by the existing cross-modal hash retrieval methods is limited, and further, the retrieval performance of the shallow cross-modal hash retrieval methods cannot reach the optimal performance.

Therefore, in the related art at present, in the process of performing cross-modal data processing, the efficiency of data processing is low, and the performance is far from meeting the user requirements.

Aiming at the technical problems that cross-modal data processing is difficult to realize efficiently and the performance of a method for performing the cross-modal data processing is poor in the related technology, an effective solution is not provided at present.

Disclosure of Invention

Embodiments of the present invention provide a cross-modal data processing method, device, storage medium, and electronic device, so as to at least solve the technical problem that it is difficult to efficiently implement cross-modal data processing in the related art, and the performance of a method for performing cross-modal data processing is poor.

According to an embodiment of the present invention, there is provided a cross-modality data processing method, including:

acquiring query data of a first mode; respectively determining a target parameter between the query data of the first modality and the retrieval data of each second modality in the retrieval data set of the second modality to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of retrieval data of the second modality, the retrieval data of the second modality is data obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model by using a group of samples, and the target neural network model comprises a first modality attention network model and a second modality attention network model obtained by training based on the initial attention model, the modal consistency model is used for maintaining data consistency between a first modality and a second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, and the object characteristic data is object characteristic data obtained through an image object detection mode; and determining one or more retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters.

Optionally, before acquiring query data of the first modality, the method further comprises: acquiring a cross-modal dataset, wherein the cross-modal dataset comprises a training dataset and a testing dataset; training an initial neural network model by using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modal consistency model, and the target neural network model is used for learning a Hash expression mode for data of different modalities; inputting the test data set into the target neural network model to obtain similarity of first modality data and second modality data, wherein the similarity is used for indicating the similarity between the first modality data and the second modality data; determining predetermined parameters in the initial neural network model based on the similarity to update the target neural network model.

Optionally, acquiring a cross-modality data set comprises: extracting a feature data set of a first modality using a convolutional neural network; extracting a feature data set of a second modality by using a long-short term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; determining the feature data set of the first modality and the feature data set of the second modality except the training data set as the test data set.

Optionally, training an initial neural network model using the training data set to obtain a target neural network model, including: inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model; determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model; determining the target object attention neural network model and the target modality consistency model as the target neural network model.

Optionally, inputting feature data of a first modality in the training data set and the object feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and inputting feature data of a second modality in the training data set and the object feature data into the second modality attention network model for training to obtain a trained second modality attention network model, including: under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.

Optionally, constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model includes: performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space; inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.

Optionally, determining one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters includes: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to a plurality of pieces of retrieval data in the second modality in the retrieval data set in the second modality; calculating a hamming distance for each hash code of the third hash code and the set of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes with the Hamming distance smaller than or equal to the preset threshold value as one or more retrieval data of the second modality corresponding to the query data of the first modality.

According to another embodiment of the present invention, there is provided a cross-modal data processing apparatus including: the acquisition module is used for acquiring query data of a first modality; a processing module, configured to determine a target parameter between the query data of the first modality and each of the retrieval data of the second modalities respectively to obtain a plurality of target parameters, where the retrieval data set of the second modality includes a plurality of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used to indicate similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of samples, and the target neural network model includes a first-modality attention network model and a second-modality attention network model obtained by training based on the initial attention model, the modal consistency model is used for maintaining data consistency between a first modality and a second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, and the object characteristic data is object characteristic data obtained through an image object detection mode;

and the determining module is used for determining one or more retrieval data of the second modality into target data corresponding to the query data of the first modality according to the target parameters.

According to yet another embodiment of the invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps in any of the above method embodiments.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in any of the above method embodiments when executing the computer program.

According to the invention, the query data of the first mode is acquired, the target parameter between the query data of the first mode and each retrieval data of the second mode in the retrieval data set of the second mode is respectively determined to obtain a plurality of target parameters, one or more retrieval data of the second mode are determined as the target data corresponding to the query data of the first mode according to the target parameters, and the first mode and the second mode are effectively associated by using the object characteristic data as a bridge, so that semantic gap between different modes can be relieved, the technical problems that cross-mode data processing is difficult to effectively realize in the related technology and the performance of a cross-mode data processing method is poor can be solved, the efficiency of cross-mode data processing is improved, and the technical effect of the cross-mode data processing performance is optimized.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an alternative cross-modality data processing method according to an embodiment of the present invention;

FIG. 2 is a schematic flow diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;

FIG. 3 is a schematic flow diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an alternative cross-modality data processing method according to an embodiment of the present invention;

FIG. 5 is a block diagram of an alternative cross-modality data processing apparatus, according to an embodiment of the present invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the present invention running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a cross-mode data processing method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program of an application software and a module, such as a computer program corresponding to the cross-modal data processing method in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

In this embodiment, a cross-modal data processing method operating on a mobile terminal, a computer terminal, or a similar computing device is provided, fig. 2 is a schematic flowchart of an alternative cross-modal data processing method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:

s202, acquiring query data of a first mode;

s204, respectively determining a target parameter between the query data of the first modality and the retrieval data of each second modality in the retrieval data set of the second modality to obtain a plurality of target parameters, wherein the retrieval data set of the second modality comprises a plurality of retrieval data of the second modality, the retrieval data of the second modality is data obtained by inputting the original data of the second modality into a target neural network model, the target parameter is used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training the initial neural network model by using a group of samples, the target neural network model comprises a first modality attention network model and a second modality attention network model which are obtained by training based on the initial attention model, and a modality consistency model used for maintaining the data consistency between the first modality and the second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, wherein the object characteristic data are obtained in an image object detection mode;

and S206, determining one or more second-mode retrieval data as target data corresponding to the query data of the first mode according to the target parameters.

Optionally, in the present embodiment, the first modality may include, but is not limited to, image, text, voice, video, motion capture, and the like. The second modality may include, but is not limited to, images, text, voice, video, motion capture, etc., and the first modality and the second modality are different modalities, for example, the first modality is images and the second modality is text, or the first modality is captured images and the second modality is images generated by simulation after motion capture.

Optionally, in this embodiment, the query data in the first modality may include, but is not limited to, a vector obtained by performing feature extraction on the data acquired in the first modality, and may also include, but is not limited to, a hash code generated by the vector obtained by performing feature extraction on the data acquired in the first modality.

Optionally, in this embodiment, the search data in the second modality may include, but is not limited to, a vector obtained by performing feature extraction on the data acquired in the second modality, and may further include, but is not limited to, a hash code generated by the vector obtained by performing feature extraction on the data acquired in the second modality, where the search data set in the second modality is a set composed of a plurality of predetermined search data in the second modality.

Optionally, in this embodiment, the target parameter may include, but is not limited to, a hamming distance between the hash code corresponding to the query data in the first modality and the hash code corresponding to the search data in the second modality, and the similarity may include, but is not limited to, representing by comparing magnitudes of the hamming distances, where the hamming distance is negatively related to the similarity, that is, the smaller the hamming distance is, the more similar the query data in the first modality and the search data in the second modality are.

Optionally, in this embodiment, the target neural network model may include, but is not limited to, one or more attention mechanism configuration-based neural network models, one or more convolutional neural network models, one or more modal consistency models, and may include, but is not limited to, a combination of one or more of the above.

Optionally, in this embodiment, the object feature data may include, but is not limited to, object feature data extracted from an image acquired by an image acquisition device through an image detection algorithm.

According to the embodiment, query data of a first modality are obtained, a target parameter between the query data of the first modality and each retrieval data of a second modality in a retrieval data set of the second modality is respectively determined to obtain a plurality of target parameters, one or more retrieval data of the second modality are determined to be target data corresponding to the query data of the first modality according to the target parameters, object feature data are used as a bridge to effectively associate the first modality with the second modality, semantic gaps among different modalities can be further relieved, the technical problem that cross-modality data processing is difficult to effectively achieve in the related technology and the performance of a cross-modality data processing method is poor can be solved, the efficiency of cross-modality data processing is improved, and the technical effect of the cross-modality data processing performance is optimized.

In an optional embodiment, the method further comprises: before query data of a first modality is acquired, acquiring a cross-modality data set, wherein the cross-modality data set comprises a training data set and a testing data set; training an initial neural network model by using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modal consistency model, and the target neural network model is used for learning a Hash expression mode for data of different modalities; inputting the test data set into the target neural network model to obtain similarity of first modality data and second modality data, wherein the similarity is used for indicating the similarity between the first modality data and the second modality data; determining predetermined parameters in the initial neural network model based on the similarity to update the target neural network model.

Optionally, in this embodiment, the hash representation may include, but is not limited to, establishing a valid semantic association between the image modality data and the text modality data, and is

And

learning unified hash representations

And

and further, cross-modal data processing is performed by using uniform hash representation.

Optionally, in this embodiment, for example, the first modality is an image modality, and the second modality is a text modality, and the media types included in the cross-modality data set are images and texts. For the image, pixel features are used as original input features in the network, and a convolutional neural network structure based on VGGNet-19 is used as a feature extractor, so that the method also supports other convolutional neural network structures for image feature extraction; for text, Word Embedding (Word Embedding) vectors are used as original input features, and Long Short-Term Memory (LSTM) neural networks are used as feature extractors.

In an alternative embodiment, acquiring a cross-modality data set includes: extracting a feature data set of a first modality using a convolutional neural network; extracting a feature data set of a second modality by using a long-short term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; determining the feature data set of the first modality and the feature data set of the second modality except the training data set as the test data set.

Optionally, in this embodiment, taking the first modality as an image modality and the second modality as a text modality as an example, the use of the first modality may include, but is not limited to

A set of pixel feature vectors representing n objects in an image modality, wherein v_iAnd representing the pixel characteristic vector of the ith object in the image modality. Order to

A feature vector representing the n objects in a text modality, wherein t_iAnd representing the feature vector of the ith object in the text mode. A class label vector of n objects (corresponding to the aforementioned object feature data) is represented as

Where c represents the number of object classes. For vector y_iIn other words, if the ith object belongs to the kth class, let vector y_iIs 1, otherwise let vector y_iIs 0. After using the object attention model, for the ith object, let

Represents its output characteristics in an image modality, wherein_vUnknown parameters of the image modality; order to

Representing its output characteristics in a text modality, wherein_tThe feature vector sets V and T are training data sets for unknown parameters of the text mode.

Alternatively, in the present embodiment, use is made of

Feature vectors representing a query sample of an image modality, feature vectors using a query sample of a text modality

Representing the feature vector set of the image mode sample in the test data set as

The feature vector set of the text mode sample in the test data set is

Wherein the content of the first and second substances,

representing the number of samples in the test data set.

In an alternative embodiment, training an initial neural network model using the training data set to obtain a target neural network model comprises: inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model; determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model; determining the target object attention neural network model and the target modality consistency model as the target neural network model.

Optionally, in this embodiment, for example, the first modality is an image modality, and the second modality is a text modality, the initial neural network model includes, but is not limited to, a subject attention network model and a modality consistency model. The object attention network model comprises: image attention networks and text attention networks. The object attention network model takes the characteristics of the image object obtained by target detection as a link, and uses the image attention network and the text attention network to connect the image mode and the text mode from high-level semantics, so that the retrieval accuracy of Hash expression is improved. In the process of learning the Hash expression of the cross-modal data, the modal consistency model enables the learned Hash expression to keep the inter-modal and intra-modal consistency of the original cross-modal data, so that the neighbor relation of the Hash expression is constrained in the original neighbor topological framework, the original semantic relation of different modal data is maintained, and the retrieval precision is improved.

It should be noted that, for example, the first modality is an image modality, and the second modality is a text modality, where the object attention network model includes: the image attention network and the text attention network, and the object attention network model has three input data: image modal data, text modal data, image object data obtained by target detection. The image modal data is trained on the basis of receiving the migration knowledge by using an image attention network; the text modality data is trained using a text attention network. The object attention network model semantically associates the image modality and the text modality from a high level using the image attention network and the text attention network with the characteristics of the image object obtained by the target detection as a link.

In an optional embodiment, inputting feature data of a first modality in the training data set and the subject feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and inputting feature data of a second modality in the training data set and the subject feature data into the second modality attention network model for training to obtain a trained second modality attention network model, including: under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.

Optionally, in this embodiment, the image attention network migrates knowledge learned from the ImageNet classification task to VGGNet-19 and migrates the output of the last pooling layer of VGGNet-19

As an input feature of the image attention model, R is the number of image partitions. The image attention model (corresponding to the aforementioned first modality attention network model) maps the features O of each object_mAnd image partition characteristics

Inputting to a single-layer neural network and generating an attention distribution of an image on different image partitions by using a softmax function (corresponding to the aforementioned first preset function), namely:

wherein, "; "denotes the concatenation of vectors.

Is represented in a given object feature O_mThe attention probability of each image partition. Based on attention distribution

And I_jThe corresponding new feature vector is

(corresponding to the first target feature vector described earlier).

Optionally, in this embodiment, the text attention network (corresponding to the aforementioned second modality attention network model) extracts text features using an LSTM (long short term memory) network

And by using formulae

(corresponding to the second predetermined function described above) may be obtained as_iCorresponding new feature vector

(corresponding to the second target feature vector described previously).

In addition, let

Representing the output characteristics of n objects after the image modality is processed by two fully-connected layers, wherein theta_vUnknown parameters (corresponding to the aforementioned first preset parameters) of the image modality; order to

Representing the output characteristics of n objects after the text mode is processed by two fully-connected layers, wherein theta_tIs an unknown parameter of the text modality (corresponding to the aforementioned second preset parameter).

Suppose the characteristic f (v) of the ith object in the image modality and the text modality_i；θ_v) And g (t)_i；θ_t) Separately generating Hash codes in Hamming space

And

then, cross-modal hash learning can be performed by optimizing the following loss function:

wherein

And

for a k-bit hash code consisting of-1 and +1, the vectors of the ith columns of matrices F and G are respectively F (v)_i；θ_v) And g (t)_i；θ_t)。

In an optional embodiment, constraining the feature data of the first modality and the feature data of the second modality using the initial modality consistency model based on the semantic information of the feature data of the first modality and the semantic information of the feature data of the second modality to update the initial modality consistency model to a target modality consistency model includes: performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space; inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.

Optionally, in this embodiment, the modal consistency model is used to make the learned hash representation maintain inter-modal and intra-modal consistency of the original cross-modal data, so as to constrain the neighbor relation of the hash representation in the original neighbor topology framework, so as to maintain the original semantic relation of the different modal data. Modal coherence maintenance can be achieved by optimizing the following penalty function (corresponding to the target penalty function described previously):

wherein L is D-W,

denotes the ith diagonal element, w, of the diagonal matrix D_ijFor the element in the ith row and the jth column of the matrix W, B ═ B₁，b₂，...，b_n]^T∈{-1，+1}^n×k，

trace (·) represents the trace of the matrix,

namely the aboveThe first hash-code is encoded by the first hash-code,

namely the second hash code.

The mahalanobis distance between two data points of the image mode and the text mode respectively,

the Euclidean distance between two data points of an image mode and a text mode respectively, and lambda and beta are distance measurement balance factors. When the ith data point of the image modality and the jth data point of the text modality have the same semantic label, enabling C_ijOtherwise, let C be 10_ij0. Because the network architecture of the method is an end-to-end structure, the image attention network and the text attention network can be jointly trained through a back propagation algorithm.

In an optional embodiment, determining one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters comprises: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to a plurality of pieces of retrieval data in the second modality in the retrieval data set in the second modality; calculating a hamming distance for each hash code of the third hash code and the set of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes with the Hamming distance smaller than or equal to the preset threshold value as one or more retrieval data of the second modality corresponding to the query data of the first modality.

Optionally, in this embodiment, since the object attention network model is trained in (2), based on the trained model parameters, query data of one modality in a given test data set is subjected to one-time forward propagation calculation in the network, so that a uniform hash representation of the query data can be obtained. The similarity across modal data depends on the hamming distance between the uniform hash representations in hamming space. The smaller the hamming distance between the unified hash representations, the greater the similarity. In the present embodiment, the similarity between the image modality data and the text modality data is reflected by calculating a hamming distance between the unified hamming representations of the two.

Optionally, in this embodiment, one mode in the cross-mode test data set is used as a query data set, and the other mode is used as a retrieval data set, and cross-mode hash retrieval is performed to obtain a final retrieval result according to the similarity between the query data and the retrieval data.

The invention is further explained below with reference to specific examples:

aiming at the defects of the prior art, the invention provides a cross-modal Hash retrieval method based on an attention model, which can unify an object attention model and a modal consistency keeping model in a network architecture and realize the effective association of cross-modal data at a high-level semantic level. The method takes object features obtained by target detection as links, uses an image attention network and a text attention network to connect image modalities and text modalities from high-level semantics, restrains the neighbor relation of Hash representation in an original neighbor topological framework through a modality consistency model, maintains the original semantic relation of different modality data, enables the obtained Hash representation to be more suitable for cross-modality retrieval tasks, and improves the accuracy of cross-modality Hash retrieval.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

a cross-modal hash retrieval method based on an attention model, which is used for learning a unified hash representation of data in different modalities in a unified network architecture, so as to implement cross-modal retrieval, fig. 3 is a flowchart of an optional cross-modal data processing method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:

s302, constructing a cross-modal data set, and dividing data in the cross-modal data set into a training data set and a testing data set;

s304, training a cross-modal Hash learning model based on an attention model by using training data in a cross-modal data set, and learning uniform Hash expression for data in different modalities;

s306, obtaining uniform Hash expression of test data in the cross-modal data set by using the trained cross-modal Hash learning model parameters based on the attention model, and further calculating the similarity of the cross-modal data;

and S308, performing cross-modal Hash retrieval by taking one mode in the cross-modal test data set as a query data set and the other mode as a retrieval data set, and obtaining a final retrieval result according to the similarity between the query data and the retrieval data.

Further, a cross-modal hash retrieval method based on an attention model, where the cross-modal data set in step (1) includes two modality types, specifically, an image modality type and a text modality type.

Further, a cross-modal hash retrieval method based on an attention model, wherein the cross-modal hash learning model based on the attention model in the step (2) includes an object attention network model and a modal consistency model fused in a unified network architecture. The object attention network model comprises: image attention networks and text attention networks. The object attention network model takes the characteristics of the image object obtained by target detection as a link, and uses the image attention network and the text attention network to connect the image mode and the text mode from high-level semantics, so that the retrieval accuracy of Hash expression is improved. In the process of learning the Hash expression of the cross-modal data, the modal consistency model enables the learned Hash expression to keep the inter-modal and intra-modal consistency of the original cross-modal data, so that the neighbor relation of the Hash expression is constrained in the original neighbor topological framework, the original semantic relation of different modal data is maintained, and the retrieval precision is improved.

Further, in the attention model-based cross-modal hash retrieval method, the similarity of the cross-modal data in the step (3) depends on the hamming distance between the hash representations, and the smaller the hamming distance is, the greater the similarity is.

Further, a cross-modal hash retrieval method based on an attention model is characterized in that the cross-modal hash retrieval mode in the step (4) is that data of one type of modality is arbitrarily selected from the test data set in the step (1) to serve as a query sample, similarity calculation is carried out on all data of the other type of modality in the test set according to the cross-modal similarity calculation method in the step (3), then the data are sorted from large to small according to the similarity, and a retrieval result list is returned.

The invention has the following effects: compared with the existing method, the method can unify the object attention model and the modal consistency keeping model in one network architecture, and realize the effective association of cross-modal data in a high-level semantic level. In the process of uniform Hash expression learning of different modal data, the method utilizes an attention model to mine the semantic association of the multi-modal data, and utilizes a modal consistency model to keep the semantic association of the multi-modal data, thereby improving the accuracy of cross-modal retrieval.

The reason why the method has the above-mentioned inventive effect is that: the method unifies an object attention model and a modal consistency maintenance model into a network architecture. The attention model uses the object characteristics obtained by target detection as links, and uses an image attention network and a text attention network to semantically link an image modality and a text modality from a high level. The modal consistency model restrains the neighbor relation expressed by the Hash in the original neighbor topological framework, so that the original semantic relation of different modal data is maintained. The end-to-end network architecture formed by the two submodels fully excavates the semantic association of data in different modes, fully maintains the original semantic association of the data in different modes, promotes the learning of multi-mode data uniform hash representation, and improves the accuracy of cross-mode retrieval.

The attention model-based cross-modal hash retrieval method specifically includes, but is not limited to, the following:

(1) and constructing a cross-modal data set, and simultaneously dividing data in the cross-modal data set into a training data set and a testing data set.

In this embodiment, the media types included in the cross-modal dataset are images and texts. For the image, pixel features are used as original input features in the network, and a convolutional neural network structure based on VGGNet-19 is used as a feature extractor, so that the method also supports other convolutional neural network structures for image feature extraction; for text, Word Embedding (Word Embedding) vectors are used as original input features, and Long Short-Term Memory (LSTM) neural networks are used as feature extractors.

Use of

A feature vector representing the n objects in a text modality, wherein t_iAnd representing the feature vector of the ith object in the text mode. Representing a class label vector of n objects as

Representing its output characteristics in a text modality, wherein_tIs an unknown parameter of the text modality. The feature vector sets V and T are training data sets. Use of

The feature vector set of the text mode sample in the test data set is

Wherein the content of the first and second substances,

representing the number of samples in the test data set.

The goal of learning is to establish an effective semantic association between image modality data and text modality data by fusing an object attention model and a modality consistency model, and to do so for

And

learning unified hash representations

And

and further, a unified Hash representation is utilized to perform a cross-modal retrieval task.

(2) Training an attention model-based cross-modal hash learning model using training data in a cross-modal dataset for learning a uniform hash representation for data of different modalities.

The network structure constructed in this step is shown in fig. 4, and the attention model-based cross-modal hash learning model of the present invention includes an object attention network model and a modal consistency model fused in a unified network architecture. Wherein the object attention network model comprises: the image attention network 402 and the text attention network 404, and the subject attention network model shares three inputs: image modality data 406, text modality data 408, and target detected image object data 410. The image modal data is trained on the basis of receiving the migration knowledge by using an image attention network; the text modality data is trained using a text attention network. The object attention network model semantically associates the image modality and the text modality from a high level using the image attention network and the text attention network with the characteristics of the image object obtained by the target detection as a link.

In this embodiment, the image attention network migrates knowledge learned from the ImageNet classification task to VGGNet-19 and migrates the output of the last pooling layer of VGGNet-19

As an input feature of the image attention model, R is the number of image partitions. The image attention model maps the features O of each object_mAnd image partition characteristics

Inputting the attention distribution of an image on different image partitions by using a softmax function, namely:

wherein, "; "denotes the concatenation of vectors.

And I_jThe corresponding new feature vector is

Text attention network extraction of text features using the LSTM (long short term memory) network

And by using formulae

Can be obtained with H_iCorresponding new feature vector

Order to

Representing the output characteristics of n objects after the image modality is processed by two fully-connected layers, wherein theta_vUnknown parameters of the image modality; order to

Representing the output characteristics of n objects after the text mode is processed by two fully-connected layers, wherein theta_tIs an unknown parameter of the text modality.

And

then, the following loss function can be optimized

Wherein

And

In this embodiment, the modal consistency model 412 is used to make the learned hash representation maintain inter-modal and intra-modal consistency of the original cross-modal data, so as to constrain the neighbor relation of the hash representation in the original neighbor topology framework, so as to maintain the original semantic relation of the data in different modalities. Modal consistency preservation can be achieved by optimizing the loss function as follows:

wherein L is D-W,

trace (·) represents the trace of the matrix,

ma's between two data points of image mode and text modeThe distance between the first and second electrodes,

(3) And obtaining uniform Hash expression of the test data in the cross-modal data set by using the trained cross-modal Hash learning model parameters based on the attention model, and further calculating the similarity of the cross-modal data.

After the object attention network model is trained in the step (2), based on trained model parameters, data of one mode in a test data set is given, and forward propagation calculation is performed once in the network, so that uniform hash representation of the data can be obtained. The similarity across modal data depends on the hamming distance between the uniform hash representations in hamming space. The smaller the hamming distance between the unified hash representations, the greater the similarity. In the present embodiment, the similarity between the image modality data and the text modality data is reflected by calculating a hamming distance between the unified hamming representations of the two.

(4) And taking one mode in the cross-mode test data set as a query data set and the other mode as a retrieval data set, performing cross-mode Hash retrieval, and obtaining a final retrieval result according to the similarity of the query data and the retrieval data.

The following experimental results show that compared with the existing method, the cross-modal Hash retrieval method based on the attention model can achieve higher retrieval accuracy.

The following describes the advantageous effects of the present invention with reference to specific experiments.

This example performed experiments on the Pascal VOC 2007 dataset. The Pascal VOC 2007 dataset contained 9963 images from 20 categories, each labeled. The data set was divided into a training set containing 5011 image-label pairs and a test set containing 4952 image-label pairs. The image modality uses the original pixel characteristics as input features. The text modality uses word-embedded vectors as input features. The experiment comprises two cross-modal retrieval tasks of image retrieval text and image retrieval by using the text. The results reported are the average of the results obtained from 10 random experiments. The following 3 methods were tested as experimental comparisons:

the prior method comprises the following steps: the Semantic depth Cross-modal Hashing method in the document "Semantic Deep Cross-modular Hashing" (author q.lin, w.cao, z.he, and z.he) improves a feature learning part by constructing Semantic label branches, so that the learned features can maintain Semantic information.

The prior method II comprises the following steps: the method is a depth Joint semantic reconstruction Hashing method in the document 'Deep Joint-correlation Hashing for Large-Scale Unverended Cross-Modal Retrieval' (authors S.Su, Z.ZHong, and C.ZHang), and a Joint semantic affine matrix constructed by the method can skillfully fuse original neighbor information from different modes.

The existing method is three: the Deep multi-scale Fusion Hashing method in the document 'Deep Multiscale Fusion Hashing for Cross-modular Retrieval' (author x.nie, b.wang, j.li, f.hao, m.jian, and y.yin) first designs different network branches for two modalities, and then fuses semantics of multiple scales by adopting a multi-scale Fusion model on each branch network to better mine semantic relevance.

In the experiment, the accuracy of the cross-modal retrieval is evaluated by using a MAP (mean Average precision) value which is commonly used in the field of information retrieval as an index, and the larger the MAP value is, the better the result of the cross-modal retrieval is.

Table 1 experimental results of the present invention and prior methods show

As can be seen from Table 1, the invention achieves an improvement in the retrieval accuracy in both the tasks of image retrieval of text and image retrieval of text compared with the existing method. In the first comparison method, the semantic label is used to maintain the semantic association among the original multi-modal data in the learned features, and the high-level semantic association mining among the multi-modal data is not sufficient. The second contrast method and the third contrast method focus on fusing semantic information of different modes from different layers, but the original semantic association of multi-mode data is not sufficiently considered. The invention unifies the object attention model and the modality consistency maintenance model into an end-to-end network architecture. The object attention network model semantically associates the image modality and the text modality from a high level using the image attention network and the text attention network with the characteristics of the image object obtained by the target detection as a link. The modal consistency model restrains the hash expressed neighbor relation in the original neighbor topology framework so as to maintain the original semantic relation of different modal data. The object attention model and the modal consistency keeping model fully mine and maintain high-level semantic information among multi-modal data, promote the learning of unified Hash expression of the multi-modal data, and improve the accuracy of cross-modal retrieval.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, a cross-mode data processing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description of the apparatus is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 5 is a block diagram of an alternative cross-modality data processing apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus including:

an obtaining module 502, configured to obtain query data in a first modality;

a processing module 504, configured to determine a target parameter between the query data of the first modality and each of the retrieval data of the second modalities to obtain a plurality of target parameters, respectively, where the retrieval data set of the second modality includes a plurality of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used to indicate similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of samples, and the target neural network model includes a first-modality attention network model and a second-modality attention network model obtained by training based on the initial attention model, the modal consistency model is used for maintaining data consistency between a first modality and a second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, and the object characteristic data is object characteristic data obtained through an image object detection mode;

a determining module 506, configured to determine one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.

In an optional embodiment, the apparatus is further configured to: before query data of a first modality is acquired, acquiring a cross-modality data set, wherein the cross-modality data set comprises a training data set and a testing data set; training an initial neural network model by using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modal consistency model, and the target neural network model is used for learning a Hash expression mode for data of different modalities; inputting the test data set into the target neural network model to obtain similarity of first modality data and second modality data, wherein the similarity is used for indicating the similarity between the first modality data and the second modality data; determining predetermined parameters in the initial neural network model based on the similarity to update the target neural network model.

In an alternative embodiment, the apparatus is configured to acquire the cross-modal dataset by: extracting a feature data set of a first modality using a convolutional neural network; extracting a feature data set of a second modality by using a long-short term memory neural network; determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set; determining the feature data set of the first modality and the feature data set of the second modality except the training data set as the test data set.

In an alternative embodiment, the apparatus is configured to train an initial neural network model using the training data set to obtain a target neural network model by: inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model; determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model; constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model; determining the target object attention neural network model and the target modality consistency model as the target neural network model.

In an optional embodiment, the apparatus is configured to input feature data of a first modality in the training data set and the subject feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and input feature data of a second modality in the training data set and the subject feature data into the second modality attention network model for training to obtain a trained second modality attention network model, as follows: under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data; determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model; and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.

In an alternative embodiment, the apparatus is configured to use the initial modality consistency model to constrain the feature data of the first modality and the feature data of the second modality based on the semantic information of the feature data of the first modality and the semantic information of the feature data of the second modality to update the initial modality consistency model to a target modality consistency model by: performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space; inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.

In an alternative embodiment, the apparatus is configured to determine one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters by: determining a third hash code corresponding to the query data of the first modality; querying a group of hash codes corresponding to a plurality of pieces of retrieval data in the second modality in the retrieval data set in the second modality; calculating a hamming distance for each hash code of the third hash code and the set of hash codes; and determining the retrieval data of the second modality corresponding to the hash codes with the Hamming distance smaller than or equal to the preset threshold value as one or more retrieval data of the second modality corresponding to the query data of the first modality.

It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.

Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.

In the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:

s1, acquiring query data of a first mode;

s2, respectively determining a target parameter between the query data of the first modality and the retrieval data of each second modality in the retrieval data set of the second modality to obtain a plurality of target parameters, wherein the retrieval data set of the second modality comprises a plurality of retrieval data of the second modality, the retrieval data of the second modality is data obtained by inputting the raw data of the second modality into a target neural network model, the target parameters are used for indicating the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training the initial neural network model by using a group of samples, the target neural network model comprises a first modality attention network model and a second modality attention network model obtained by training based on the initial attention model, and a modality consistency model used for keeping the data consistency between the first modality and the second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, wherein the object characteristic data are obtained in an image object detection mode;

and S3, determining one or more second-mode retrieval data as target data corresponding to the first-mode query data according to the target parameters.

The computer readable storage medium is further arranged to store a computer program for performing the steps of:

s1, acquiring query data of a first mode;

In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

In an exemplary embodiment, the processor may be configured to execute the following steps by a computer program:

s1, acquiring query data of a first mode;

For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.

It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A cross-mode data processing method is characterized by comprising the following steps:

acquiring query data of a first mode;

respectively determining a target parameter between the query data of the first modality and the retrieval data of each second modality in the retrieval data set of the second modality to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of retrieval data of the second modality, the retrieval data of the second modality is data obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used for indicating similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model by using a group of samples, and the target neural network model comprises a first modality attention network model and a second modality attention network model obtained by training based on the initial attention model, the modal consistency model is used for maintaining data consistency between a first modality and a second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, and the object characteristic data is object characteristic data obtained through an image object detection mode;

and determining one or more retrieval data of the second modality as target data corresponding to the query data of the first modality according to the target parameters.

2. The method of claim 1, wherein prior to acquiring query data of the first modality, the method further comprises:

acquiring a cross-modal dataset, wherein the cross-modal dataset comprises a training dataset and a testing dataset;

training an initial neural network model by using the training data set to obtain a target neural network model, wherein the initial neural network model comprises an initial object attention neural network model constructed based on an attention mechanism and an initial modal consistency model, and the target neural network model is used for learning a Hash expression mode for data of different modalities;

inputting the test data set into the target neural network model to obtain similarity of first modality data and second modality data, wherein the similarity is used for indicating the similarity between the first modality data and the second modality data;

determining predetermined parameters in the initial neural network model based on the similarity to update the target neural network model.

3. The method of claim 2, wherein acquiring a cross-modality data set comprises:

extracting a feature data set of a first modality using a convolutional neural network;

extracting a feature data set of a second modality by using a long-short term memory neural network;

determining a portion of the feature data in the feature data set of the first modality and a portion of the feature data in the feature data set of the second modality as the training data set;

determining the feature data set of the first modality and the feature data set of the second modality except the training data set as the test data set.

4. The method of claim 2, wherein training an initial neural network model using the training data set to obtain a target neural network model comprises:

inputting feature data of a first mode in the training data set and the object feature data into the first mode attention network model for training to obtain a trained first mode attention network model, and inputting feature data of a second mode in the training data set and the object feature data into the second mode attention network model for training to obtain a trained second mode attention network model;

determining the trained first modality attention network model and the trained second modality attention network model as a target object attention neural network model;

constraining the feature data of the first modality and the feature data of the second modality based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality using the initial modality consistency model to update the initial modality consistency model to a target modality consistency model;

determining the target object attention neural network model and the target modality consistency model as the target neural network model.

5. The method according to claim 4, wherein inputting feature data of a first modality in the training data set and the subject feature data into the first modality attention network model for training to obtain a trained first modality attention network model, and inputting feature data of a second modality in the training data set and the subject feature data into the second modality attention network model for training to obtain a trained second modality attention network model, comprises:

under the condition that the feature data of the first modality is image data and the feature data of the second modality is text data, generating a first target attention distribution by using a first preset function for the object feature data and the image data, and generating a second target attention distribution by using a second preset function for the object feature data and the text data;

determining a first target feature vector corresponding to the image data based on the first target attention distribution to update a first preset parameter in the first modal attention network model, so as to obtain a trained first modal attention network model;

and determining a second target feature vector corresponding to the text data based on the second target attention distribution to update a second preset parameter in the second modal attention network model, so as to obtain a trained second modal attention network model.

6. The method according to claim 5, wherein constraining the feature data of the first modality and the feature data of the second modality using the initial modality consistency model based on semantic information of the feature data of the first modality and semantic information of the feature data of the second modality to update the initial modality consistency model to a target modality consistency model comprises:

performing target processing on the first target characteristic vector to obtain a first hash code in a Hamming space, and performing the target processing on the second target characteristic vector to obtain a second hash code in the Hamming space;

inputting the first hash code and the second hash code into a target loss function to update the initial modal consistency model to a target modal consistency model.

7. The method of claim 1, wherein determining one or more of the retrieved data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters comprises:

determining a third hash code corresponding to the query data of the first modality;

querying a group of hash codes corresponding to a plurality of pieces of retrieval data in the second modality in the retrieval data set in the second modality;

calculating a hamming distance for each hash code of the third hash code and the set of hash codes;

and determining the retrieval data of the second modality corresponding to the hash codes with the Hamming distance smaller than or equal to the preset threshold value as one or more retrieval data of the second modality corresponding to the query data of the first modality.

8. A cross-modal data processing apparatus, comprising:

the acquisition module is used for acquiring query data of a first modality;

a processing module, configured to determine a target parameter between the query data of the first modality and each of the retrieval data of the second modalities respectively to obtain a plurality of target parameters, where the retrieval data set of the second modality includes a plurality of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, the target parameter is used to indicate similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model using a set of samples, and the target neural network model includes a first-modality attention network model and a second-modality attention network model obtained by training based on the initial attention model, the modal consistency model is used for maintaining data consistency between a first modality and a second modality, each sample pair in the group of sample pairs comprises sample data and object characteristic data, and the object characteristic data is object characteristic data obtained through an image object detection mode;

9. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method as claimed in any of claims 1 to 7 are implemented when the computer program is executed by the processor.