WO2022068195A1 - Cross-modal data processing method and apparatus, storage medium, and electronic apparatus - Google Patents

Cross-modal data processing method and apparatus, storage medium, and electronic apparatus

Info

Publication number
WO2022068195A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
training
modality
neural network
network model
Prior art date
Application number
PCT/CN2021/091214
Other languages
English (en)
French (fr)
Inventor
董西伟
严军荣
张小龙
Original Assignee
三维通信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三维通信股份有限公司
Publication of WO2022068195A1 publication Critical patent/WO2022068195A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/325: Hash tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • Embodiments of the present invention relate to the field of communications, and in particular to a cross-modal data processing method and apparatus, a storage medium, and an electronic apparatus.
  • Embodiments of the present invention provide a cross-modal data processing method and apparatus, a storage medium, and an electronic apparatus, so as to at least solve the technical problems in the related art that cross-modal data processing is difficult to implement effectively and that existing cross-modal data processing methods perform poorly.
  • In one embodiment, a cross-modal data processing method is provided, comprising: acquiring query data of a first modality; determining a target parameter between the query data of the first modality and each item of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality includes a plurality of items of retrieval data of the second modality;
  • the retrieval data of the second modality is obtained by inputting raw data of the second modality into a target neural network model, and the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality;
  • the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs; it includes an encoder and a discriminator, where the encoder comprises a sample encoder and a class-label encoder; each sample pair includes sample data and class-label data, and the model is trained so that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator;
  • one or more items of retrieval data of the second modality are determined, according to the plurality of target parameters, as target data corresponding to the query data of the first modality.
  • Before acquiring the query data of the first modality, the method further includes repeating the following steps until the value of the objective function configured for the discriminator is minimized: acquiring first training data of the first modality, second training data of the second modality, and category label data; inputting the first training data and the category label data into a first initial neural network model to be trained to obtain a first training result, and inputting the second training data and the category label data into a second initial neural network model to be trained to obtain a second training result; and adjusting preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model.
  • Obtaining the first training result and the second training result includes: inputting the first training data into a first encoder to obtain first target data, and inputting the second training data into a second encoder to obtain second target data; inputting the category label data into a label encoder to obtain label data; inputting the first target data and the label data into a first discriminator to obtain a first discrimination result, and inputting the second target data and the label data into a second discriminator to obtain a second discrimination result; and determining the first discrimination result as the first training result and the second discrimination result as the second training result.
  • Adjusting the preset parameters of the target neural network model to obtain the target neural network model includes at least one of the following: determining the parameters of the target neural network model with a back-propagation algorithm based on the first training result and the second training result; determining the parameters of the target neural network model with a stochastic gradient descent algorithm based on the first training result and the second training result.
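As an illustrative aid only (not part of the disclosure), the stochastic-gradient-descent alternative mentioned above can be sketched in Python. The toy scalar model, learning rate, and target below are assumptions for the sketch; the actual method would backpropagate through the encoder and discriminator networks.

```python
# Minimal SGD parameter-update sketch on a toy scalar model y = w * x.
# The gradient is computed by hand here; a real implementation would use
# automatic differentiation (back-propagation) over the network parameters.

def sgd_step(w, x, y_true, lr=0.1):
    """One SGD step on the squared error 0.5 * (w*x - y_true)^2."""
    y_pred = w * x
    grad = (y_pred - y_true) * x  # d/dw of 0.5 * (w*x - y_true)^2
    return w - lr * grad

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=2.0, y_true=6.0)  # converges toward w = 3
```

Repeated updates drive the parameter toward the minimizer of the objective, mirroring how the preset parameters of the target neural network model are adjusted until the discriminator's objective function reaches its minimum.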
  • The method further includes: generating a triplet set based on the first training data and the second training data, wherein each triplet in the set includes first training data selected as an anchor, second training data with the same label as that first training data, and second training data with a different label from that first training data; minimizing, through an objective function, the Euclidean distance between the anchor first training data and the second training data with the same label; maximizing, through the objective function, the Euclidean distance between the anchor first training data and the second training data with a different label; and obtaining the constrained first training data and the constrained second training data.
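The triplet constraint above (pull same-label cross-modal pairs together, push different-label pairs apart) can be sketched as a standard hinge-style triplet loss. The feature vectors and margin below are illustrative values, not taken from the patent:

```python
# Sketch of the triplet objective: for an anchor from the first modality,
# the distance to a same-label second-modality sample should be smaller
# (by a margin) than the distance to a different-label sample.
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: max(0, d(anchor, pos) - d(anchor, neg) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor   = [1.0, 0.0]   # first-modality feature selected as the anchor
positive = [0.9, 0.1]   # second-modality feature with the same label
negative = [-1.0, 0.0]  # second-modality feature with a different label

loss = triplet_loss(anchor, positive, negative)  # zero when the constraint holds
```

A zero loss means the anchor is already closer to the same-label sample than to the different-label one by at least the margin; otherwise the loss gradient would move the features apart as the claim describes.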
  • The method further includes: processing the first training data with a sign function to obtain a first set of hash codes; inputting the first set of hash codes into a third discriminator to obtain a third discrimination result; determining the third discrimination result as a third training result; and training the third discriminator and the first encoder based on the third training result, wherein the first initial neural network model includes the first encoder. Likewise, the sign function is used to process the second training data to obtain a second set of hash codes; the second set of hash codes is input into a fourth discriminator to obtain a fourth discrimination result; the fourth discrimination result is determined as a fourth training result; and the fourth discriminator and the second encoder are trained based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • Before inputting the second training data and the category label data into the second initial neural network model to be trained and obtaining the second training result, the method further includes: processing the second training data with the sign function to obtain a second set of hash codes; inputting the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result; determining the fourth discrimination result as a fourth training result; and training the fourth discriminator and the second encoder based on the fourth training result, wherein the second initial neural network model includes the second encoder.
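The sign-function step above maps real-valued features to binary hash codes. A minimal sketch (the feature values are made up; the ±1 convention is one common choice, the patent does not fix a specific encoding here):

```python
# Sketch: derive a binary hash code from real-valued features with the
# sign function, the hash codes being what the discriminators then judge.

def sign_hash(features):
    """Map each real-valued feature to a bit: +1 if >= 0, else -1."""
    return [1 if f >= 0 else -1 for f in features]

features = [0.7, -0.2, 0.0, -1.3]  # illustrative encoder outputs
code = sign_hash(features)
```

The resulting code is what gets compared across modalities, since fixed-length binary codes make similarity search cheap.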
  • In another embodiment, a cross-modal data processing apparatus is provided, comprising: an acquisition module configured to acquire query data of a first modality; a processing module configured to determine a target parameter between the query data of the first modality and each item of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality includes a plurality of items of retrieval data of the second modality;
  • the retrieval data of the second modality is the data obtained by inputting the raw data of the second modality into the target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator; the encoder includes a sample encoder and a class-label encoder; each sample pair includes sample data and class-label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator;
  • a determining module is configured to determine, according to the plurality of target parameters, one or more items of retrieval data of the second modality as target data corresponding to the query data of the first modality.
  • The apparatus is further configured to: before acquiring the query data of the first modality, repeat the following steps until the value of the objective function configured for the discriminator is minimized: acquire first training data of the first modality, second training data of the second modality, and category label data; input the first training data and the category label data into the first initial neural network model to be trained to obtain a first training result, and input the second training data and the category label data into the second initial neural network model to be trained to obtain a second training result; and adjust the preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model.
  • The apparatus is further configured to obtain the first training result and the second training result in the following manner: input the first training data into the first encoder to obtain first target data, and input the second training data into the second encoder to obtain second target data; input the category label data into the label encoder to obtain label data; input the first target data and the label data into the first discriminator to obtain a first discrimination result, and input the second target data and the label data into the second discriminator to obtain a second discrimination result; determine the first discrimination result as the first training result and the second discrimination result as the second training result.
  • The apparatus is further configured to adjust the preset parameters of the target neural network model based on the first training result and the second training result in at least one of the following ways to obtain the target neural network model: determine the parameters of the target neural network model with a back-propagation algorithm based on the first training result and the second training result; determine the parameters of the target neural network model with a stochastic gradient descent algorithm based on the first training result and the second training result.
  • The apparatus is further configured to: generate a triplet set based on the first training data and the second training data, wherein each triplet in the set includes first training data selected as an anchor, second training data with the same label as that first training data, and second training data with a different label from that first training data; minimize, through an objective function, the Euclidean distance between the anchor first training data and the second training data with the same label; maximize, through the objective function, the Euclidean distance between the anchor first training data and the second training data with a different label; and obtain the constrained first training data and the constrained second training data.
  • The apparatus is further configured to: before inputting the first training data and the category label data into the first initial neural network model to be trained to obtain the first training result, and inputting the second training data and the category label data into the second initial neural network model to be trained to obtain the second training result, process the first training data with a sign function to obtain a first set of hash codes; input the first set of hash codes into a third discriminator to obtain a third discrimination result; determine the third discrimination result as a third training result; and train the third discriminator and the first encoder based on the third training result, wherein the first initial neural network model includes the first encoder. Likewise, the sign function is used to process the second training data to obtain a second set of hash codes; the second set of hash codes is input into a fourth discriminator to obtain a fourth discrimination result; the fourth discrimination result is determined as a fourth training result; and the fourth discriminator and the second encoder are trained based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • The apparatus is further configured to: before inputting the second training data and the category label data into the second initial neural network model to be trained and obtaining the second training result, process the second training data with a sign function to obtain a second set of hash codes; input the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result; determine the fourth discrimination result as the fourth training result; and train the fourth discriminator and the second encoder based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • A computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in any one of the above method embodiments.
  • An electronic apparatus is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to perform the steps in any one of the above method embodiments.
  • Through the above steps, the query data of the first modality is acquired, the target parameters between the query data of the first modality and each item of retrieval data of the second modality in the retrieval data set of the second modality are determined, and one or more items of retrieval data of the second modality are determined, according to the plurality of target parameters, as target data corresponding to the query data of the first modality. With the category label data serving as a bridge, the first modality and the second modality are effectively associated, which alleviates the semantic gap between different modalities, solves the technical problems in the related art that cross-modal data processing is difficult to implement effectively and that existing cross-modal data processing methods perform poorly, and achieves the technical effects of improving the efficiency of cross-modal data processing and optimizing its performance.
  • FIG. 1 is a hardware structural block diagram of a mobile terminal for an optional cross-modal data processing method according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of an optional cross-modal data processing method according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of an optional cross-modal data processing method according to an embodiment of the present invention.
  • FIG. 4 is a structural block diagram of an optional cross-modal data processing apparatus according to an embodiment of the present invention.
  • FIG. 1 is a hardware structural block diagram of a mobile terminal for a cross-modal data processing method according to an embodiment of the present invention.
  • The mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data; the above-mentioned mobile terminal may further include a transmission device 106 and an input/output device 108 configured for communication functions.
  • FIG. 1 is only a schematic diagram, which does not limit the structure of the above-mentioned mobile terminal.
  • the mobile terminal may also include more or fewer components than those shown in FIG. 1 , or have a different configuration than that shown in FIG. 1 .
  • The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the cross-modal data processing method in the embodiments of the present invention; the processor 102 runs the computer program stored in the memory 104 so as to execute various functional applications and data processing, that is, to implement the above-mentioned method.
  • Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory located remotely from the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • The transmission device 106 is configured to receive or transmit data via a network.
  • the specific example of the above-mentioned network may include a wireless network provided by a communication provider of the mobile terminal.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module, which is configured to communicate with the Internet in a wireless manner.
  • FIG. 2 is a schematic flowchart of an optional cross-modal data processing method according to an embodiment of the present invention; as shown in FIG. 2, the method includes the following steps:
  • S202: Acquire query data of a first modality;
  • S204: Determine the target parameter between the query data of the first modality and each item of retrieval data of the second modality in the retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality includes a plurality of items of retrieval data of the second modality; the retrieval data of the second modality is the data obtained by inputting the original data of the second modality into the target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is obtained by training the initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator; the encoder includes a sample encoder and a class-label encoder; each sample pair includes sample data and class-label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator;
  • S206: Determine one or more items of retrieval data of the second modality as target data corresponding to the query data of the first modality according to the plurality of target parameters.
  • The above-mentioned first modality may include, but is not limited to, image, text, voice, video, motion capture, and the like; the above-mentioned second modality may likewise include, but is not limited to, image, text, voice, video, motion capture, and the like, where the first and second modalities are different modalities. For example, the first modality is an image and the second modality is text, or the first modality is a captured image and the second modality is an image generated by simulation after motion capture.
  • The query data of the first modality may include, but is not limited to, a vector obtained after feature extraction is performed on data of the first modality, and may also include, but is not limited to, a hash code generated from that vector. The retrieval data of the second modality may similarly include, but is not limited to, a vector obtained after feature extraction is performed on data of the second modality, or a hash code generated from that vector. The target parameter may include, but is not limited to, the Hamming distance between the hash code corresponding to the query data of the first modality and the hash code corresponding to the retrieval data of the second modality; the above similarity can be expressed, but not exclusively, by comparing Hamming distances, where the Hamming distance is negatively correlated with the similarity: the smaller the Hamming distance, the more similar the query data of the first modality is to the retrieval data of the second modality.
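The Hamming-distance comparison described above can be sketched as follows. The ±1 hash codes and item names are illustrative placeholders, not outputs of the patent's model:

```python
# Sketch: compare a query hash code with each retrieval hash code by
# Hamming distance (smaller distance means higher similarity), then rank
# the retrieval data so the most similar items can be chosen as targets.

def hamming(a, b):
    """Number of positions at which two equal-length hash codes differ."""
    return sum(x != y for x, y in zip(a, b))

query = [1, -1, 1, 1]                 # hash code of the first-modality query
retrieval_set = {                     # hash codes of second-modality items
    "item_a": [1, -1, 1, -1],         # distance 1
    "item_b": [-1, 1, -1, -1],        # distance 4
    "item_c": [1, -1, 1, 1],          # distance 0
}

# Ascending Hamming distance = descending similarity to the query.
ranked = sorted(retrieval_set, key=lambda k: hamming(query, retrieval_set[k]))
```

Taking the first one or several entries of the ranking corresponds to determining the target data from the plurality of target parameters.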
  • The above-mentioned target neural network model may include, but is not limited to, one or more generative adversarial network models, one or more convolutional neural network models, one or more multi-scale fusion models, or a combination of one or more of the above.
  • The above-mentioned class-label encoder may include, but is not limited to, a module that performs feature extraction on the label data and encodes and decodes the corresponding label information as a feature vector; the above-mentioned class label may include, but is not limited to, the category label assigned during the classification process.
  • For example, the foregoing set of sample pairs may include the following content: (V, T) represents the image-text data pairs of n objects in the image modality and the text modality (corresponding to the aforementioned set of sample pairs), where V = {v_1, ..., v_n} is the set of pixel feature vectors of the n objects, v_i represents the pixel feature vector of the i-th object in the image modality, T = {t_1, ..., t_n} is the set of bag-of-words vectors of the n objects, and t_i represents the bag-of-words vector of the i-th object.
  • Through this embodiment, the query data of the first modality is acquired, the target parameters between the query data of the first modality and each item of retrieval data of the second modality in the retrieval data set of the second modality are determined, and the retrieval data of one or more second modalities are determined as the target data corresponding to the query data of the first modality. With the category label data as a bridge, the first modality and the second modality are effectively associated, which can alleviate the semantic gap between different modalities, solve the technical problems in the related art that cross-modal data processing is difficult to implement effectively and that existing methods perform poorly, and achieve the technical effects of improving the efficiency of cross-modal data processing and optimizing its performance.
  • In an optional embodiment, before acquiring the query data of the first modality, the method further includes repeating the following steps until the value of the objective function configured for the discriminator is minimized: acquiring the first training data of the first modality, the second training data of the second modality, and the category label data; inputting the first training data and the category label data into the first initial neural network model to be trained to obtain the first training result, and inputting the second training data and the category label data into the second initial neural network model to be trained to obtain the second training result; and adjusting the preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model.
  • The above-mentioned objective function may include, but is not limited to, a first objective function of the first initial neural network model, which contains one or more first preset parameters, and a second objective function of the second initial neural network model, which contains one or more second preset parameters. In other words, for the training of the first neural network model, given the first preset parameters, reaching the minimum of the first objective function indicates that training is complete; for the training of the second neural network model, given the second preset parameters, reaching the minimum of the second objective function indicates that training is complete.
  • Taking the first training data of the first modality and the second training data of the second modality as an example, inputting the training data and the category label data into the first initial neural network model and the second initial neural network model may include the following:
  • First, construct the neural network LabNet, a deep neural network whose input data is the class label data. LabNet consists of an auto-encoder, denoted here as LabNet_Auto; the output feature F^(l) of the encoding layer of LabNet_Auto can be regarded as the semantic features learned by LabNet_Auto. Using F^(l) as supervision information, LabNet_Auto guides ImgNet and TxtNet toward better training, thereby reducing the semantic gap between the image modality and the text modality and relating the two modalities more closely in semantics.
  • LabNet_Auto needs to be well trained; to this end, it can be trained, but not exclusively, with the objective function given in Equation (3) and Equation (4), whose weighting coefficients are hyperparameters, where B^(v) and B^(t) are the hash codes of the image modality and the text modality, respectively.
  • In an optional embodiment, inputting the first training data and the category label data into the first initial neural network model to be trained to obtain the first training result, and inputting the second training data and the category label data into the second initial neural network model to be trained to obtain the second training result, includes: inputting the first training data into the first encoder to obtain the first target data, and inputting the second training data into the second encoder to obtain the second target data; inputting the category label data into the label encoder to obtain label data; inputting the first target data and the label data into the first discriminator to obtain the first discrimination result, and inputting the second target data and the label data into the second discriminator to obtain the second discrimination result; and determining the first discrimination result as the first training result and the second discrimination result as the second training result.
  • The above-mentioned first encoder may include, but is not limited to, a convolutional neural network that first performs high-level semantic feature learning in the image modality; for convenience, its output is denoted G^(v) = g^(v)(V; θ^(v)), where θ^(v) denotes the network parameters and the i-th vector of G^(v) corresponds to v_i.
  • The deep neural network of the image modality in the present invention also includes an image autoencoder (Image Autoencoder), denoted here as ImgNet_Auto, for further mining the high-level semantic information contained in the image-modality data; the i-th vectors in its features F^(v) and Q^(v) correspond to the i-th object.
  • For the text modality, the bag-of-words vectors are first processed using a multi-scale fusion model consisting of multiple mean-pooling layers and 1×1 convolutional layers, denoted TxtNet_MSF. This multi-scale fusion model is helpful for discovering the relationships between different words, which in turn helps mine the high-level semantic information contained in the text-modality data.
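The multi-scale idea can be sketched in a dependency-free way: average-pool the bag-of-words vector at several window sizes and fuse the pooled views. The vector, the scales, and the plain concatenation below are assumptions for the sketch; the actual TxtNet_MSF fuses the scales with learned 1×1 convolutions:

```python
# Sketch of multi-scale mean pooling over a bag-of-words vector: each
# window size captures word co-occurrence statistics at a different
# granularity, and the pooled views are concatenated as a crude "fusion".

def mean_pool(vec, window):
    """Average non-overlapping windows of the given size."""
    return [sum(vec[i:i + window]) / window for i in range(0, len(vec), window)]

bow = [1.0, 3.0, 2.0, 4.0, 0.0, 6.0, 5.0, 7.0]  # illustrative word counts
scales = [2, 4]                                  # two pooling granularities

fused = []
for w in scales:
    fused.extend(mean_pool(bow, w))              # concatenate the scales
```

Coarser windows summarize broader groups of words, which is the sense in which multiple scales help relate different words to each other.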
  • the present invention uses An adversarial learning strategy is applied to the learning process of features F (l) , F (v) and F (t) .
  • the present invention designs two "inter-modal discriminators" to complete the discrimination task of the adversarial learning strategy between different modalities, these two discriminators are respectively: mark-image discriminator DLI (corresponding to the aforementioned first discriminator) and token-text discriminator DLT (corresponding to the aforementioned second discriminator).
  • for the label-image discriminator D LI, its input data are the output features F^(l) of LabNet Auto and the output features F^(v) of ImgNet Auto. The discriminator D LI aims to distinguish the "real data" F^(l) from the "fake data" F^(v) as well as possible.
  • the two possible outputs of the discriminator D LI can be denoted by "0" and "1" respectively; specifically, "1" indicates that the discriminator D LI has made a correct distinction, and "0" indicates that the discriminator D LI has made a wrong distinction.
  • where θ D LI denotes the parameters of the discriminator D LI and D LI(·) denotes the output of the discriminator D LI.
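The inter-modal discriminator objective can be illustrated with a minimal logistic-discriminator sketch. The one-layer discriminator, the random stand-in features, and the binary cross-entropy form are assumptions for illustration; the patent's D LI is a deeper network with its own objective function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(real_feats, fake_feats, w, b):
    """Binary cross-entropy objective for an inter-modal discriminator:
    target '1' = real (label-encoder features), '0' = fake (image features)."""
    p_real = sigmoid(real_feats @ w + b)
    p_fake = sigmoid(fake_feats @ w + b)
    eps = 1e-12  # numerical guard for log
    return -np.mean(np.log(p_real + eps)) - np.mean(np.log(1.0 - p_fake + eps))

rng = np.random.default_rng(0)
f_l = rng.standard_normal((4, 6))  # stand-in for LabNet features F^(l)
f_v = rng.standard_normal((4, 6))  # stand-in for ImgNet features F^(v)
loss = discriminator_loss(f_l, f_v, w=rng.standard_normal(6), b=0.0)
print(loss)
```

Training the encoders to *increase* this loss while the discriminator decreases it is what aligns F^(v) with F^(l), the adversarial mechanism the text describes.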
  • minimizing these objectives can effectively associate the image modality and the text modality with the semantic feature F^(l) as a bridge, thereby easing the semantic gap between different modalities. This solves the technical problems in the related art that cross-modal data processing is difficult to realize effectively and that the methods used for cross-modal data processing perform poorly, and achieves the technical effects of improving the efficiency of cross-modal data processing and optimizing its performance.
  • the method further includes: generating a triplet set based on the first training data and the second training data, wherein each triplet in the triplet set includes first training data selected as an anchor point, second training data with the same label as the first training data, and second training data with a different label from the first training data; minimizing, through an objective function, the Euclidean distance between the first training data selected as the anchor point and the second training data with the same label; maximizing the Euclidean distance between the first training data selected as the anchor point and the second training data with a different label; and obtaining the constrained first training data and the constrained second training data.
  • the present invention applies triple constraints to the feature learning process of image modalities and text modalities.
  • the specific method is as follows: first, construct a triplet set in which v_i is the image feature vector chosen as the anchor point, the positive sample is a text vector from the text modality with the same label as v_i, and the negative sample is a text vector from the text modality with a different label from v_i.
  • the triplet constraint aims, through a triplet loss function, to minimize the distance between the anchor and the positive text sample and simultaneously maximize the distance between the anchor and the negative text sample. For each triplet, the loss is defined on the Euclidean distance between the anchor and the positive sample and the Euclidean distance between the anchor and the negative sample; the triplet loss function for all triples of the image modality is the sum of the per-triplet losses.
  • the semantic distribution of image modality data and text modality data can be adapted to each other by using triplet constraints, and then the semantic gap between different modalities can be reduced.
  • the image modality-specific information and the text modality-specific information can also be preserved by using triplet constraints.
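The triplet constraint above can be sketched with the standard margin-based triplet loss. The margin value and the toy vectors are illustrative assumptions; the patent does not reproduce its exact loss here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: pull the anchor toward the same-label
    sample and push it away from the different-label sample."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive Euclidean distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative Euclidean distance
    return max(0.0, d_pos - d_neg + margin)

v = np.array([0.0, 0.0])      # image anchor v_i
t_pos = np.array([0.1, 0.0])  # text sample with the same label
t_neg = np.array([3.0, 4.0])  # text sample with a different label
print(triplet_loss(v, t_pos, t_neg))  # 0.0: the positive is already much closer
```

Summing this quantity over all triplets anchored in the image modality gives the modality-level triplet loss the text refers to; swapping the roles of image and text gives the symmetric term.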
  • the method further includes: processing the first training data with a sign function to obtain a first set of hash codes; inputting the first set of hash codes into a third discriminator to obtain a third discrimination result; determining the third discrimination result as a third training result; and training the third discriminator and the first encoder based on the third training result, wherein the first initial neural network model includes the first encoder. Similarly, the second training data is processed using the sign function to obtain a second set of hash codes; the second set of hash codes is input into a fourth discriminator to obtain a fourth discrimination result; the fourth discrimination result is determined as a fourth training result; and the fourth discriminator and the second encoder are trained based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • the present invention designs two "intra-modal discriminators" to respectively complete the discrimination tasks of the adversarial learning strategy within each modality. The two discriminators are, respectively, the image modality discriminator D I (corresponding to the aforementioned third discriminator) and the text modality discriminator D T (corresponding to the aforementioned fourth discriminator).
  • for the discriminator D I, its input data are the output features G^(v) of ImgNet CNN and the output features Q^(v) of ImgNet Auto.
  • the role of the discriminator D I is to distinguish, as well as possible, the "real data" G^(v) from the reconstructed data Q^(v) corresponding to it. Therefore, the two possible outputs of the discriminator D I can be denoted by "0" and "1" respectively; specifically, "1" indicates that the discriminator D I has made a correct distinction, and "0" indicates that the discriminator D I has made a wrong distinction.
  • the following objective function can be designed for the discriminator D I :
  • where θ D I denotes the parameters of the discriminator D I and D I(·) denotes the output of the discriminator D I.
  • before inputting the second training data and the category labeling data into the second initial neural network model to be trained to obtain a second training result, the method further includes: processing the second training data with a sign function to obtain a second set of hash codes; inputting the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result; determining the fourth discrimination result as a fourth training result; and training the fourth discriminator and the second encoder based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • suppose the feature vector of a query sample of the image modality and the feature vector of a query sample of the text modality are given, together with the feature vector set of the samples in the image modality retrieval sample set and the feature vector set of the samples in the text modality retrieval sample set, where the number of samples in each retrieval sample set is also given.
  • the hash codes of the image modality and text modality query samples, and of the samples in the retrieval sample sets, are obtained by applying the sign function sign(·) to the corresponding deep network outputs, where θ^(v) and θ^(t) are the learned deep neural network parameters of the image modality and text modality, respectively.
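Hash code generation with the sign function can be sketched as follows. The zero-maps-to-+1 convention is an assumption for the sketch; the patent only specifies sign(·).

```python
import numpy as np

def to_hash_code(features):
    """Binarize real-valued network outputs into hash codes with sign(.),
    mapping each coordinate to +1/-1 (zero is sent to +1 here by convention)."""
    return np.where(features >= 0, 1, -1)

feats = np.array([[0.3, -1.2, 0.0, 2.5]])  # stand-in for network outputs
print(to_hash_code(feats))  # [[ 1 -1  1  1]]
```

The same binarization is applied to query samples and to every sample in the retrieval sets, so that similarity search reduces to Hamming-distance comparison between short binary codes.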
  • the preset parameters of the target neural network model are adjusted based on the first training result and the second training result to obtain the target neural network model, including at least one of the following: determining the parameters of the target neural network model using a back propagation algorithm based on the first training result and the second training result; determining the parameters of the target neural network model using a stochastic gradient descent algorithm based on the first training result and the second training result.
  • These unknown variables can be solved by jointly optimizing the generative loss function and the adversarial loss function shown in Equation (12) and Equation (13).
  • the present invention adopts the "Minimax Game" scheme to optimize formula (14) and solve the unknown variables.
  • the optimization problem of formula (14) is a very difficult optimization problem.
  • the present invention adopts an iterative optimization scheme to optimize formula (14): first, θ^(l) and B^(l) are solved by optimization; then, with θ^(l) and B^(l) fixed, θ^(v) and B^(v) are solved by optimization; similarly, with θ^(l) and B^(l) fixed, θ^(t) and B^(t) are solved by optimization.
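The alternating scheme can be illustrated on a toy two-block objective. The finite-difference gradients and the quadratic objective are illustrative assumptions; the patent alternates over θ^(l), θ^(v), θ^(t) and the hash codes with its own loss.

```python
import numpy as np

def alternating_minimize(f, x0, y0, steps=50, lr=0.1):
    """Fix one block of variables and take a gradient step on the other,
    then swap (central-difference gradients for illustration)."""
    def grad(g, z, eps=1e-6):
        return (g(z + eps) - g(z - eps)) / (2 * eps)
    x, y = x0, y0
    for _ in range(steps):
        x -= lr * grad(lambda u: f(u, y), x)  # update block 1, block 2 fixed
        y -= lr * grad(lambda u: f(x, u), y)  # update block 2, block 1 fixed
    return x, y

# toy objective (x - 1)^2 + (y + 2)^2, minimized at (1, -2)
x, y = alternating_minimize(lambda a, b: (a - 1) ** 2 + (b + 2) ** 2, 0.0, 0.0)
print(round(x, 3), round(y, 3))  # 1.0 -2.0
```

Block-coordinate schemes like this are the standard way to handle objectives, such as formula (14), that are hard to optimize over all variable groups jointly.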
  • FIG. 3 is a schematic diagram of an optional cross-modal data processing method according to an embodiment of the present invention.
  • the specific implementation process mainly includes the following steps. Suppose (V, T) represents the image-text data pairs of n objects in the image modality and the text modality, where V is the pixel feature vector set of the n objects, v_i represents the pixel feature vector of the i-th object in the image modality, T is the bag-of-words vector set of these n objects, and t_i represents the bag-of-words vector of the i-th object. For the n objects, the label of the i-th object is also given, where c denotes the number of object categories and (·)^T represents the transpose operation.
  • the convolutional neural network is first used to learn high-level semantic features in the image modality.
  • the deep neural network of the image modality in the present invention further includes an image autoencoder (Image Autoencoder) 304, which is used for further mining the high-level semantic information contained in the image modality data.
  • this image autoencoder is denoted as ImgNet Auto here, and the i-th vectors in F^(v) and Q^(v) correspond to the i-th image sample.
  • for the text modality, a multi-scale fusion model 308 composed of multiple mean pooling layers and 1×1 convolutional layers is first used to process the bag-of-words vectors; this multi-scale fusion model is denoted as TxtNet MSF.
  • This multi-scale fusion model TxtNet MSF is helpful for discovering the relationship between different words, which in turn helps to mine the high-level semantic information contained in the text modality data.
  • a text autoencoder (Text Autoencoder) 312 is also included in the text modality deep neural network TxtNet 310; this text autoencoder is referred to here as TxtNet Auto.
  • the method of the present invention also includes a neural network LabNet 314, which is a deep neural network whose input data is class labeling data.
  • LabNet is composed of an autoencoder.
  • F (l) can be regarded as the semantic features learned by LabNet Auto .
  • the present invention utilizes the output feature F^(l) of the coding layer of LabNet Auto as supervision information to guide ImgNet and TxtNet to train better, thereby reducing the semantic gap between the image modality and the text modality and making the image modality and text modality better associated semantically.
  • LabNet Auto needs to be well trained; for this reason, the present invention adopts the following objective function to train LabNet Auto:
  • the present invention designs the following goals:
  • ⁇ (v) and ⁇ (t) are hyperparameters
  • B (v) and B (t) are hash codes for image modality and text modality, respectively.
  • minimizing the two negative log-likelihood functions in Equation (3) and Equation (4) is equivalent to maximizing their corresponding likelihood functions. When s_ij = 1, minimization makes the similarity between the corresponding features larger; when s_ij = 0, minimization makes the similarity between them smaller. A similar goal is achieved by the minimization optimization of the other objective. Therefore, these minimizations can realize the effective association of the image modality and the text modality with the semantic feature F^(l) as a bridge, which can alleviate the semantic gap between different modalities.
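A common form of such a pairwise negative log-likelihood can be sketched as follows, under the assumption that theta_ij = 0.5 * <f_i, g_j> as in typical deep cross-modal hashing formulations; the patent's exact Equations (3) and (4) are not reproduced in this text.

```python
import numpy as np

def pairwise_nll(F, G, S):
    """Negative log-likelihood over all cross-modal feature pairs: with
    theta_ij = 0.5 * <f_i, g_j>, the likelihood of similarity label s_ij is
    sigmoid(theta_ij) for s_ij = 1 and 1 - sigmoid(theta_ij) for s_ij = 0."""
    theta = 0.5 * F @ G.T
    # -sum(s_ij * theta_ij - log(1 + exp(theta_ij))), computed stably
    return -np.sum(S * theta - np.logaddexp(0.0, theta))

F = np.array([[1.0, 1.0], [-1.0, -1.0]])  # e.g. label-encoder features F^(l)
G = np.array([[1.0, 1.0], [-1.0, -1.0]])  # e.g. image features F^(v)
S = np.eye(2)                             # s_ij = 1 only for matching pairs
print(pairwise_nll(F, G, S))
```

Minimizing this quantity pushes the inner product up for similar pairs (s_ij = 1) and down for dissimilar pairs (s_ij = 0), which is exactly the behavior the paragraph describes.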
  • the loss functions that measure the relationship between pairs of data are called pairwise losses in the present invention.
  • the present invention applies an adversarial learning strategy to the learning process of features F (l) , F (v) and F (t) .
  • the present invention designs two "inter-modal discriminators" to complete the discrimination tasks of the adversarial learning strategy between different modalities. These two discriminators are the label-image discriminator D LI 318 and the label-text discriminator D LT 320.
  • the discriminator D LI aims to distinguish the "real data" from the "fake data" as well as possible. Therefore, the two possible outputs of the discriminator D LI can be denoted by "0" and "1" respectively; specifically, "1" indicates that the discriminator D LI has made a correct distinction, and "0" indicates that the discriminator D LI has made a wrong distinction. Based on the above analysis, the following objective function can be designed for the discriminator D LI:
  • where θ D LI denotes the parameters of the discriminator D LI and D LI(·) denotes the output of the discriminator D LI.
  • the present invention applies triple constraints to the feature learning process of image modalities and text modalities.
  • the specific method is as follows: first, construct a triplet set in which v_i is the image feature vector chosen as the anchor point, the positive sample is a text vector from the text modality with the same label as v_i, and the negative sample is a text vector from the text modality with a different label from v_i. The image-text pair combining v_i and the positive text vector is called a positive image-text pair; similarly, the image-text pair combining v_i and the negative text vector is called a negative image-text pair.
  • symmetrically, a triplet set anchored in the text modality can be constructed; further, positive text-image pairs and negative text-image pairs can be constructed.
  • the triplet constraint 322 aims, through a triplet loss function, to minimize the distance between the anchor point and the positive text sample and to maximize the distance between the anchor point and the negative text sample. For each triplet, the loss is defined on the Euclidean distance between the anchor and the positive sample and the Euclidean distance between the anchor and the negative sample; the triplet loss function for all triples of the image modality is the sum of the per-triplet losses.
  • the semantic distribution of image modality data and text modality data can be adapted to each other by using triplet constraints, and then the semantic gap between different modalities can be reduced.
  • the image modality-specific information and the text modality-specific information can also be preserved by using triplet constraints.
  • the present invention introduces an adversarial learning strategy into the deep neural network training process of image modality and text modality.
  • the present invention designs two "intra-modal discriminators" to respectively complete the discrimination tasks of the adversarial learning strategy within each modality. The two discriminators are, respectively, the image modality discriminator D I 324 and the text modality discriminator D T 326.
  • for the discriminator D I, its input data are the output features G^(v) of ImgNet CNN and the output features Q^(v) of ImgNet Auto.
  • the role of the discriminator D I is to distinguish, as well as possible, the "real data" G^(v) from the reconstructed data Q^(v) corresponding to it. Therefore, the two possible outputs of the discriminator D I can be denoted by "0" and "1" respectively; specifically, "1" indicates that the discriminator D I has made a correct distinction, and "0" indicates that the discriminator D I has made a wrong distinction.
  • the following objective function can be designed for the discriminator D I :
  • where θ D I denotes the parameters of the discriminator D I and D I(·) denotes the output of the discriminator D I.
  • These unknown variables can be solved by jointly optimizing the generative loss function and the adversarial loss function shown in Equation (12) and Equation (13).
  • the present invention adopts the "Minimax Game" scheme to optimize formula (14) and solve the unknown variables.
  • the optimization problem of formula (14) is a very difficult optimization problem.
  • the present invention adopts an iterative optimization scheme to optimize formula (14): first, θ^(l) and B^(l) are solved by optimization; then, with θ^(l) and B^(l) fixed, θ^(v) and B^(v) are solved by optimization; similarly, with θ^(l) and B^(l) fixed, θ^(t) and B^(t) are solved by optimization.
  • suppose the feature vector of a query sample of the image modality and the feature vector of a query sample of the text modality are given, together with the feature vector set of the samples in the image modality retrieval sample set and the feature vector set of the samples in the text modality retrieval sample set, where the number of samples in each retrieval sample set is also given.
  • the hash codes of the image modality and text modality query samples, and of the samples in the retrieval sample sets, are obtained by applying the sign function sign(·) to the corresponding deep network outputs, where θ^(v) and θ^(t) are the learned deep neural network parameters of the image modality and text modality, respectively.
  • for a query sample of the image modality, the distance calculation formula is used to compute the Hamming distance from the image modality query sample to the samples in the text modality retrieval sample set. For a query sample of the text modality, the distance calculation formula is used to compute the Hamming distance from the text modality query sample to the samples in the image modality retrieval sample set.
  • the Hamming distances are first sorted from small to large, and then the samples corresponding to the top K minimum distances in the text retrieval sample set are taken as the retrieval results; similarly, the samples corresponding to the top K minimum distances in the image retrieval sample set are taken as the retrieval results.
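The Hamming-distance ranking step above can be sketched as follows (the +1/-1 code convention and the toy database are illustrative assumptions):

```python
import numpy as np

def hamming_distance(code, codes):
    """Hamming distance between one +/-1 hash code and a set of codes:
    the number of positions where the bits differ."""
    return np.sum(code != codes, axis=1)

def retrieve_top_k(query_code, db_codes, k):
    """Rank database codes by Hamming distance (ascending) and return the
    indices of the K nearest, as in the retrieval step described above."""
    d = hamming_distance(query_code, db_codes)
    return np.argsort(d, kind="stable")[:k]

query = np.array([1, -1, 1, 1])
db = np.array([[1, -1, 1, 1],     # distance 0
               [1, 1, 1, 1],      # distance 1
               [-1, 1, -1, -1]])  # distance 4
print(retrieve_top_k(query, db, 2))  # [0 1]
```

Because the codes are short binary vectors, this ranking is far cheaper than comparing real-valued features, which is the practical motivation for hashing-based cross-modal retrieval.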
  • the present invention performs experiments on the Pascal VOC 2007 data set to demonstrate its beneficial effects.
  • the Pascal VOC 2007 dataset contains 9963 images from 20 categories, each image is annotated with a label.
  • the dataset is divided into a training set containing 5011 image-label pairs and a test set containing 4952 image-label pairs.
  • the image modality uses raw pixel features as input features.
  • the text modality uses 399-dimensional word frequency features as input features.
  • the experiment mainly completes two cross-modal retrieval tasks: retrieving text with images and retrieving images with text. For convenience, these two cross-modal retrieval tasks are represented by Img2Txt and Txt2Img respectively.
  • the experiment uses the evaluation index MAP (Mean Average Precision) when evaluating the performance of the cross-modal hash retrieval method.
  • 5-fold cross-validation is used to determine the values of the hyperparameters in the method of the present invention.
  • the parameters in the comparison methods are set according to the principle of parameter setting recommended by each method.
  • the reported results are the average of the results obtained from 10 randomized experiments.
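The MAP metric used in these experiments can be sketched with a standard definition (the cutoff handling is simplified and the toy query below is illustrative):

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP of one query: the mean of precision-at-each-hit over the ranked list."""
    hits, precisions = 0, []
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(relevants, rankings):
    """MAP: the mean of per-query average precisions."""
    return float(np.mean([average_precision(r, rk)
                          for r, rk in zip(relevants, rankings)]))

# toy check: relevant items {0, 2} retrieved at ranks 1 and 3
print(average_precision({0, 2}, [0, 1, 2]))  # (1/1 + 2/3) / 2 ≈ 0.8333
```

Averaging AP over all queries of a task (Img2Txt or Txt2Img) yields the MAP values reported in Table 1.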
  • the methods contrasted with the method of the present invention are, respectively: (1) the PRDH method in the document "Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval" (authors E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao); (2) the MHTN method in the document "MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval" (authors X. Huang, Y. Peng, and M. Yuan); (3) the SSAH method in the document "Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval" (authors C. Li, C. Deng, N. Li, W.
  • Table 1 lists the MAPs obtained when the method of the present invention and the comparison methods perform cross-modal hash retrieval on the Pascal VOC 2007 dataset. It can be seen from Table 1 that for the two retrieval tasks Img2Txt and Txt2Img, the cross-modal retrieval performance of the method of the present invention is better than that of the PRDH, MHTN and SSAH methods. This shows that the method of the present invention is an effective deep cross-modal hash retrieval method, and that the scheme designed by the present invention for improving feature discrimination based on adversarial learning, triplet constraints and other techniques is effective.
  • a cross-modal data processing apparatus is also provided, and the apparatus is used to implement the above-mentioned embodiments and preferred implementation manners, which have been described and will not be repeated.
  • the term "module” may be a combination of software and/or hardware that implements a predetermined function.
  • the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
  • FIG. 4 is a structural block diagram of an optional cross-modal data processing apparatus according to an embodiment of the present invention. As shown in FIG. 4 , the apparatus includes:
  • an obtaining module 402, configured to obtain query data of the first modality;
  • a processing module 404, configured to respectively determine preset parameters between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality, so as to obtain a plurality of preset parameters, wherein the retrieval data set of the second modality includes a plurality of pieces of retrieval data of the second modality, and the retrieval data of the second modality is obtained by inputting original data of the second modality into the target neural network model; the preset parameters are used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator, the encoder includes a sample encoder and a class label encoder, and each sample pair includes sample data and class label data, so that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class label data into the class label encoder cannot be distinguished by the discriminator;
  • the determining module 406 is configured to determine, according to the plurality of preset parameters, one or more retrieval data of the second modality as target data corresponding to the query data of the first modality.
  • the apparatus is further configured to: before acquiring the query data of the first modality, repeat the following steps until the value of the objective function configured for the discriminator is the smallest: acquiring the first modality The first training data of a modality, the second training data of the second modality, and the category labeling data; input the first training data and the category labeling data into the first initial neural network model to be trained to obtain the first training results, and input the second training data and the category label data into the second initial neural network model to be trained to obtain a second training result; based on the first training results and the second training results, adjust The preset parameters of the target neural network model to obtain the target neural network model.
  • the apparatus is further configured to input the first training data and the category labeling data into the first initial neural network model to be trained to obtain a first training result, and to input the second training data and the category labeling data into the second initial neural network model to be trained to obtain a second training result, in the following manner: the first training data is input into the first encoder to obtain first target data, and the second training data is input into the second encoder to obtain second target data; the category label data is input into the label encoder to obtain label data; the first target data and the label data are input into the first discriminator to obtain a first discrimination result, and the second target data and the label data are input into the second discriminator to obtain a second discrimination result; the first discrimination result is determined as the first training result, and the second discrimination result is determined as the second training result.
  • the apparatus is further configured to adjust the preset parameters of the target neural network model based on the first training result and the second training result in at least one of the following manners: determining the parameters of the target neural network model using a back propagation algorithm based on the first training result and the second training result; determining the parameters of the target neural network model using a stochastic gradient descent algorithm based on the first training result and the second training result.
  • the apparatus is further configured to: generate a triplet set based on the first training data and the second training data, wherein each triplet in the triplet set includes first training data selected as an anchor point, second training data with the same label as the first training data, and second training data with a different label from the first training data; minimize, through an objective function, the Euclidean distance between the first training data selected as the anchor point and the second training data with the same label; maximize the Euclidean distance between the first training data selected as the anchor point and the second training data with a different label; and obtain the constrained first training data and the constrained second training data.
  • the apparatus is further configured to: before inputting the first training data and the category labeling data into the first initial neural network model to be trained to obtain a first training result, and inputting the second training data and the category labeling data into the second initial neural network model to be trained to obtain a second training result, process the first training data using a sign function to obtain a first set of hash codes; input the first set of hash codes into a third discriminator to obtain a third discrimination result; determine the third discrimination result as a third training result; and train the third discriminator and the first encoder based on the third training result, wherein the first initial neural network model includes the first encoder.
  • the apparatus is further configured to: before inputting the second training data and the category label data into the second initial neural network model to be trained to obtain a second training result, process the second training data using a sign function to obtain a second set of hash codes; input the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result; determine the fourth discrimination result as a fourth training result; and train the fourth discriminator and the second encoder based on the fourth training result, wherein the second initial neural network model includes the second encoder.
  • the above modules can be implemented by software or hardware; the latter can be implemented in the following ways, but is not limited thereto: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
  • Embodiments of the present invention further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
  • the above-mentioned computer-readable storage medium may be configured to store a computer program for performing the following steps: S1, acquiring query data of the first modality; S2, respectively determining target parameters between the query data of the first modality and each piece of retrieval data of the second modality in a retrieval data set of the second modality to obtain multiple target parameters, wherein the retrieval data set of the second modality includes a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting raw data of the second modality into the target neural network model, and the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training the initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator; the encoder includes a sample encoder and a class label encoder; each sample pair includes sample data and class labeling data, so that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class labeling data into the class label encoder cannot be distinguished by the discriminator; S3, determining, according to the multiple target parameters, the retrieval data of one or more second modalities as target data corresponding to the query data of the first modality.
  • the computer-readable storage medium is further configured to store a computer program for performing the following steps: S1, acquiring query data of the first modality; S2, respectively determining target parameters between the query data of the first modality and each piece of retrieval data of the second modality in the retrieval data set of the second modality to obtain multiple target parameters, wherein the retrieval data is obtained by inputting the original data of the second modality into the target neural network model, and the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training the initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator; the encoder includes a sample encoder and a class label encoder; each sample pair includes sample data and class labeling data, so that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class labeling data into the class label encoder cannot be distinguished by the discriminator; S3, determining, according to the multiple target parameters, the retrieval data of one or more second modalities as target data corresponding to the query data of the first modality.
  • the above-mentioned computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, a CD-ROM, or other media that can store a computer program.
  • An embodiment of the present invention also provides an electronic device, comprising a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
  • the above-mentioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the above-mentioned processor, and the input-output device is connected to the above-mentioned processor.
  • the above-mentioned processor may be configured to perform the following steps through a computer program: S1, acquiring query data of the first modality; S2, respectively determining target parameters between the query data of the first modality and each piece of retrieval data of the second modality in the retrieval data set of the second modality to obtain multiple target parameters, wherein the retrieval data set of the second modality includes a plurality of pieces of retrieval data of the second modality, the retrieval data of the second modality is obtained by inputting the original data of the second modality into the target neural network model, and the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training the initial neural network model using a set of sample pairs; the target neural network model includes an encoder and a discriminator; the encoder includes a sample encoder and a class label encoder; each sample pair includes sample data and class labeling data, so that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class labeling data into the class label encoder cannot be distinguished by the discriminator; S3, determining, according to the multiple target parameters, the retrieval data of one or more second modalities as target data corresponding to the query data of the first modality.
  • the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices.
  • they can be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be performed in an order different from the one given here.
  • alternatively, they can each be made into an individual integrated circuit module, or multiple modules or steps among them can be made into a single integrated circuit module.
  • the present invention is not limited to any particular combination of hardware and software.


Abstract

A cross-modal data processing method, apparatus, storage medium and electronic apparatus. The cross-modal data processing method includes: acquiring query data of a first modality; determining a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain multiple target parameters; and determining, according to the multiple target parameters, one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality. Using class label data as a bridge, the first modality and the second modality are effectively associated, which alleviates the semantic gap between different modalities. The method solves the technical problem in the related art that cross-modal data processing is difficult to implement effectively and that existing methods for cross-modal data processing perform poorly, thereby improving the efficiency of cross-modal data processing and optimizing its performance.

Description

Cross-Modal Data Processing Method, Apparatus, Storage Medium and Electronic Apparatus
Technical Field
Embodiments of the present invention relate to the field of communications, and in particular to a cross-modal data processing method, apparatus, storage medium and electronic apparatus.
Background Art
In practical applications, an object can be described by features from different modalities. For example, on social platforms such as WeChat, people often use pictures together with accompanying text to record an event. Cross-modal retrieval aims to use an instance in one modality to retrieve semantically similar instances in another modality, for example, using an image to retrieve related documents. With the development of multimedia technology, the amount of multi-modal data has grown rapidly, and completing information retrieval across different modalities on large-scale multi-modal data sets is a very challenging problem. For this problem, hashing methods have attracted wide attention in the field of cross-modal retrieval because of their low storage cost and high retrieval speed.
The inconsistency of data distributions and data representations across modalities makes it very difficult to measure similarity directly between different modalities. This difficulty, often called the "modality gap", is the main obstacle to cross-modal hashing retrieval performance. Because of the modality gap, the retrieval performance of existing cross-modal hashing methods is still far from meeting users' needs. Moreover, most existing shallow-structure cross-modal hashing retrieval methods use hand-crafted features that do not generalize across different cross-modal retrieval tasks; consequently, the discriminative power of the hash codes they learn is limited, and the retrieval performance of these shallow methods cannot reach the optimum.
Therefore, in the current related art, the efficiency of cross-modal data processing is low, and its performance is far from meeting users' needs.
No effective solution has yet been proposed for the technical problem in the related art that cross-modal data processing is difficult to implement effectively and that methods for cross-modal data processing perform poorly.
Summary of the Invention
Embodiments of the present invention provide a cross-modal data processing method, apparatus, storage medium and electronic apparatus, so as to at least solve the technical problem in the related art that cross-modal data processing is difficult to implement effectively and that methods for cross-modal data processing perform poorly.
According to one embodiment of the present invention, a cross-modal data processing method is provided, including: acquiring query data of a first modality; determining a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain multiple target parameters, wherein the retrieval data set of the second modality includes multiple pieces of retrieval data of the second modality, each piece of retrieval data of the second modality is obtained by inputting original data of the second modality into a target neural network model, the target parameter indicates the similarity between the query data of the first modality and the retrieval data of the second modality, the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs, the target neural network model includes an encoder and a discriminator, the encoder includes a sample encoder and a class label encoder, and each sample pair includes sample data and class label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class label data into the class label encoder cannot be distinguished by the discriminator; and determining, according to the multiple target parameters, one or more pieces of retrieval data of the second modality as target data corresponding to the query data of the first modality.
可选地,在获取第一模态的查询数据之前,所述方法还包括:重复执行以下步骤,直到为所述鉴别器所配置的目标函数的取值最小:获取第一模态的第一训练数据和第二模态的第二训练数据以及类别标记数据;将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果;基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型。
可选地,将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果,包括:将所述第一训练数据输入第一编码器,得到第一目标数据,将所述第二训练数据输入第二编码器,得到第二目标数据;将所述类别标记数据输入标记编码器,得到标签数据;将所述第一目标数据和所述标签数据输入第一鉴别器,得到第一鉴别结果,将所述第二目标数据和所述标签数据输入第二鉴别器,得到第二鉴别结果;将所述第一鉴别结果确定为所述第一训练结果,并将所述第二鉴别结果确定为所述第二训练结果。
可选地,基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型,包括以下至少之一:基于所述第一训练结果和所述第二训练结果使用后向传播算法确定所述目标神经网络模型的参数;基于所述第一训练结果和所述第二训练结果使用随机梯度下降算法确定所述目标神经网络模型的参数。
可选地,所述方法还包括:基于所述第一训练数据以及第二训练数据生成三元组集,其中,所述三元组集中的每个三元组包括被选为锚点的第一训练数据、与所述第一训练数据具有相同标记的第二训练数据以及与所述第一训练数据具有不同标记的第二训练数据;通过目标函数最小化所述被选为锚点的第一训练数据与所述第一训练数据具有相同标记的第二训练数据之间的欧氏距离;通过目标函数最大化所述被选为锚点的第一训练数据与所述第一训练数据具有不同标记的第二训练数据之间的欧氏距离;得到约束后的所述第一训练数据和约束后的所述第二训练数据。
可选地,在将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二 初始神经网络模型,得到第二训练结果之前,所述方法还包括:使用符号函数处理所述第一训练数据,得到第一组哈希编码;将所述第一组哈希编码输入第三鉴别器,得到第三鉴别结果;将所述第三鉴别结果确定为第三训练结果;基于所述第三训练结果训练所述第三鉴别器和第一编码器,其中,所述第一初始神经网络模型包括所述第一编码器;使用所述符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和所述第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
可选地,在将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果之前,所述方法还包括:使用符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
根据本发明的另一个实施例,提供了一种跨模态的数据处理装置,包括:获取模块,设置为获取第一模态的查询数据;处理模块,设置为分别确定所述第一模态的查询数据与第二模态的检索数据集合中每个第二模态的检索数据之间的目标参数,以得到多个目标参数,其中,所述第二模态的检索数据集合中包含多个所述第二模态的检索数据,所述第二模态的检索数据为将第二模态的原始数据输入目标神经网络模型后得到的数据,所述目标参数用于指示所述第一模态的查询数据与所述第二模态的检索数据的相似性,所述目标神经网络模型是使用一组样本对对初始神经网络模型进行训练得到的神经网络模型,所述目标神经网络模型包括编码器和鉴别器,所述编码器包括样本编码器和类别标记编码器,每个所述样本对包括样本数据以及类别标记数据,使得所述样本数据输入所述样本编码器所得到的数据和所述类别标记数据输入类别标记编码器所得到的数据无法被所述鉴别器区分开;确定模块,设置为根据所述多个目标参数将一个或多个所述第二模态的检索数据确定为与所述第一模态的查询数据对应的目标数据。
可选地,所述装置还设置为:在获取第一模态的查询数据之前,重复执行以下步骤,直到为所述鉴别器所配置的目标函数的取值最小:获取第一模态的第一训练数据和第二模态的第二训练数据以及类别标记数据;将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果;基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型。
可选地,所述装置还设置为通过如下方式将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果:将所述第一训练数据 输入第一编码器,得到第一目标数据,将所述第二训练数据输入第二编码器,得到第二目标数据;将所述类别标记数据输入标记编码器,得到标签数据;将所述第一目标数据和所述标签数据输入第一鉴别器,得到第一鉴别结果,将所述第二目标数据和所述标签数据输入第二鉴别器,得到第二鉴别结果;将所述第一鉴别结果确定为所述第一训练结果,并将所述第二鉴别结果确定为所述第二训练结果。
可选地,所述装置还设置为通过如下至少之一的方式基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型:基于所述第一训练结果和所述第二训练结果使用后向传播算法确定所述目标神经网络模型的参数;基于所述第一训练结果和所述第二训练结果使用随机梯度下降算法确定所述目标神经网络模型的参数。
可选地,所述装置还设置为:基于所述第一训练数据以及第二训练数据生成三元组集,其中,所述三元组集中的每个三元组包括被选为锚点的第一训练数据、与所述第一训练数据具有相同标记的第二训练数据以及与所述第一训练数据具有不同标记的第二训练数据;通过目标函数最小化所述被选为锚点的第一训练数据与所述第一训练数据具有相同标记的第二训练数据之间的欧氏距离;通过目标函数最大化所述被选为锚点的第一训练数据与所述第一训练数据具有不同标记的第二训练数据之间的欧氏距离;得到约束后的所述第一训练数据和约束后的所述第二训练数据。
可选地,所述装置还设置为:在将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果之前,使用符号函数处理所述第一训练数据,得到第一组哈希编码;将所述第一组哈希编码输入第三鉴别器,得到第三鉴别结果;将所述第三鉴别结果确定为第三训练结果;基于所述第三训练结果训练所述第三鉴别器和第一编码器,其中,所述第一初始神经网络模型包括所述第一编码器;使用所述符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和所述第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
可选地,所述装置还设置为:在将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果之前,使用符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
根据本发明的又一个实施例,还提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,其中,所述计算机程序被处理器执行时实现上述任一项方法实施 例中的步骤。
根据本发明的又一个实施例,还提供了一种电子装置,包括存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述任一项方法实施例中的步骤。
通过本发明,采用获取第一模态的查询数据,分别确定所述第一模态的查询数据与第二模态的检索数据集合中每个第二模态的检索数据之间的目标参数,以得到多个目标参数,根据多个目标参数将一个或多个第二模态的检索数据确定为与第一模态的查询数据对应的目标数据,利用类别标记数据作为桥梁,将第一模态和第二模态有效地关联起来,进而可以缓解不同模态之间的语义鸿沟,能够解决相关技术中存在的难以有效地实现跨模态的数据处理,用于进行跨模态数据处理的方法的性能较差的技术问题,达到提高跨模态数据处理的效率,优化跨模态的数据处理性能的技术效果。
附图说明
此处所说明的附图用来提供对本发明的进一步理解,构成本申请的一部分,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。在附图中:
图1是根据本发明实施例的一种可选的跨模态的数据处理方法的移动终端的硬件结构框图;
图2是根据本发明实施例的一种可选的跨模态的数据处理方法的流程示意图;
图3是根据本发明实施例的一种可选的跨模态的数据处理方法的示意图;
图4是根据本发明实施例的一种可选的跨模态的数据处理装置的结构框图。
具体实施方式
下文中将参考附图并结合实施例来详细说明本发明的实施例。
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。
本申请实施例中所提供的方法实施例可以在移动终端、计算机终端或者类似的运算装置中执行。以运行在移动终端上为例,图1是本发明实施例的一种跨模态的数据处理方法的移动终端的硬件结构框图。如图1所示,移动终端可以包括一个或多个(图1中仅示出一个)处理器102(处理器102可以包括但不限于微处理器MCU或可编程逻辑器件FPGA等的处理装置)和设置为存储数据的存储器104,其中,上述移动终端还可以包括设置为通信功能的传输设备106以及输入输出设备108。本领域普通技术人员可以理解,图1所示的结构仅为示意,其并不对上述移动终端的结构造成限定。例如,移动终端还可包括比图1中所示更多或者更少的组件,或者具有与图1所示不同的配置。
存储器104可设置为存储计算机程序,例如,应用软件的软件程序以及模块,如本发明实施例中的跨模态的数据处理方法对应的计算机程序,处理器102通过运行存储在存储器104内的计算机程序,从而执行各种功能应用以及数据处理,即实现上述的方法。存储器104可包括高速随机存储器,还可包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器104可进一步包括相对于处理器102远程设置的存储器,这些远程存储器可以通过网络连接至移动终端。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
传输装置106设置为经由一个网络接收或者发送数据。上述的网络具体实例可包括移动终端的通信供应商提供的无线网络。在一个实例中,传输装置106包括一个网络适配器(Network Interface Controller,简称为NIC),其可通过基站与其他网络设备相连从而可与互联网进行通讯。在一个实例中,传输装置106可以为射频(Radio Frequency,简称为RF)模块,其设置为通过无线方式与互联网进行通讯。
在本实施例中提供了一种运行于移动终端、计算机终端或者类似的运算装置的跨模态的数据处理方法,图2是根据本发明实施例的一种可选的跨模态的数据处理方法的流程示意图,如图2所示,该流程包括如下步骤:
根据本发明的一个实施例,提供了一种跨模态的数据处理方法,包括:
S202,获取第一模态的查询数据;
S204,分别确定所述第一模态的查询数据与第二模态的检索数据集合中每个第二模态的检索数据之间的目标参数,以得到多个目标参数,其中,所述第二模态的检索数据集合中包含多个所述第二模态的检索数据,所述第二模态的检索数据为将第二模态的原始数据输入目标神经网络模型后得到的数据,所述目标参数用于指示所述第一模态的查询数据与所述第二模态的检索数据的相似性,所述目标神经网络模型是使用一组样本对对初始神经网络模型进行训练得到的神经网络模型,所述目标神经网络模型包括编码器和鉴别器,所述编码器包括样本编码器和类别标记编码器,每个所述样本对包括样本数据以及类别标记数据,使得所述样本数据输入所述样本编码器所得到的数据和所述类别标记数据输入类别标记编码器所得到的数据无法被所述鉴别器区分开;
S206,根据所述多个目标参数将一个或多个所述第二模态的检索数据确定为与所述第一模态的查询数据对应的目标数据。
可选地,在本实施例中,上述第一模态可以包括但不限于图像、文字、语音、视频、动作捕捉等。上述第二模态可以包括但不限于图像、文字、语音、视频、动作捕捉等,上述第一模态和第二模态为不同的模态,例如,上述第一模态为图像,上述第二模态为文字,或者,上述第一模态为拍摄的图像,上述第二模态为动作捕捉后,模拟生成的图像等。
可选地,在本实施例中,上述第一模态的查询数据可以包括但不限于对第一模态获取到的数据进行特征提取后得到的向量,还可以包括但不限于对第一模态获取到的数据进行特征提取后得到的向量所生成的哈希编码。
可选地,在本实施例中,上述第二模态的检索数据可以包括但不限于对第二模态获取到的数据进行特征提取后得到的向量,还可以包括但不限于对第二模态获取到的数据进行特征提取后得到的向量所生成的哈希编码,上述第二模态的检索数据集合是由多个预先确定的第二模态的检索数据所组成的集合。
可选地,在本实施例中,上述目标参数可以包括但不限于上述第一模态的查询数据所对应的哈希编码与上述第二模态的检索数据所对应的哈希编码之间的汉明距离,上述相似性可以包括但不限于通过比较汉明距离的大小来进行表示,上述汉明距离与上述相似性呈负相关,也即,在汉明距离越小的情况下,上述第一模态的查询数据和第二模态的检索数据越相似。
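The Hamming-distance target parameter described above can be sketched in a few lines of Python. This is an illustrative example, not code from the patent; the ±1 hash codes are made up:

```python
# Minimal sketch: Hamming distance between two hash codes, used as the
# "target parameter". The smaller the distance, the more similar the
# query data and the retrieval data (negative correlation).

def hamming_distance(code_a, code_b):
    """Count positions where two equal-length hash codes differ."""
    assert len(code_a) == len(code_b)
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

query_code = [1, -1, 1, 1, -1, -1, 1, -1]      # hash code of the first-modality query
retrieval_code = [1, -1, -1, 1, -1, 1, 1, -1]  # hash code of one second-modality item

dist = hamming_distance(query_code, retrieval_code)
print(dist)  # -> 2
```

A distance of 0 means identical codes (most similar); the maximum distance equals the code length.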
可选地,在本实施例中,上述目标神经网络模型可以包括但不限于一个或多个生成式对抗网络模型、一个或多个卷积神经网络模型、一个或多个多尺度融合模型,可以包括但不限于上述的一种或者多种的组合。
可选地,在本实施例中,上述类别标记编码器可以包括但不限于对已标注的数据进行特征提取,将对应的标记信息作为特征向量进行编解码,上述类别标记可以包括但不限于进行分类过程中对应的类别标记。
Optionally, in this embodiment, taking the first modality as the image modality and the second modality as the text modality, the above set of sample pairs may be as follows: let (V, T) denote the image-text data pairs of n objects in the image modality and the text modality (corresponding to the aforementioned set of sample pairs), where V = {v_i}_{i=1}^n is the set of pixel feature vectors of the n objects, v_i is the pixel feature vector of the i-th object in the image modality, T = {t_i}_{i=1}^n is the set of bag-of-words vectors of these n objects, and t_i is the bag-of-words vector of the i-th object. Let the class label vectors of the n objects be L = {l_i}_{i=1}^n, where l_i = [l_{i1}, l_{i2}, ..., l_{ic}]^T (i = 1, 2, ..., n) is the label of the i-th object, c is the number of object classes, and (·)^T denotes the transpose operation. For the vector l_i, l_{ik} = 1 if the i-th object belongs to the k-th class, and l_{ik} = 0 otherwise. A semantic similarity matrix S = [s_{ij}] represents the degree of similarity between two objects: s_{ij} = 1 if the i-th object and the j-th object are semantically similar, and s_{ij} = 0 otherwise. With this data, the target neural network model is trained and the above retrieval data set of the second modality is obtained.
通过本实施例,采用获取第一模态的查询数据,分别确定所述第一模态的查询数据与第二模态的检索数据集合中每个第二模态的检索数据之间的目标参数,以得到多个目标参数,根据多个目标参数将一个或多个第二模态的检索数据确定为与第一模态的查询数据对应的目标数据,利用类别标记数据作为桥梁,将第一模态和第二模态有效地关联起来,进而可以缓 解不同模态之间的语义鸿沟,能够解决相关技术中存在的难以有效地实现跨模态的数据处理,用于进行跨模态数据处理的方法的性能较差的技术问题,达到提高跨模态数据处理的效率,优化跨模态的数据处理性能的技术效果。
在一个可选的实施例中,在获取第一模态的查询数据之前,所述方法还包括:重复执行以下步骤,直到为所述鉴别器所配置的目标函数的取值最小:获取第一模态的第一训练数据和第二模态的第二训练数据以及类别标记数据;将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果;基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型。
可选地,在本实施例中,上述目标函数可以包括但不限于第一初始神经网络模型的第一目标函数,第一目标函数中包含有一个或多个第一预设参数,第二初始神经网络模型的第二目标函数,第二目标函数中包含有一个或多个第二预设参数,换言之,对于第一神经网络模型的训练,在第一预设参数的情况下,第一目标函数取值最小时表示训练完成,对于第二神经网络模型的训练,在第二预设参数的情况下,第二目标函数取值最小时表示训练完成。
可选地,在本实施例中,以第一模态为图像模态、第二模态为文本模态为例,上述第一模态的第一训练数据和第二模态的第二训练数据以及类别标记数据输入第一初始神经网络模型以及第二初始神经网络模型可以包括如下内容:
例如,神经网络LabNet,它是输入数据为类别标记数据的深度神经网络。LabNet由一个自编码器构成,为了方便起见,这里将该自编码器记为LabNet Auto,并将LabNet Auto的编码层的输出特征记为F (l)=f (l)(L;θ (l)),其中,θ (l)为深度神经网络LabNet的参数。F (l)可以看作由LabNet Auto学习得到的语义特征。利用LabNet Auto的编码层的输出特征F (l)作为监督信息,引导ImgNet和TxtNet更好地进行训练,从而实现缩小图像模态和文本模态之间的语义鸿沟,并使图像模态和文本模态更好地从语义上关联起来。为了达到上述目的,LabNet Auto需要经过良好的训练,为此,可以包括但不限于采用如下所示的目标函数训练LabNet Auto
Figure PCTCN2021091214-appb-000005
其中,
Figure PCTCN2021091214-appb-000006
为与标记向量l i相对应的LabNet Auto的编码层的输出向量,α (l)为超参数,B (l)为哈希编码。公式(1)中的
Figure PCTCN2021091214-appb-000007
为负对数似然函数,且似然函数的定义如下:
Figure PCTCN2021091214-appb-000008
其中,
Figure PCTCN2021091214-appb-000009
Figure PCTCN2021091214-appb-000010
用于保持F (l)中不同特征向量间的相似性。
Figure PCTCN2021091214-appb-000011
为用于控制哈希编码B (l)的量化误差的目标函数项。
为了将LabNet Auto学习得到的语义特征F (l)用于监督图像模态和文本模态的特征学习过程,通过如下目标函数实现:
Figure PCTCN2021091214-appb-000012
Figure PCTCN2021091214-appb-000013
其中,
Figure PCTCN2021091214-appb-000014
α (v)和α (t)为超参数,B (v)和B (t)分别为图像模态和文本模态的哈希编码。最小化公式(3)和公式(4)中的两个负对数似然函数
Figure PCTCN2021091214-appb-000015
Figure PCTCN2021091214-appb-000016
等价于最大化它们相应的似然函数。当s ij=1时,最小化
Figure PCTCN2021091214-appb-000017
可以使得
Figure PCTCN2021091214-appb-000018
Figure PCTCN2021091214-appb-000019
之间的相似度变大,与此相反,当s ij=0时,最小化
Figure PCTCN2021091214-appb-000020
可以使得
Figure PCTCN2021091214-appb-000021
Figure PCTCN2021091214-appb-000022
之间的相似度变小。对
Figure PCTCN2021091214-appb-000023
进行最小化优化也可以实现类似的目标。
因此,对
Figure PCTCN2021091214-appb-000024
Figure PCTCN2021091214-appb-000025
进行最小化,可以实现以语义特征F (l)为桥梁将图像模态和文本模态有效地关联起来,进而可以缓解不同模态之间的语义鸿沟。本发明将衡量成对数据之间关系的损失函数
Figure PCTCN2021091214-appb-000026
Figure PCTCN2021091214-appb-000027
分别称为成对损失。
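As an illustration of the pairwise loss idea above, the following sketch computes a negative log-likelihood of the similarity matrix S from inner products of feature vectors. The inner-product form Θ_ij = ½⟨f_i, g_j⟩ is an assumption based on the standard cross-modal hashing formulation, not text taken verbatim from the patent:

```python
import math

def pairwise_nll(F, G, S):
    """Pairwise negative log-likelihood (sketch).

    F, G: lists of feature vectors from two branches (e.g. label net and
    image net); S: 0/1 similarity matrix. Minimising this loss pushes the
    inner product up when s_ij = 1 and down when s_ij = 0.
    """
    loss = 0.0
    for i, f in enumerate(F):
        for j, g in enumerate(G):
            theta = 0.5 * sum(a * b for a, b in zip(f, g))  # Theta_ij (assumed form)
            # -log( sigmoid(theta)^s * (1 - sigmoid(theta))^(1-s) )
            loss += math.log(1.0 + math.exp(theta)) - S[i][j] * theta
    return loss
```

For a single pair with Θ = 1, the loss is smaller when the pair is labelled similar (s = 1) than when it is labelled dissimilar (s = 0), which is the intended behaviour.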
在一个可选的实施例中,将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果,包括:将所述第一训练数据输入第一编码器,得到第一目标数据,将所述第二训练数据输入第二编码器,得到第二目标数据;将所述类别标记数据输入标记编码器,得到标签数据;将所述第一目标数据和所述标签数据输入第一鉴别器,得到第一鉴别结果,将所述第二目标数据和所述标签数据输入第二鉴别器,得到第二鉴别结果;将所述第一鉴别结果确定为所述第一训练结果,并将所述第二鉴别结果确定为所述第二训练结果。
可选地,在本实施例中,以第一模态为图像模态为例,上述第一编码器可以包括但不限于在图像模态首先使用卷积神经网络进行高层语义特征学习,为了方便起见,这里将所使用的卷积神经网络记为ImgNet CNN并将ImgNet CNN的输出特征记为G (v)=g (v)(V;θ (v))。进一步,用
Figure PCTCN2021091214-appb-000028
表示G (v)中的第i个向量,且该向量对应于v i。本发明中的图像模态的深度神经网络还包含一个图像自编码器(Image Autoencoder),用于进一步挖掘图像模态数据中所蕴 含的高层语义信息。为了描述方便,这里将这个图像自编码器表示为ImgNet Auto,并将ImgNet Auto的编码层的输出特征和ImgNet Auto的输出特征分别记为F (v)=f (v)(V;θ (v))和Q (v)=q (v)(V;θ (v)),其中,θ (v)表示图像模态的深度神经网络ImgNet的参数。进一步,将F (v)和Q (v)中的第i个向量分别表示为
Figure PCTCN2021091214-appb-000029
Figure PCTCN2021091214-appb-000030
可选地,在本实施例中,以第二模态为文本模态为例,对于文本模态,为了缓解词袋向量的稀疏性对高层语义信息的挖掘带来的不利影响,在本发明中,首先使用由多个均值池化层和1×1的卷积层构成的多尺度融合模型对词袋向量进行处理。为了方便起见,将这个多尺度融合模型记为TxtNet MSF。这个多尺度融合模型TxtNet MSF有利于发现不同词之间的关系,进而有利于挖掘文本模态数据中所蕴含的高层语义信息。为了更好地挖掘文本模态数据中的高层语义信息,在文本模态的深度神经网络TxtNet中还包含一个文本自编码器(Text Autoencoder),这里将这个文本自编码器记为TxtNet Auto,并将TxtNet Auto的编码层的输出特征和TxtNet Auto的输出特征分别记为F (t)=f (t)(T;θ (t))和Q (t)=q (t)(T;θ (t)),其中,θ (t)表示文本模态的深度神经网络TxtNet的参数。进一步,分别将F (t)和Q (t)中的第i个向量表示为
Figure PCTCN2021091214-appb-000031
Figure PCTCN2021091214-appb-000032
可选地,在本实施例中,以第一模态为图像模态、第二模态为文本模态为例,为了进一步缩小图像模态与文本模态之间的语义鸿沟,本发明将对抗学习策略应用于特征F (l)、F (v)和F (t)的学习过程。为此,本发明设计两个“模态间鉴别器”来完成对抗学习策略在不同模态之间的鉴别任务,这两个鉴别器分别是:标记-图像鉴别器D L-I(对应于前述的第一鉴别器)和标记-文本鉴别器D L-T(对应于前述的第二鉴别器)。
对于标记-图像鉴别器D L-I来说,它的输入数据为LabNet Auto的输出特征F (l)和ImgNet Auto的输出特征F (v)。假设
Figure PCTCN2021091214-appb-000033
表示指定给特征向量
Figure PCTCN2021091214-appb-000034
的标签,
Figure PCTCN2021091214-appb-000035
表示指定给特征向量
Figure PCTCN2021091214-appb-000036
的标签,其中,i=1,2,...,n。鉴别器D L-I旨在尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000037
与“虚假数据”
Figure PCTCN2021091214-appb-000038
区分开来。
因此,可以用“0”和“1”分别表示鉴别器D L-I的两种可能的输出,具体来说,用“1”表示鉴别器D L-I进行了正确的区分,用“0”表示鉴别器D L-I进行了错误的区分。
综合以上分析,针对鉴别器D L-I可以设计如下的目标函数:
Figure PCTCN2021091214-appb-000039
其中,
Figure PCTCN2021091214-appb-000040
表示鉴别器D L-I的参数,D L-I(·)表示鉴别器D L-I的输出。
鉴别器D L-T的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000041
与“虚假数据”
Figure PCTCN2021091214-appb-000042
区分开来,其中,i=1,2,...,n。类似于鉴别器D L-I
因此,设计如下的目标函数实现鉴别器D L-T所要达到的目标:
Figure PCTCN2021091214-appb-000043
其中,
Figure PCTCN2021091214-appb-000044
表示鉴别器D L-T的参数,D L-T(·)鉴别器D L-T的输出,
Figure PCTCN2021091214-appb-000045
表示指定给特征向量
Figure PCTCN2021091214-appb-000046
的标签。
通过本实施例,对
Figure PCTCN2021091214-appb-000047
Figure PCTCN2021091214-appb-000048
进行最小化,可以实现以语义特征F (l)为桥梁将图像模态和文本模态有效地关联起来,进而可以缓解不同模态之间的语义鸿沟,能够解决相关技术中存在的难以有效地实现跨模态的数据处理,用于进行跨模态数据处理的方法的性能较差的技术问题,达到提高跨模态数据处理的效率,优化跨模态的数据处理性能的技术效果。
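A minimal sketch of the inter-modality adversarial game described above, under the assumption that the discriminator objective can be written as ordinary binary cross-entropy: the discriminator is pushed to output 1 for "real" label-encoder features and 0 for "fake" image-encoder features, while the encoder's loss rewards fooling it. The one-layer discriminator and all feature values are illustrative only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(prediction, target):
    """Binary cross-entropy for one probability/label pair."""
    eps = 1e-12
    return -(target * math.log(prediction + eps)
             + (1 - target) * math.log(1 - prediction + eps))

def discriminator(x, w, b):
    """Toy one-layer 'discriminator': weighted sum + sigmoid -> P(real)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [0.5, -0.3], 0.1
real_feature = [1.2, 0.8]   # from the label encoder (target label 1, "real")
fake_feature = [0.3, 1.5]   # from the image encoder (target label 0, "fake")

# Discriminator loss: classify real as 1 and fake as 0.
d_loss = (bce(discriminator(real_feature, w, b), 1)
          + bce(discriminator(fake_feature, w, b), 0))
# Encoder loss: make the fake feature be classified as "real".
g_loss = bce(discriminator(fake_feature, w, b), 1)
```

Training alternates between lowering `d_loss` (discriminator step) and lowering `g_loss` (encoder step), the minimax game the patent refers to.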
在一个可选的实施例中,所述方法还包括:基于所述第一训练数据以及第二训练数据生成三元组集,其中,所述三元组集中的每个三元组包括被选为锚点的第一训练数据、与所述第一训练数据具有相同标记的第二训练数据以及与所述第一训练数据具有不同标记的第二训练数据;通过目标函数最小化所述被选为锚点的第一训练数据与所述第一训练数据具有相同标记的第二训练数据之间的欧氏距离;通过目标函数最大化所述被选为锚点的第一训练数据与所述第一训练数据具有不同标记的第二训练数据之间的欧氏距离;得到约束后的所述第一训练数据和约束后的所述第二训练数据。
可选地,在本实施例中,以上述第一模态为图像模态,第二模态为文本模态为例,可以包括如下内容:
例如,在缩小不同模态中语义上相同的对象的差异时,增大每个模态中语义上不同的对象的距离,有利于保持模态内对象之间的语义关系并增强模态间的语义关联。为此,本发明将三元组约束应用到图像模态和文本模态的特征学习过程。具体做法为:首先构建形式为
Figure PCTCN2021091214-appb-000049
的三元组集,其中,v i是被选为锚点的图像特征向量,
Figure PCTCN2021091214-appb-000050
为来自于文本模态且与v i具有相同标记的文本向量,
Figure PCTCN2021091214-appb-000051
为来自于文本模态且与v i具有不同标记的文本向量。将由v i
Figure PCTCN2021091214-appb-000052
联合起来构成的图像-文本对
Figure PCTCN2021091214-appb-000053
称为正图像-文本对,类似地,将由v i
Figure PCTCN2021091214-appb-000054
联合起来构成的图像-文本对
Figure PCTCN2021091214-appb-000055
称为负图像-文本对。当将t i作为锚点时,可以构造形如
Figure PCTCN2021091214-appb-000056
的三元组集。进一步,可以构造正文本-图像对
Figure PCTCN2021091214-appb-000057
和负文本-图像对
Figure PCTCN2021091214-appb-000058
对于以图像模态的样本为锚点一个三元组来说,三元组约束旨在通过三元组损失函数最小化锚点和正文本样本之间距离并同时最大化锚点与负文本样本之间的距离。也就是说,对于三元组
Figure PCTCN2021091214-appb-000059
三元组损失函数定义为:
Figure PCTCN2021091214-appb-000060
其中,
Figure PCTCN2021091214-appb-000061
Figure PCTCN2021091214-appb-000062
Figure PCTCN2021091214-appb-000063
之间的欧氏距离,
Figure PCTCN2021091214-appb-000064
Figure PCTCN2021091214-appb-000065
Figure PCTCN2021091214-appb-000066
之间的欧氏距离。因此,图像模态所有三元组的三元组损失函数为:
Figure PCTCN2021091214-appb-000067
类似地,文本模态所有三元组的三元组损失函数为:
Figure PCTCN2021091214-appb-000068
因此,基于三元组损失函数的目标函数设计为:
Figure PCTCN2021091214-appb-000069
根据上述内容可以看出,通过使用三元组约束可以使图像模态数据和文本模态数据的语义分布相互适应,进而不同模态之间的语义鸿沟可以得到消减。此外,通过使用三元组约束还可以使图像模态特有的信息和文本模态特有的信息得以保持。
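The triplet constraint above can be sketched with a standard margin-based triplet loss: minimise the anchor-positive Euclidean distance while maximising the anchor-negative distance. The margin value and the toy feature vectors are assumptions for illustration:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor   = [1.0, 0.0]   # e.g. an image feature chosen as anchor
positive = [0.9, 0.1]   # text feature with the same label
negative = [0.0, 1.0]   # text feature with a different label
```

With these values the positive already sits well inside the margin, so the loss is zero; swapping the positive and negative roles produces a positive loss that would drive the features apart.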
在一个可选的实施例中,在将所述第一训练数据以及所述类别标记数据输入待训练的第一初始神经网络模型,得到第一训练结果,并将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果之前,所述方法还包括:使用符号函数处理所述第一训练数据,得到第一组哈希编码;将所述第一组哈希编码输入第三鉴别器,得到第三鉴别结果;将所述第三鉴别结果确定为第三训练结果;基于所述第三训练结果训练所述第三鉴别器和第一编码器,其中,所述第一初始神经网络模型包括所述第一编码器;使用所述符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和所述第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
可选地,在本实施例中,通过公式(3)和公式(4)可知,在图像模态和文本模态生成哈希编码,需要将ImgNet Auto和TxtNet Auto的编码层特征F (v)和F (t)分别使用符号函数进行处理,进而得到哈希编码。为了使生成的哈希编码中保留尽可能多的鉴别信息,亦即使学习得到的编码层特征F (v)和F (t)中保留尽可能多的鉴别信息,可以通过设法保证ImgNet Auto和TxtNet Auto得到有效训练来实现。为此,本发明将对抗学习策略引入到图像模态和文本模态的深度神经网络训练过程中。本发明设计两个“模态内鉴别器”分别完成对抗学习策略在每个模态内部的鉴别任务,这两个鉴别器分别是:图像模态鉴别器D I(对应于前述的第三鉴别器)和文本模态鉴别器D T(对应于前述的第四鉴别器)。
对于鉴别器D I来说,它的输入数据为ImgNet CNN的输出特征G (v)和ImgNet Auto的输出特征Q (v)。假设
Figure PCTCN2021091214-appb-000070
表示指定给特征向量
Figure PCTCN2021091214-appb-000071
的标签,
Figure PCTCN2021091214-appb-000072
表示指定给特征向量
Figure PCTCN2021091214-appb-000073
的标签,其中,i=1,2,...,n。鉴别器D I的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000074
与它相应的重构数据
Figure PCTCN2021091214-appb-000075
区分开来。因此,可以用“0”和“1”分别表示鉴别器D I的两种可能的输出,具体来说,用“1”表示鉴别器D I进行了正确的区分,用“0”表示鉴别器D I进行了错误的区分。综合以上分析,针对鉴别器D I可以设计如下的目标函数:
Figure PCTCN2021091214-appb-000076
其中,
Figure PCTCN2021091214-appb-000077
表示鉴别器D I的参数,D I(·)表示鉴别器D I的输出。
鉴别器D T的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000078
与它相应的重构数据
Figure PCTCN2021091214-appb-000079
区分开来,其中,i=1,2,...,n。类似于鉴别器D I,设计如下的目标函数实现鉴别器D T所要达到的目标:
Figure PCTCN2021091214-appb-000080
其中,
Figure PCTCN2021091214-appb-000081
表示鉴别器D T的参数,D T(·)鉴别器D T的输出,
Figure PCTCN2021091214-appb-000082
表示指定给特征向量
Figure PCTCN2021091214-appb-000083
的标签,
Figure PCTCN2021091214-appb-000084
表示指定给特征向量
Figure PCTCN2021091214-appb-000085
的标签。
在一个可选的实施例中,在将所述第二训练数据以及所述类别标记数据输入待训练的第二初始神经网络模型,得到第二训练结果之前,所述方法还包括:使用符号函数处理所述第二训练数据,得到第二组哈希编码;将所述第二组哈希编码输入第四鉴别器,得到第四鉴别结果;将所述第四鉴别结果确定为第四训练结果;基于所述第四训练结果训练所述第四鉴别器和第二编码器,其中,所述第二初始神经网络模型包括所述第二编码器。
可选地,在本实施例中,假设图像模态的一个查询样本的特征向量为
Figure PCTCN2021091214-appb-000086
文本模态的一个查询样本的特征向量为
Figure PCTCN2021091214-appb-000087
图像模态检索样本集中样本的特征向量集为
Figure PCTCN2021091214-appb-000088
文本模态检索样本集中样本的特征向量集为
Figure PCTCN2021091214-appb-000089
其中,
Figure PCTCN2021091214-appb-000090
表示检索样本集中样本的数量。图像模态和文本模态查询样本和检索样本集中样本的哈希编码分别为:
Figure PCTCN2021091214-appb-000091
Figure PCTCN2021091214-appb-000092
其中,θ (v)和θ (t)分别为求解得到的图像模态和文本模态的深度神经网络参数,
Figure PCTCN2021091214-appb-000093
sign(·)为符号函数。
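Generating hash codes with the sign function, as described above, can be sketched as follows; the real-valued feature values are invented for illustration, and mapping sign(0) to +1 is an assumption:

```python
# Sketch: turn a learned real-valued feature vector into a +1/-1 hash code
# by applying the sign function element-wise.

def sign(x):
    return 1 if x >= 0 else -1   # treating sign(0) as +1 is a convention, not from the patent

def to_hash_code(feature):
    return [sign(x) for x in feature]

feature = [0.7, -1.2, 0.05, -0.4]   # illustrative encoder-layer output
code = to_hash_code(feature)        # -> [1, -1, 1, -1]
```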
在一个可选的实施例中,基于所述第一训练结果以及所述第二训练结果,调整所述目标神经网络模型的预设参数,以得到所述目标神经网络模型,包括以下至少之一:基于所述第一训练结果和所述第二训练结果使用后向传播算法确定所述目标神经网络模型的参数;基于所述第一训练结果和所述第二训练结果使用随机梯度下降算法确定所述目标神经网络模型的参数。
可选地,在本实施例中,在为图像模态和文本模态学习深度特征表示时,目标函数公式中包含的未知变量有
Figure PCTCN2021091214-appb-000094
θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))。这些未知变量可以通过联合优化公式(12)和公式(13)所示的生成损失函数和对抗损失函数来得到解。
Figure PCTCN2021091214-appb-000095
Figure PCTCN2021091214-appb-000096
因为公式(12)和公式(13)的优化目标是相反的,本发明采用“极大极小博弈(Minimax Game)”方案对公式(14)进行优化来求解未知变量。
Figure PCTCN2021091214-appb-000097
因为B (l),B (v)和B (t)都是离散变量,并且“极大极小”损失函数容易引起梯度消失问题,因此,公式(14)的优化问题是非常棘手的优化问题。为了解决这个问题,本发明采用迭代优化方案来优化公式(14)。首先通过优化
Figure PCTCN2021091214-appb-000098
来求解θ (l)和B (l),然后固定θ (l)和B (l)通过优化
Figure PCTCN2021091214-appb-000099
来求解θ (v)和B (v),类似地,固定θ (l)和B (l)通过优化
Figure PCTCN2021091214-appb-000100
来求解θ (t)和B (t)。不难看出,在上述求解θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))的过程中,图像模态和文本模态的特征表示可以在标签信息的监督下学习得到。将求解得到的θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))固定,通过分别优化
Figure PCTCN2021091214-appb-000101
Figure PCTCN2021091214-appb-000102
可以求解得到
Figure PCTCN2021091214-appb-000103
Figure PCTCN2021091214-appb-000104
本发明采用后向传播算法以及随机梯度下降完成网络参数的学习。
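A toy sketch of the alternating optimisation scheme described above: one group of variables is held fixed while the other takes gradient steps, then the roles are swapped. The quadratic objective J(x, y) = (x - y)^2 + 0.1x^2 + 0.1(y - 3)^2 is a stand-in for the patent's loss functions, not the actual objective:

```python
# Alternating gradient descent on J(x, y): update x with y fixed, then
# update y with x fixed, and repeat. The same pattern underlies solving
# theta^(l), B^(l) first and then theta^(v), B^(v) and theta^(t), B^(t).

def grad_x(x, y):
    # dJ/dx with y held fixed
    return 2.0 * (x - y) + 0.2 * x

def grad_y(x, y):
    # dJ/dy with x held fixed
    return -2.0 * (x - y) + 0.2 * (y - 3.0)

x, y, lr = 0.0, 0.0, 0.1
for _ in range(500):
    x -= lr * grad_x(x, y)   # step on x while y is fixed
    y -= lr * grad_y(x, y)   # step on y while x is fixed
# (x, y) approaches the joint minimiser (10/7, 11/7)
```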
下面结合具体的示例,对本申请进行进一步地说明:
图3是根据本发明实施例的一种可选的跨模态的数据处理方法的示意图,如图3所示,具体实施过程主要包括以下步骤:假设(V,T)表示n个对象在图像模态和文本模态的图像-文本数据对,其中,
Figure PCTCN2021091214-appb-000105
为n个对象的像素特征向量集,v i表示第i个对象在图像模态的像素特征向量,
Figure PCTCN2021091214-appb-000106
为这n个对象的词袋向量集,其中,t i表示第i个对象的词袋向量。假设n个对象的类别标记向量为
Figure PCTCN2021091214-appb-000107
l i=[l i1,l i2,...,l ic] T(i=1,2,...,n)表示第i个对象的标签,其中,c表示对象类别的数量,(·) T表示转置运算。对于向量l i来说,如果第i个对象属于第k类,则l ik=1,否则,l ik=0。使用语义相似矩阵
Figure PCTCN2021091214-appb-000108
,来表示两个对象之间的相似程度,如果第i个对象与第j个对象在语义上相似,则s ij=1,否则,s ij=0。
(1)基于卷积神经网络和自编码器构建多模态混合深度神经网络
利用图像模态和文本模态的特征数据,以及对象的类别标记信息学习图像模态和文本模态的哈希函数,并利用学习得到的哈希函数生成用于完成跨模态哈希检索的哈希编码。对于本发明的跨模态检索方案,在图像模态首先使用卷积神经网络进行高层语义特征学习,为了方便起见,这里将所使用的卷积神经网络记为ImgNet CNN302并将ImgNet CNN的输出特征记为G (v)=g (v)(V;θ (v))。进一步,用
Figure PCTCN2021091214-appb-000109
表示G (v)中的第i个向量,且该向量对应于v i。本发明中的图像模态的深度神经网络还包含一个图像自编码器(Image Autoencoder)304,用于进一步挖掘图像模态数据中所蕴含的高层语义信息。为了描述方便,这里将这个图像自编码器表示为ImgNet Auto,并将ImgNet Auto的编码层的输出特征和ImgNet Auto的输出特征分别记为F (v)=f (v)(V;θ (v))和Q (v)=q (v)(V;θ (v)),其中,θ (v)表示图像模态的深度神经网络ImgNet306的参数。进一步,将F (v)和Q (v)中的第i个向量分别表示为
Figure PCTCN2021091214-appb-000110
Figure PCTCN2021091214-appb-000111
对于文本模态,为了缓解词袋向量的稀疏性对高层语义信息的挖掘带来的不利影响,在本发明中,首先使用由多个均值池化层和1×1的卷积层构成的多尺度融合模型308对词袋向量进行处理。为了方便起见,将这个多尺度融合模型记为TxtNet MSF。这个多尺度融合模型TxtNet MSF有利于发现不同词之间的关系,进而有利于挖掘文本模态数据中所蕴含的高层语义信息。为了更好地挖掘文本模态数据中的高层语义信息,在文本模态的深度神经网络TxtNet 310中还包含一个文本自编码器(Text Autoencoder)312,这里将这个文本自编码器记为TxtNet Auto,并将
Figure PCTCN2021091214-appb-000112
的编码层的输出特征和TxtNet Auto的输出特征分别记为F (t)=f (t)(T;θ (t))和Q (t)=q (t)(T;θ (t)),其中,θ (t)表示文本模态的深度神经网络TxtNet的参数。进一步,分别将F (t)和Q (t)中的第i个向量表示为
Figure PCTCN2021091214-appb-000113
Figure PCTCN2021091214-appb-000114
(2)基于模态间对抗学习和三元组约束构建提升深度学习特征鉴别性能的模型
本发明方法还包含一个神经网络LabNet 314,它是输入数据为类别标记数据的深度神经网络。LabNet由一个自编码器构成,为了方便起见,这里将该自编码器记为LabNet Auto316,并将LabNet Auto的编码层的输出特征记为F (l)=f (l)(L;θ (l)),其中,θ (l)为深度神经网络LabNet的参数。F (l)可以看作由LabNet Auto学习得到的语义特征。本发明利用LabNet Auto的编码层的输出特征F (l)作为监督信息,引导ImgNet和TxtNet更好地进行训练,从而实现缩小图像模态和文本模态之间的语义鸿沟,并使图像模态和文本模态更好地从语义上关联起来。为了达到上述目的,LabNet Auto需要经过良好的训练,为此,本发明采用如下所示的目标函数训练LabNet Auto
Figure PCTCN2021091214-appb-000115
其中,
Figure PCTCN2021091214-appb-000116
为与标记向量l i相对应的LabNet Auto的编码层的输出向量,α (l)为超参数,B (l)为哈希编码。公式(1)中的
Figure PCTCN2021091214-appb-000117
为负对数似然函数,且似然函数的定义如下:
Figure PCTCN2021091214-appb-000118
其中,
Figure PCTCN2021091214-appb-000119
Figure PCTCN2021091214-appb-000120
用于保持F (l)中不同特征向量间的相似性。
Figure PCTCN2021091214-appb-000121
为用于控制哈希编码B (l)的量化误差的目标函数项。
为了将LabNet Auto学习得到的语义特征F (l)用于监督图像模态和文本模态的特征学习过程,本发明设计如下的目标:
Figure PCTCN2021091214-appb-000122
Figure PCTCN2021091214-appb-000123
其中,
Figure PCTCN2021091214-appb-000124
α (v)和α (t)为超参数,B (v)和B (t)分别为图像模态和文本模态的哈希编码。最小化公式(3)和公式(4)中的两个负对数似然函数
Figure PCTCN2021091214-appb-000125
Figure PCTCN2021091214-appb-000126
等价于最大化它们相应的似然函数。当s ij=1时,最小化
Figure PCTCN2021091214-appb-000127
可以使得
Figure PCTCN2021091214-appb-000128
Figure PCTCN2021091214-appb-000129
之间的相似度变大,与此相反,当s ij=0时,最小化
Figure PCTCN2021091214-appb-000130
可以使得
Figure PCTCN2021091214-appb-000131
Figure PCTCN2021091214-appb-000132
之间的相似度变小。对
Figure PCTCN2021091214-appb-000133
进行最小化优化也可以实现类似的目标。因此,对
Figure PCTCN2021091214-appb-000134
Figure PCTCN2021091214-appb-000135
进行最小化,可以实现以语义特征F (l)为桥梁将图像模态和文本模态有效地关联起来,进而可以缓解不同模态之间的语义鸿沟。本发明将衡量成对数据之间关系的损失函数
Figure PCTCN2021091214-appb-000136
Figure PCTCN2021091214-appb-000137
分别称为成对损失。
为了进一步缩小图像模态与文本模态之间的语义鸿沟,本发明将对抗学习策略应用于特征F (l)、F (v)和F (t)的学习过程。为此,本发明设计两个“模态间鉴别器”来完成对抗学习策略在不同模态之间的鉴别任务,这两个鉴别器分别是:标记-图像鉴别器D L-I318和标记-文本鉴别器D L-T320。
对于标记-图像鉴别器D L-I来说,它的输入数据为LabNet Auto的输出特征F (l)和ImgNet Auto的输出特征F (v)。假设
Figure PCTCN2021091214-appb-000138
表示指定给特征向量
Figure PCTCN2021091214-appb-000139
的标签,
Figure PCTCN2021091214-appb-000140
表示指定给特征向量
Figure PCTCN2021091214-appb-000141
的标签,其中,i=1,2,...,n。鉴别器D L-I旨在尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000142
与“虚假数据”
Figure PCTCN2021091214-appb-000143
区分开来。因此,可以用“0”和“1”分别表示鉴别器D L-I的两种可能的输出,具体来说,用“1”表示鉴别器D L-I进行了正确的区分,用“0”表示鉴别器D L-I进行了错误的区分。综合以上分析,针对鉴别器D L-I可以设计如下的目标函数:
Figure PCTCN2021091214-appb-000144
其中,
Figure PCTCN2021091214-appb-000145
表示鉴别器D L-I的参数,D L-I(·)表示鉴别器D L-I的输出。
鉴别器D L-T的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000146
与“虚假数据”
Figure PCTCN2021091214-appb-000147
区分开来,其中,i=1,2,...,n。类似于鉴别器D L-I,设计如下的目标函数实现鉴别器D L-T所要达到的目标:
Figure PCTCN2021091214-appb-000148
其中,
Figure PCTCN2021091214-appb-000149
表示鉴别器D L-T的参数,D L-T(·)鉴别器D L-T的输出,
Figure PCTCN2021091214-appb-000150
表示指定给特征向量
Figure PCTCN2021091214-appb-000151
的标签。
在缩小不同模态中语义上相同的对象的差异时,增大每个模态中语义上不同的对象的距离,有利于保持模态内对象之间的语义关系并增强模态间的语义关联。为此,本发明将三元组约束应用到图像模态和文本模态的特征学习过程。具体做法为:首先构建形式为
Figure PCTCN2021091214-appb-000152
的三元组集,其中,v i是被选为锚点的图像特征向量,
Figure PCTCN2021091214-appb-000153
为来自于文本模态且与v i具有相同标记的文本向量,
Figure PCTCN2021091214-appb-000154
为来自于文本模态且与v i具有不同标记的文本向量。将由v i
Figure PCTCN2021091214-appb-000155
联合起来构成的图像-文本对
Figure PCTCN2021091214-appb-000156
称为正图像-文本对,类似地,将由v i
Figure PCTCN2021091214-appb-000157
联合起来构成的图像-文本对
Figure PCTCN2021091214-appb-000158
称为负图像-文本对。当将t i作为锚点时,可以构造形如
Figure PCTCN2021091214-appb-000159
的三元组集。进一步,可以构造正文本-图像对
Figure PCTCN2021091214-appb-000160
和负文本-图像对
Figure PCTCN2021091214-appb-000161
对于以图像模态的样本为锚点一个三元组来说,三元组约束322旨在通过三元组损失函数最小化锚点和正文本样本之间距离并同时最大化锚点与负文本样本之间的距离。也就是说,对于三元组
Figure PCTCN2021091214-appb-000162
三元组损失函数定义为:
Figure PCTCN2021091214-appb-000163
其中,
Figure PCTCN2021091214-appb-000164
Figure PCTCN2021091214-appb-000165
Figure PCTCN2021091214-appb-000166
之间的欧氏距离,
Figure PCTCN2021091214-appb-000167
Figure PCTCN2021091214-appb-000168
Figure PCTCN2021091214-appb-000169
之间的欧氏距离。因此,图像模态所有三元组的三元组损失函数为:
Figure PCTCN2021091214-appb-000170
类似地,文本模态所有三元组的三元组损失函数为:
Figure PCTCN2021091214-appb-000171
因此,基于三元组损失函数的目标函数设计为:
Figure PCTCN2021091214-appb-000172
根据上述内容可以看出,通过使用三元组约束可以使图像模态数据和文本模态数据的语义分布相互适应,进而不同模态之间的语义鸿沟可以得到消减。此外,通过使用三元组约束还可以使图像模态特有的信息和文本模态特有的信息得以保持。
(3)基于模态内对抗学习构建提升哈希编码鉴别性能的模型
观察公式(3)和公式(4)可以发现,为了在图像模态和文本模态生成哈希编码,需要将ImgNet Auto和TxtNet Auto的编码层特征F (v)和F (t)分别使用符号函数进行处理,进而得到哈希编码。为了使生成的哈希编码中保留尽可能多的鉴别信息,亦即使学习得到的编码层特征F (v)和F (t)中保留尽可能多的鉴别信息,可以通过设法保证ImgNet Auto和TxtNet Auto得到有效训练来实现。为此,本发明将对抗学习策略引入到图像模态和文本模态的深度神经网络训练过程中。本发明设计两个“模态内鉴别器”分别完成对抗学习策略在每个模态内部的鉴别任务,这两个鉴别器分别是:图像模态鉴别器D I324和文本模态鉴别器D T326。
对于鉴别器D I来说,它的输入数据为ImgNet CNN的输出特征G (v)和ImgNet Auto的输出特征Q (v)。假设
Figure PCTCN2021091214-appb-000173
表示指定给特征向量
Figure PCTCN2021091214-appb-000174
的标签,
Figure PCTCN2021091214-appb-000175
表示指定给特征向量
Figure PCTCN2021091214-appb-000176
的标签,其中,i=1,2,...,n。鉴别器D I的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000177
与它相应的重构数据
Figure PCTCN2021091214-appb-000178
区分开来。因此,可以用“0”和“1”分别表示鉴别器D I的两种可能的输出,具体来说,用“1”表示鉴别器D I进行了正确的区分,用“0”表示鉴别器D I进行了错误的区分。综合以上分析,针对鉴别器D I可以设计如下的目标函数:
Figure PCTCN2021091214-appb-000179
其中,
Figure PCTCN2021091214-appb-000180
表示鉴别器D I的参数,D I(·)表示鉴别器D I的输出。
鉴别器D T的作用是尽可能地将“真实数据”
Figure PCTCN2021091214-appb-000181
与它相应的重构数据
Figure PCTCN2021091214-appb-000182
区分开来,其中, i=1,2,...,n。类似于鉴别器D I,设计如下的目标函数实现鉴别器D T所要达到的目标:
Figure PCTCN2021091214-appb-000183
其中,
Figure PCTCN2021091214-appb-000184
表示鉴别器D T的参数,D T(·)鉴别器D T的输出,
Figure PCTCN2021091214-appb-000185
表示指定给特征向量
Figure PCTCN2021091214-appb-000186
的标签,
Figure PCTCN2021091214-appb-000187
表示指定给特征向量
Figure PCTCN2021091214-appb-000188
的标签。
(4)所构建模型中未知变量的求解
在为图像模态和文本模态学习深度特征表示时,目标函数公式中包含的未知变量有
Figure PCTCN2021091214-appb-000189
θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))。这些未知变量可以通过联合优化公式(12)和公式(13)所示的生成损失函数和对抗损失函数来得到解。
Figure PCTCN2021091214-appb-000190
Figure PCTCN2021091214-appb-000191
因为公式(12)和公式(13)的优化目标是相反的,本发明采用“极大极小博弈(Minimax Game)”方案对公式(14)进行优化来求解未知变量。
Figure PCTCN2021091214-appb-000192
因为B (l),B (v)和B (t)都是离散变量,并且“极大极小”损失函数容易引起梯度消失问题,因此,公式(14)的优化问题是非常棘手的优化问题。为了解决这个问题,本发明采用迭代优化方案来优化公式(14)。首先通过优化
Figure PCTCN2021091214-appb-000193
来求解θ (l)和B (l),然后固定θ (l)和B (l)通过优化
Figure PCTCN2021091214-appb-000194
来求解θ (v)和B (v),类似地,固定θ (l)和B (l)通过优化
Figure PCTCN2021091214-appb-000195
来求解θ (t)和B (t)。不难看出,在上述求解θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))的过程中,图像模态和文本模态的特征表示可以在标签信息的监督下学习得到。将求解得到的θ=(θ (l)(v)(t))和B=(B (l),B (v),B (t))固定,通过分别优化
Figure PCTCN2021091214-appb-000196
Figure PCTCN2021091214-appb-000197
可以求解得到
Figure PCTCN2021091214-appb-000198
Figure PCTCN2021091214-appb-000199
本发明采用后向传播算法以及随机梯度下降完成网络参数的学习。
(5)查询样本和检索样本集中样本的哈希编码的生成
假设图像模态的一个查询样本的特征向量为
Figure PCTCN2021091214-appb-000200
文本模态的一个查询样本的特征向量为
Figure PCTCN2021091214-appb-000201
图像模态检索样本集中样本的特征向量集为
Figure PCTCN2021091214-appb-000202
文本模态检索样本集中样本的特征向量集为
Figure PCTCN2021091214-appb-000203
其中,
Figure PCTCN2021091214-appb-000204
表示检索样本集中样本的数量。图像模态和文本模态查询样本和检索样本集中样本的哈希编码分别为:
Figure PCTCN2021091214-appb-000205
Figure PCTCN2021091214-appb-000206
Figure PCTCN2021091214-appb-000207
其中,θ (v)和θ (t)分别为求解得到的图像模态和文本模态的深度神经网络参数,
Figure PCTCN2021091214-appb-000208
sign(·)为符号函数。
(6)计算汉明距离与完成跨模态检索
在计算查询样本到检索样本集中各个样本的汉明距离时,对于图像模态的查询样本
Figure PCTCN2021091214-appb-000209
使用距离计算公式
Figure PCTCN2021091214-appb-000210
计算图像模态的查询样本
Figure PCTCN2021091214-appb-000211
到文本模态检索样本集中样本
Figure PCTCN2021091214-appb-000212
的汉明距离。对于文本模态的查询样本
Figure PCTCN2021091214-appb-000213
使用距离计算公式
Figure PCTCN2021091214-appb-000214
计算文本模态的查询样本
Figure PCTCN2021091214-appb-000215
到图像模态检索样本集中样本
Figure PCTCN2021091214-appb-000216
Figure PCTCN2021091214-appb-000217
的汉明距离。对于用图像去检索文本的跨模态检索任务,首先对计算得到的
Figure PCTCN2021091214-appb-000218
个汉明距离
Figure PCTCN2021091214-appb-000219
按照从小到大的顺序进行排序,然后,在文本检索样本集中取前K个最小距离对应的样本作为检索结果。类似地,对于用文本去检索图像的跨模态检索任务,首先对计算得到的
Figure PCTCN2021091214-appb-000220
个汉明距离
Figure PCTCN2021091214-appb-000221
按照从小到大的顺序进行排序,然后,在图像检索样本集中取前K个最小距离对应的样本作为检索结果。
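The top-K retrieval step above (rank the retrieval set by Hamming distance to the query code, then keep the K nearest samples) can be sketched as follows; the codes are made up for illustration:

```python
# Sketch of Img2Txt retrieval: sort the text retrieval set by Hamming
# distance to the image query's hash code and return the K nearest indices.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def retrieve_top_k(query_code, retrieval_codes, k):
    order = sorted(range(len(retrieval_codes)),
                   key=lambda i: hamming(query_code, retrieval_codes[i]))
    return order[:k]

query = [1, -1, 1, -1]
database = [[1, 1, 1, -1],    # distance 1
            [1, -1, 1, -1],   # distance 0
            [-1, 1, -1, 1],   # distance 4
            [1, -1, -1, -1]]  # distance 1
top2 = retrieve_top_k(query, database, 2)   # -> [1, 0]
```

The Txt2Img direction is identical with the roles of the two modalities swapped.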
以下结合具体实验对本发明的有益效果进行说明。
本发明在Pascal VOC 2007数据集上进行实验说明其有益效果。Pascal VOC 2007数据集包含来自于20个类别的9963张图像,每幅图像均被标注了标签。数据集被划分成包含5011个图像-标签对的训练集和包含4952个图像-标签对的测试集。图像模态使用原始像素特征作为输入特征。文本模态使用399维的词频特征作为输入特征。实验主要完成用图像检索文本和用文本检索图像这两种跨模态检索任务,为了方便起见,这里将这两种跨模态检索任务分别用Img2Txt和Txt2Img表示。实验在评价跨模态哈希检索方法的性能时使用MAP(Mean Average Precision)这一评价指标。MAP值越大说明跨模态检索的性能越好。实验采用5折交叉验证来确定本发明方法中超参数的值。对比方法中的参数按照各个方法推荐的参数设置原则进行参数设置。报告的结果为进行10次随机实验所得结果的平均值。
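For reference, the MAP metric used in these experiments is the mean, over all queries, of the average precision of each query's ranked result list. The following is a generic sketch of that standard definition, not the authors' evaluation code:

```python
# Sketch of MAP: average precision per query, then the mean over queries.

def average_precision(relevance):
    """relevance: 1/0 flags of the ranked retrieval results for one query."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_relevance):
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)
```

For example, a ranked list with relevance flags [1, 0, 1] has AP = (1/1 + 2/3) / 2 = 5/6; a larger MAP indicates better cross-modal retrieval, as stated above.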
与本发明方法进行对比的方法分别为:(1)文献“Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval”(作者E.Yang,C.Deng,W.Liu,X.Liu,D.Tao,and X.Gao)中的PRDH方法;(2)文献“MHTN:Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval”(作者X.Huang,Y.Peng,and M.Yuan)中的MHTN方法;(3)文献“Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval”(作者C.Li,C.Deng,N.Li,W.Liu,X.Gao,and D.Tao)中的SSAH方法。表1列出了本发明方法和对比方法在Pascal VOC 2007数据集上进行跨模态哈希检索时的MAP。从表1可以看出,对于两种检索任务Img2Txt和Txt2Img,本发明方法的跨模态检索性能均优于PRDH、MHTN和SSAH方法。这说明本发明方法是有效的深度跨模态哈希检索方法。这同时也说明本发明基于对抗学习、三元组约束等技术设计的提升特征鉴别力的方案是有效的。
表1 各方法在Pascal VOC 2007数据集上的MAP
方法 Img2Txt Txt2Img 平均
PRDH 0.5371 0.5434 0.5425
MHTN 0.5557 0.5582 0.5570
SSAH 0.5790 0.5885 0.5838
本发明 0.6034 0.6168 0.6101
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如 ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。
在本实施例中还提供了一种跨模态的数据处理装置,该装置用于实现上述实施例及优选实施方式,已经进行过说明的不再赘述。如以下所使用的,术语“模块”可以实现预定功能的软件和/或硬件的组合。尽管以下实施例所描述的装置较佳地以软件来实现,但是硬件,或者软件和硬件的组合的实现也是可能并被构想的。
Fig. 4 is a structural block diagram of an optional cross-modal data processing apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes:
an acquisition module 402, configured to acquire query data of a first modality;
a processing module 404, configured to determine, respectively, a preset parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of preset parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of the retrieval data of the second modality; the retrieval data of the second modality is data obtained by inputting original data of the second modality into a target neural network model; the preset parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs; the target neural network model includes encoders and a discriminator, the encoders including a sample encoder and a class-label encoder; and each sample pair includes sample data and class-label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator; and
a determination module 406, configured to determine, according to the plurality of preset parameters, one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality.
In an optional embodiment, the apparatus is further configured to: before the query data of the first modality is acquired, repeat the following steps until the value of the objective function configured for the discriminator is minimized: acquiring first training data of the first modality, second training data of the second modality, and class-label data; inputting the first training data and the class-label data into a first initial neural network model to be trained to obtain a first training result, and inputting the second training data and the class-label data into a second initial neural network model to be trained to obtain a second training result; and adjusting the preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model.
In an optional embodiment, the apparatus is further configured to input the first training data and the class-label data into the first initial neural network model to be trained to obtain the first training result, and to input the second training data and the class-label data into the second initial neural network model to be trained to obtain the second training result, in the following manner: inputting the first training data into a first encoder to obtain first target data, and inputting the second training data into a second encoder to obtain second target data; inputting the class-label data into a label encoder to obtain label data; inputting the first target data and the label data into a first discriminator to obtain a first discrimination result, and inputting the second target data and the label data into a second discriminator to obtain a second discrimination result; and determining the first discrimination result as the first training result, and the second discrimination result as the second training result.
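As a minimal numeric illustration of the adversarial objective implied by this step (the logistic form, the linear discriminator, and all names below are assumptions for the sketch, not the patent's actual networks): a discriminator D scores a feature vector, and its binary cross-entropy loss treats the label-encoder output as "real" and the sample-encoder output as "fake"; training the encoders against this loss drives the two feature distributions toward being indistinguishable.

```python
import math

def discriminate(feature, weights):
    """Probability, under a linear logistic discriminator, that the feature
    came from the class-label encoder rather than the sample encoder."""
    score = sum(w * f for w, f in zip(weights, feature))
    return 1.0 / (1.0 + math.exp(-score))

def discriminator_loss(label_feature, sample_feature, weights):
    """Binary cross-entropy of the discriminator on one real/fake pair."""
    return -(math.log(discriminate(label_feature, weights)) +
             math.log(1.0 - discriminate(sample_feature, weights)))

loss = discriminator_loss([1.0, 0.0], [0.0, 1.0], [0.5, -0.3])
print(round(loss, 4))
```

The discriminator is updated to lower this loss, while the encoders are updated to raise it; at the equilibrium described above, the discriminator can no longer tell the two feature sources apart.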
In an optional embodiment, the apparatus is further configured to adjust the preset parameters of the target neural network model based on the first training result and the second training result, so as to obtain the target neural network model, in at least one of the following manners: determining the parameters of the target neural network model based on the first training result and the second training result using a back-propagation algorithm; or determining the parameters of the target neural network model based on the first training result and the second training result using a stochastic gradient descent algorithm.
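A minimal sketch of the stochastic-gradient-descent option mentioned above (plain Python, names assumed; in practice back-propagation supplies the gradients for every network parameter):

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic-gradient-descent update: p <- p - lr * grad(p)."""
    return [p - lr * g for p, g in zip(params, grads)]

print(sgd_step([1.0, 2.0], [0.5, -1.0]))  # → [0.995, 2.01]
```

Repeating this update over mini-batches, with gradients computed by back-propagation through the encoders and discriminators, is the conventional way such models are trained.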
In an optional embodiment, the apparatus is further configured to: generate a triplet set based on the first training data and the second training data, wherein each triplet in the triplet set includes first training data selected as an anchor, second training data having the same label as the first training data, and second training data having a different label from the first training data; minimize, through an objective function, the Euclidean distance between the first training data selected as the anchor and the second training data having the same label as the first training data; maximize, through an objective function, the Euclidean distance between the first training data selected as the anchor and the second training data having a different label from the first training data; and obtain the constrained first training data and the constrained second training data.
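The triplet constraint above can be sketched as follows. The hinge form and the margin value are common conventions assumed here for illustration, not values taken from the patent: the Euclidean distance from the anchor to the same-label sample is driven down, and the distance to the different-label sample is driven up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the same-label (positive) sample is closer to the anchor
    than the different-label (negative) sample by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # → 0.0 (well separated)
print(triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0]))  # → 3.0 (constraint violated)
```

Summing this loss over all triplets and minimizing it realizes the minimize/maximize pair of objectives described above in a single objective function.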
In an optional embodiment, the apparatus is further configured to: before the first training data and the class-label data are input into the first initial neural network model to be trained to obtain the first training result and the second training data and the class-label data are input into the second initial neural network model to be trained to obtain the second training result, process the first training data with a sign function to obtain a first set of hash codes; input the first set of hash codes into a third discriminator to obtain a third discrimination result; determine the third discrimination result as a third training result; and train the third discriminator and a first encoder based on the third training result, wherein the first initial neural network model includes the first encoder.
In an optional embodiment, the apparatus is further configured to: before the second training data and the class-label data are input into the second initial neural network model to be trained to obtain the second training result, process the second training data with a sign function to obtain a second set of hash codes; input the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result; determine the fourth discrimination result as a fourth training result; and train the fourth discriminator and a second encoder based on the fourth training result, wherein the second initial neural network model includes the second encoder.
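The sign-function step used in both of the optional embodiments above can be sketched as follows. The -1/+1 convention and the treatment of zero are assumptions for illustration; the text only states that a sign function converts real-valued encoder outputs into hash codes.

```python
def sign_hash(features):
    """Binarize real-valued encoder outputs into a -1/+1 hash code."""
    return [1 if f >= 0 else -1 for f in features]

print(sign_hash([0.3, -1.2, 0.0, 2.5]))  # → [1, -1, 1, 1]
```

The resulting binary codes are what the third and fourth discriminators receive, so that the encoders learn outputs whose binarization loses as little discriminative information as possible.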
It should be noted that each of the above modules may be implemented by software or by hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
An embodiment of the present invention further provides a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
In this embodiment, the computer-readable storage medium may be configured to store a computer program for executing the following steps: S1, acquiring query data of a first modality; S2, determining, respectively, a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of the retrieval data of the second modality; the retrieval data of the second modality is data obtained by inputting original data of the second modality into a target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs; the target neural network model includes encoders and a discriminator, the encoders including a sample encoder and a class-label encoder; and each sample pair includes sample data and class-label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator; and S3, determining, according to the plurality of target parameters, one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the present invention further provides an electronic apparatus, including a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute, through a computer program, the following steps: S1, acquiring query data of a first modality; S2, determining, respectively, a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of the retrieval data of the second modality; the retrieval data of the second modality is data obtained by inputting original data of the second modality into a target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs; the target neural network model includes encoders and a discriminator, the encoders including a sample encoder and a class-label encoder; and each sample pair includes sample data and class-label data, such that the data obtained by inputting the sample data into the sample encoder and the data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator; and S3, determining, according to the plurality of target parameters, one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which will not be repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices; they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be executed in an order different from that herein; and they may also be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A cross-modal data processing method, comprising:
    acquiring query data of a first modality;
    determining, respectively, a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of the retrieval data of the second modality; the retrieval data of the second modality is data obtained by inputting original data of the second modality into a target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs; the target neural network model comprises encoders and a discriminator, the encoders comprising a sample encoder and a class-label encoder; and each of the sample pairs comprises sample data and class-label data, such that data obtained by inputting the sample data into the sample encoder and data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator; and
    determining, according to the plurality of target parameters, one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality.
  2. The method according to claim 1, wherein before the query data of the first modality is acquired, the method further comprises:
    repeating the following steps until the value of the objective function configured for the discriminator is minimized:
    acquiring first training data of the first modality, second training data of the second modality, and class-label data;
    inputting the first training data and the class-label data into a first initial neural network model to be trained to obtain a first training result, and inputting the second training data and the class-label data into a second initial neural network model to be trained to obtain a second training result; and
    adjusting preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model.
  3. The method according to claim 2, wherein inputting the first training data and the class-label data into the first initial neural network model to be trained to obtain the first training result, and inputting the second training data and the class-label data into the second initial neural network model to be trained to obtain the second training result, comprises:
    inputting the first training data into a first encoder to obtain first target data, and inputting the second training data into a second encoder to obtain second target data;
    inputting the class-label data into a label encoder to obtain label data;
    inputting the first target data and the label data into a first discriminator to obtain a first discrimination result, and inputting the second target data and the label data into a second discriminator to obtain a second discrimination result; and
    determining the first discrimination result as the first training result, and determining the second discrimination result as the second training result.
  4. The method according to claim 2, further comprising:
    generating a triplet set based on the first training data and the second training data, wherein each triplet in the triplet set comprises first training data selected as an anchor, second training data having the same label as the first training data, and second training data having a different label from the first training data;
    minimizing, through an objective function, the Euclidean distance between the first training data selected as the anchor and the second training data having the same label as the first training data;
    maximizing, through an objective function, the Euclidean distance between the first training data selected as the anchor and the second training data having a different label from the first training data; and
    obtaining the constrained first training data and the constrained second training data.
  5. The method according to claim 2, wherein before the second training data and the class-label data are input into the second initial neural network model to be trained to obtain the second training result, the method further comprises:
    processing the second training data with a sign function to obtain a second set of hash codes;
    inputting the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result;
    determining the fourth discrimination result as a fourth training result; and
    training the fourth discriminator and a second encoder based on the fourth training result, wherein the second initial neural network model comprises the second encoder.
  6. The method according to claim 2, wherein before the first training data and the class-label data are input into the first initial neural network model to be trained to obtain the first training result, and the second training data and the class-label data are input into the second initial neural network model to be trained to obtain the second training result, the method further comprises:
    processing the first training data with a sign function to obtain a first set of hash codes;
    inputting the first set of hash codes into a third discriminator to obtain a third discrimination result;
    determining the third discrimination result as a third training result;
    training the third discriminator and a first encoder based on the third training result, wherein the first initial neural network model comprises the first encoder;
    processing the second training data with the sign function to obtain a second set of hash codes;
    inputting the second set of hash codes into a fourth discriminator to obtain a fourth discrimination result;
    determining the fourth discrimination result as a fourth training result; and
    training the fourth discriminator and a second encoder based on the fourth training result, wherein the second initial neural network model comprises the second encoder.
  7. The method according to claim 2, wherein adjusting the preset parameters of the target neural network model based on the first training result and the second training result to obtain the target neural network model comprises at least one of:
    determining the parameters of the target neural network model based on the first training result and the second training result using a back-propagation algorithm; and
    determining the parameters of the target neural network model based on the first training result and the second training result using a stochastic gradient descent algorithm.
  8. A cross-modal data processing apparatus, comprising:
    an acquisition module, configured to acquire query data of a first modality;
    a processing module, configured to determine, respectively, a target parameter between the query data of the first modality and each piece of retrieval data of a second modality in a retrieval data set of the second modality, so as to obtain a plurality of target parameters, wherein the retrieval data set of the second modality contains a plurality of pieces of the retrieval data of the second modality; the retrieval data of the second modality is data obtained by inputting original data of the second modality into a target neural network model; the target parameter is used to indicate the similarity between the query data of the first modality and the retrieval data of the second modality; the target neural network model is a neural network model obtained by training an initial neural network model with a set of sample pairs; the target neural network model comprises encoders and a discriminator, the encoders comprising a sample encoder and a class-label encoder; and each of the sample pairs comprises sample data and class-label data, such that data obtained by inputting the sample data into the sample encoder and data obtained by inputting the class-label data into the class-label encoder cannot be distinguished by the discriminator; and
    a determination module, configured to determine, according to the plurality of target parameters, one or more pieces of the retrieval data of the second modality as target data corresponding to the query data of the first modality.
  9. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  10. An electronic apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/091214 2020-09-30 2021-04-29 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus WO2022068195A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011063068.6 2020-09-30
CN202011063068.6A CN112199462A (zh) 2020-09-30 2020-09-30 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
WO2022068195A1 true WO2022068195A1 (zh) 2022-04-07

Family

ID=74013547

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091214 WO2022068195A1 (zh) 2020-09-30 2021-04-29 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN112199462A (zh)
WO (1) WO2022068195A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114942984A (zh) * 2022-05-26 2022-08-26 北京百度网讯科技有限公司 Pre-training of a visual scene text fusion model, and image-text retrieval method and apparatus
CN115984302A (zh) * 2022-12-19 2023-04-18 中国科学院空天信息创新研究院 Multi-modal remote sensing image processing method based on sparse mixture-of-experts network pre-training
CN116051830A (zh) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 Contrastive semantic segmentation method for cross-modal data fusion
CN116049459A (zh) * 2023-03-30 2023-05-02 浪潮电子信息产业股份有限公司 Cross-modal mutual retrieval method and apparatus, server, and storage medium
CN116431788A (zh) * 2023-04-14 2023-07-14 中电科大数据研究院有限公司 Semantic retrieval method for cross-modal data
CN116825210A (zh) * 2023-08-28 2023-09-29 山东大学 Hash retrieval method, system, device, and medium based on multi-source biological data
CN117171934A (zh) * 2023-11-03 2023-12-05 成都大学 Galloping response prediction method for overhead transmission lines based on POD-ANNs
CN117194605A (zh) * 2023-11-08 2023-12-08 中南大学 Hash encoding method, terminal, and medium for missing multi-modal medical data

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199462A (zh) 2020-09-30 2021-01-08 三维通信股份有限公司 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus
CN113515657B (zh) * 2021-07-06 2022-06-14 天津大学 Cross-modal multi-view object retrieval method and apparatus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256627A (zh) * 2017-12-29 2018-07-06 中国科学院自动化研究所 Apparatus for mutual generation of audio-visual information and its training system based on cyclic generative adversarial networks
CN109299342A (zh) * 2018-11-30 2019-02-01 武汉大学 Cross-modal retrieval method based on cycle generative adversarial networks
US20190130257A1 (en) * 2017-10-27 2019-05-02 Sentient Technologies (Barbados) Limited Beyond Shared Hierarchies: Deep Multitask Learning Through Soft Layer Ordering
CN110059157A (zh) * 2019-03-18 2019-07-26 华南师范大学 Image-text cross-modal retrieval method, system, apparatus, and storage medium
CN110909181A (zh) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110990595A (zh) * 2019-12-04 2020-04-10 成都考拉悠然科技有限公司 Zero-shot cross-modal retrieval method with cross-domain aligned embedding space
CN111581405A (zh) * 2020-04-26 2020-08-25 电子科技大学 Cross-modal generalized zero-shot retrieval method based on dual-learning generative adversarial networks
CN112199462A (zh) * 2020-09-30 2021-01-08 三维通信股份有限公司 Cross-modal data processing method and apparatus, storage medium, and electronic apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN YING, HUANGKANG CHEN: "Speaker Recognition Based on Multimodal Generative Adversarial Nets with Triplet-loss", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, Institute of Electronics, Chinese Academy of Sciences, CN, vol. 42, no. 2, 1 February 2020, pages 379-385, ISSN: 1009-5896, DOI: 10.11999/JEIT190154 *
LI CHAO; DENG CHENG; LI NING; LIU WEI; GAO XINBO; TAO DACHENG: "Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 4242 - 4251, XP033476397, DOI: 10.1109/CVPR.2018.00446 *

Also Published As

Publication number Publication date
CN112199462A (zh) 2021-01-08

Similar Documents

Publication Publication Date Title
WO2022068195A1 (zh) Cross-modal data processing method and apparatus, storage medium, and electronic apparatus
WO2022068196A1 (zh) Cross-modal data processing method and apparatus, storage medium, and electronic apparatus
Wu et al. Joint image-text hashing for fast large-scale cross-media retrieval using self-supervised deep learning
CN110309267B (zh) 基于预训练模型的语义检索方法和系统
US10438091B2 (en) Method and apparatus for recognizing image content
CN111353076B (zh) 训练跨模态检索模型的方法、跨模态检索的方法和相关装置
CN109472033B (zh) 文本中的实体关系抽取方法及系统、存储介质、电子设备
Xia et al. Supervised hashing for image retrieval via image representation learning
CN109783666B (zh) 一种基于迭代精细化的图像场景图谱生成方法
WO2022077646A1 (zh) 一种用于图像处理的学生模型的训练方法及装置
CN113177132B (zh) 基于联合语义矩阵的深度跨模态哈希的图像检索方法
CN111753189A (zh) 一种少样本跨模态哈希检索共同表征学习方法
CN112800292B (zh) 一种基于模态特定和共享特征学习的跨模态检索方法
Xie et al. Deep determinantal point process for large-scale multi-label classification
CN109960732B (zh) 一种基于鲁棒监督的深度离散哈希跨模态检索方法及系统
CN113806582B (zh) 图像检索方法、装置、电子设备和存储介质
US20230297617A1 (en) Video retrieval method and apparatus, device, and storage medium
CN113076465A (zh) 一种基于深度哈希的通用跨模态检索模型
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN110598022A (zh) 一种基于鲁棒深度哈希网络的图像检索系统与方法
CN111026887B (zh) 一种跨媒体检索的方法及系统
CN114943017A (zh) 一种基于相似性零样本哈希的跨模态检索方法
Zheng et al. Learning from the web: Webly supervised meta-learning for masked face recognition
CN116680407A (zh) 一种知识图谱的构建方法及装置
CN110929013A (zh) 一种基于bottom-up attention和定位信息融合的图片问答实现方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21873858

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21873858

Country of ref document: EP

Kind code of ref document: A1
