WO2022247562A1 - Multimodal data retrieval method and apparatus, medium, and electronic device - Google Patents

Multimodal data retrieval method and apparatus, medium, and electronic device

Info

Publication number
WO2022247562A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature extraction
extraction network
retrieval
modality
Prior art date
Application number
PCT/CN2022/089241
Other languages
English (en)
French (fr)
Inventor
夏锦
文柯宇
黄媛媛
邵杰
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Priority to US18/563,222 (published as US20240233334A1)
Publication of WO2022247562A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/60Image enhancement or restoration using machine learning, e.g. neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to the field of data processing, and in particular, to a multimodal data retrieval method, device, medium and electronic equipment.
  • Content-based multimodal matching technology has a large number of application scenarios in Internet business, including but not limited to image retrieval (such as searching images by image), cross-modal retrieval (such as searching images by text, searching text by image, searching video by text, etc.), and text matching (searching text by text).
  • In a first aspect, the present disclosure provides a multimodal data retrieval method, the method comprising: inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data; inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and performing retrieval according to the target retrieval feature.
  • In a second aspect, the present disclosure provides a multimodal data retrieval device, the device comprising: a first processing module, configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data and obtain the data features of the target retrieval data; a second processing module, configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data and obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and a retrieval module, configured to perform retrieval according to the target retrieval feature.
  • In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in the first aspect are implemented.
  • In a fourth aspect, the present disclosure provides an electronic device, including: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect.
  • Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction network and the second feature extraction network respectively corresponding to data of different modalities. Since the second feature extraction networks of the various modalities share weights, the number of parameters used in the overall network model can be compressed, the structure of the network model is optimized, and its training efficiency is improved, while the retrieval accuracy is also improved in retrieval tasks of both single-modal retrieval and cross-modal retrieval in any modality.
  • Fig. 1 is a flow chart showing a multimodal data retrieval method according to an exemplary embodiment of the present disclosure.
  • Fig. 2 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to yet another exemplary embodiment of the present disclosure.
  • Figure 3 shows a multimodal retrieval network model including the first feature extraction network and the second feature extraction network.
  • Fig. 4 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multi-modal data retrieval method according to yet another exemplary embodiment of the present disclosure.
  • Fig. 5 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multi-modal data retrieval method according to yet another exemplary embodiment of the present disclosure.
  • Fig. 6 is a flow chart showing a multimodal data retrieval method according to yet another exemplary embodiment of the present disclosure.
  • Fig. 7 is a structural block diagram of a multimodal data retrieval device according to an exemplary embodiment of the present disclosure.
  • Fig. 8 is a structural block diagram of a multimodal data retrieval device according to yet another exemplary embodiment of the present disclosure.
  • FIG. 9 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
  • The term "comprise" and its variations are open-ended, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart showing a multimodal data retrieval method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1 , the method includes steps 101 to 103.
  • In step 101, target retrieval data is input into a first feature extraction network corresponding to the modality of the target retrieval data, and the data features of the target retrieval data are acquired.
  • In step 102, the data features are input into the second feature extraction network corresponding to the modality of the target retrieval data, and the target retrieval feature corresponding to the target retrieval data is obtained, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities.
  • In step 103, retrieval is performed according to the target retrieval feature.
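  • Read purely as an illustration, steps 101 and 102 form a two-stage pipeline. The sketch below assumes Python with callable network objects; the names `first_nets` and `shared_second_net` are hypothetical, not from the patent (step 103 is illustrated further below).

```python
def extract_target_retrieval_feature(data, modality, first_nets, shared_second_net):
    """Steps 101-102 as a two-stage pipeline. `first_nets` maps each
    modality name to its first feature extraction network; the second
    feature extraction network is a single shared module, so its
    weights are shared across modalities by construction."""
    data_features = first_nets[modality](data)      # step 101
    return shared_second_net(data_features)         # step 102
```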
  • The modality of the target retrieval data can be any modality, and the data to be retrieved can also be in any modality; the modalities can include, for example, a text modality, an image modality, a video modality, and the like.
  • For example, the target retrieval data can be data in an image modality while the data to be retrieved is data in a text modality, or the target retrieval data can be data in a text modality while the data to be retrieved is data in an image modality; in these cases, the data retrieval is cross-modal retrieval, that is, retrieving, among the data to be retrieved in one modality, the data most similar to data in another modality.
  • Alternatively, when the target retrieval data is data in an image modality, the data to be retrieved may also be data in an image modality, and when the target retrieval data is data in a text modality, the data to be retrieved may also be data in a text modality; the data retrieval in these cases is single-modal retrieval. Whether or not the modalities of the target retrieval data and the data to be retrieved are the same, retrieval can be performed through the present disclosure.
  • the actual content of the target retrieval data and the data to be retrieved can be determined in real time according to specific retrieval tasks.
  • The first feature extraction network may be one of one or more feature extraction networks each capable of extracting data features from data of a different modality. For example, a text feature extraction network for feature extraction of text-modality data can serve as the first feature extraction network, and a visual feature extraction network for feature extraction of image-modality data can also serve as the first feature extraction network; any network that can perform feature extraction on the target retrieval data can be used as the first feature extraction network.
  • The specific choice of the first feature extraction network is related to the modality of the target retrieval data. For example, if the modality of the target retrieval data is a text modality, the first feature extraction network can be selected as a text feature extraction network that performs text feature extraction on text-modality data; if the modality of the target retrieval data is an image modality, the first feature extraction network can be selected as a visual feature extraction network that performs image feature extraction on image-modality data.
  • the second feature extraction network is used to further extract the target retrieval feature according to the data features acquired by the first feature extraction network.
  • The input and output data dimensions of the second feature extraction network are the same, and the target retrieval feature can be obtained by pooling the output matrix. Each modality not only has a corresponding second feature extraction network, but the corresponding second feature extraction networks also share weights; that is, when the second feature extraction networks corresponding to the respective modalities are trained, the representations learned for each modality can all be taken into account, so that the final second feature extraction network, whatever the modality, learns a common representation that is consistent across modalities. As a result, better target retrieval features can be extracted for retrieval in the cross-modal case, and better retrieval accuracy is also achieved in single-modal retrieval tasks.
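  • As one hedged illustration of this property, the following PyTorch sketch instantiates the second feature extraction network once and reuses it for every modality, so the weights are shared by construction. The Transformer encoder and the layer sizes are assumptions (the disclosure elsewhere mentions a Transformer model network), and mean pooling stands in for the pooling operation:

```python
import torch.nn as nn

class SharedSecondNet(nn.Module):
    """A second feature extraction network whose input and output
    dimensions match, instantiated once and reused for every modality
    so that the weights are shared. Layer sizes are hypothetical."""
    def __init__(self, dim=1024, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):          # x: (batch, num_blocks, dim)
        y = self.encoder(x)        # same shape as the input
        return y.mean(dim=1)       # pooling -> (batch, dim)
```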
  • After the target retrieval feature corresponding to the target retrieval data is obtained through the first feature extraction network and the second feature extraction network, the corresponding first feature extraction network and second feature extraction network can likewise be determined according to the modality of each piece of data to be retrieved, so as to obtain the retrieval feature corresponding to each piece of data to be retrieved. The target retrieval feature and the retrieval features of the data to be retrieved obtained in this way enable better retrieval accuracy.
  • The method of performing retrieval according to the target retrieval feature may be to retrieve, in a retrieval database, the data matching the target retrieval data according to the target retrieval feature.
  • The retrieval database can also be determined according to the actual retrieval task. If the actual retrieval task is single-modal retrieval for the text modality, the data to be retrieved included in the selected retrieval database can all be text-modality data; if the actual retrieval task is to retrieve image-modality data according to target retrieval data in the text modality, the data to be retrieved in the retrieval database can all be image-modality data.
  • the retrieval database may include the data to be retrieved, directly include the retrieval features corresponding to the data to be retrieved, or include both the data to be retrieved and the retrieval features corresponding to the data to be retrieved.
  • In the case where the retrieval database directly includes the retrieval features corresponding to the data to be retrieved, those retrieval features may likewise have been obtained through the first feature extraction network and the second feature extraction network corresponding to the modality of the data to be retrieved.
  • During retrieval, the similarity between the target retrieval feature and the retrieval feature of each piece of data to be retrieved can be calculated and sorted; the higher the similarity, the more semantically similar the candidate data is to the target retrieval data.
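  • A minimal sketch of this similarity-and-sort step, assuming the retrieval features have been stacked into PyTorch tensors; cosine similarity is an assumption here, since the disclosure does not fix a particular similarity metric:

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_feature, candidate_features, top_k=10):
    """query_feature: (dim,); candidate_features: (num_candidates, dim).
    Returns the top-k candidates sorted by similarity to the query."""
    sims = F.cosine_similarity(query_feature.unsqueeze(0),
                               candidate_features, dim=1)
    k = min(top_k, candidate_features.size(0))
    scores, indices = sims.topk(k)   # higher score = more similar
    return scores, indices
```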
  • Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction network and the second feature extraction network respectively corresponding to data of different modalities. Since the second feature extraction networks of the various modalities share weights, the number of parameters used in the overall network model can be compressed, the structure of the network model is optimized, and its training efficiency is improved, while the retrieval accuracy is also improved in retrieval tasks of both single-modal retrieval and cross-modal retrieval in any modality.
  • Fig. 2 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to yet another exemplary embodiment of the present disclosure.
  • The first feature extraction network and the second feature extraction network are obtained through pre-training, and the first feature extraction network and the second feature extraction network are pre-trained simultaneously.
  • the pre-training method includes steps 201 to 203 .
  • In step 201, two or more pieces of first sample data with the same content but different modalities are respectively input into the first feature extraction networks corresponding to the modalities of the first sample data to obtain the data features of the first sample data.
  • In step 202, the data features of the first sample data are respectively input into the second feature extraction networks corresponding to the first sample data to obtain the retrieval features corresponding to the first sample data.
  • In step 203, a first loss value is determined according to the difference between the retrieval features obtained for the first sample data of different modalities, and the first feature extraction network and the second feature extraction network corresponding to each modality are adjusted according to the first loss value.
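  • The disclosure does not specify the functional form of the first loss value beyond "the difference between the retrieval features"; a plain mean-squared difference is one hedged reading, sketched below (an InfoNCE-style contrastive loss over the batch would be another common choice):

```python
import torch.nn.functional as F

def first_loss(retrieval_feats_a, retrieval_feats_b):
    """Retrieval features of the same first-sample content in two
    different modalities, shape (batch, dim) each. A plain mean-squared
    difference; purely illustrative of 'the difference between the
    retrieval features'."""
    return F.mse_loss(retrieval_feats_a, retrieval_feats_b)
```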
  • Figure 3 shows a multimodal retrieval network model including the first feature extraction network and the second feature extraction network.
  • the entire network model includes the first feature extraction network 10 and the second feature extraction network 20 .
  • Fig. 3 also schematically shows the visual feature extraction network 11 and the text feature extraction network 12 that may be included in the first feature extraction network 10, as well as the second feature extraction network 21 corresponding to the image modality and the second feature extraction network 22 corresponding to the text modality.
  • Image or video data 1, i.e., data of the image modality, can obtain data features through the visual feature extraction network 11, and these data features are input into the second feature extraction network 21 corresponding to the image modality for further feature extraction to obtain the final visual retrieval feature 3.
  • The text data 2 can be input into the text feature extraction network 12 to obtain data features, and the data features can be input into the second feature extraction network 22 corresponding to the text modality for further feature extraction to obtain the final text retrieval feature 4.
  • For example, the first sample data includes a piece of text data whose content is "puppy" and a piece of image data whose content is also "puppy".
  • The text data can be input into the text feature extraction network 12 to obtain data features, and the obtained data features can then be input into the second feature extraction network 22 to obtain the text retrieval feature 4 corresponding to the text data; the image data is input into the visual feature extraction network 11 to obtain data features, and the data features are then input into the second feature extraction network 21 to obtain the visual retrieval feature 3 corresponding to the image data.
  • The first loss value is determined according to the difference between the retrieval features corresponding to the two pieces of sample data, and the text feature extraction network 12, the visual feature extraction network 11, and the second feature extraction networks corresponding to the respective modalities are adjusted according to the first loss value.
  • In this way, the first feature extraction network and the second feature extraction network can be pre-trained simultaneously on first sample data of different modalities but consistent content, so that they can learn representations of the shared semantics of data in different modalities. The extracted retrieval features thus attend more to the content meaning of the data, reducing the impact of the data modality on the retrieval features and thereby improving precision in cross-modal retrieval.
  • Fig. 4 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multi-modal data retrieval method according to yet another exemplary embodiment of the present disclosure. As shown in FIG. 4 , the pre-training method further includes steps 401 to 404 .
  • In step 401, image enhancement is performed on second sample data belonging to an image modality or a video modality to obtain enhanced sample data corresponding to the second sample data.
  • The image enhancement can be performed in any manner; the present disclosure does not limit the image enhancement method.
  • In step 402, the second sample data and the enhanced sample data are input into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain the data features of the second sample data and the enhanced sample data.
  • In step 403, the data features of the second sample data and the enhanced sample data are input into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain the retrieval features corresponding to the second sample data and the enhanced sample data.
  • In step 404, a second loss value is determined according to the difference between the retrieval features corresponding to the second sample data and the enhanced sample data, and the first feature extraction network and the second feature extraction network corresponding to the image modality or the video modality are adjusted according to the second loss value.
  • That is, the pre-training method also includes image self-supervised pre-training for the first feature extraction network and the second feature extraction network corresponding to the image modality or video modality, which is training for single-modal retrieval, so the retrieval accuracy of the first feature extraction network and the second feature extraction network on single-modal data can be guaranteed to a certain extent.
  • Moreover, since the weights of the second feature extraction networks corresponding to the respective modalities are shared, the second feature extraction network can better learn the semantic representations of the respective modalities, thereby improving the accuracy of cross-modal retrieval to a certain extent.
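  • Steps 401 to 404 read like augmentation-consistency self-supervision; the sketch below is written under that assumption, with a deliberately simple `augment` function since the disclosure leaves the enhancement method open, and a mean-squared difference standing in for the unspecified second loss:

```python
import torch
import torch.nn.functional as F

def augment(images):
    """Step 401: a deliberately simple enhancement (horizontal flip plus
    small noise); the patent leaves the enhancement method open."""
    return torch.flip(images, dims=[-1]) + 0.01 * torch.randn_like(images)

def second_loss(images, visual_first_net, shared_second_net):
    # Steps 402-403: retrieval features of the original and enhanced views.
    feats = shared_second_net(visual_first_net(images))
    feats_aug = shared_second_net(visual_first_net(augment(images)))
    # Step 404: the loss is the difference between the two retrieval
    # features (mean-squared difference here; the exact form is illustrative).
    return F.mse_loss(feats, feats_aug)
```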
  • Fig. 5 is a flow chart of a method for pre-training the first feature extraction network and the second feature extraction network in a multi-modal data retrieval method according to yet another exemplary embodiment of the present disclosure. As shown in Figure 5, the pre-training method also includes steps 501 to 504.
  • In step 501, the original text content in third sample data belonging to the text modality is randomly partially covered to obtain masked sample data corresponding to the third sample data.
  • In step 502, the retrieval feature corresponding to the masked sample data is extracted through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality.
  • In step 503, the text covered by the random partial masking in the masked sample data is predicted according to the retrieval feature corresponding to the masked sample data.
  • In step 504, the difference between the predicted text and the original text content is determined as a third loss value, and the first feature extraction network and the second feature extraction network corresponding to the text modality are adjusted according to the third loss value.
  • That is, the pre-training method also includes text self-supervised pre-training for the first feature extraction network and the second feature extraction network corresponding to the text modality, which is another form of training for single-modal retrieval. Together with the image self-supervised pre-training for the image modality or video modality, the retrieval accuracy of the first feature extraction network and the second feature extraction network on single-modal data can be guaranteed to a certain extent.
  • Moreover, since the weights of the second feature extraction networks corresponding to the respective modalities are shared, the second feature extraction network can better learn the semantic representations of the respective modalities, thereby improving the accuracy of cross-modal retrieval to a certain extent.
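  • Steps 501 to 504 resemble masked-language-model style self-supervision. In the sketch below, `token_encoder` stands for the pre-pooling output of the second feature extraction network, `predict_head` is a hypothetical projection to vocabulary logits, and the masking ratio of 0.15 is a conventional choice; none of these are specified by the patent:

```python
import torch
import torch.nn.functional as F

def third_loss(token_ids, text_first_net, token_encoder, predict_head,
               mask_token_id, mask_prob=0.15):
    # Step 501: randomly cover part of the original text.
    covered = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    masked_ids = token_ids.masked_fill(covered, mask_token_id)
    # Step 502: features from the text-modality first network and the
    # (pre-pooling) output of the shared second network.
    token_feats = token_encoder(text_first_net(masked_ids))  # (b, seq, dim)
    # Step 503: predict the covered tokens from those features.
    logits = predict_head(token_feats)                       # (b, seq, vocab)
    # Step 504: score the prediction against the original text.
    return F.cross_entropy(logits[covered], token_ids[covered])
```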
  • Fig. 6 is a flow chart showing a multimodal data retrieval method according to yet another exemplary embodiment of the present disclosure. As shown in FIG. 6 , before step 101 , the method further includes step 601 to step 603 .
  • In step 601, a target retrieval task is acquired.
  • In step 602, the first feature extraction network and the second feature extraction network that need fine-tuning training are determined according to the target modality corresponding to the target retrieval task.
  • In step 603, fine-tuning training is performed on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and the first feature extraction network and the second feature extraction network are replaced by the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
  • That is, before the retrieval task is performed, the first feature extraction network and the second feature extraction network related to the modality corresponding to the target retrieval task can each be fine-tuned, so as to adjust them into feature extraction networks better suited to the target retrieval task.
  • For example, if the target retrieval task is a single-modal retrieval task for the text modality, training data of the text modality can be used to fine-tune the first feature extraction network and the second feature extraction network corresponding to the text modality, for example the text feature extraction network 12 and the second feature extraction network 22 in Fig. 3.
  • The method of fine-tuning on training data of the text modality can be the same as the training of the first feature extraction network and the second feature extraction network corresponding to the text modality in the pre-training process described above: the original text content in fourth sample data belonging to the text modality is randomly partially covered to obtain masked sample data corresponding to the fourth sample data; the retrieval feature corresponding to the masked sample data is extracted through the first feature extraction network and the second feature extraction network corresponding to the text modality; the text covered by the random partial masking in the masked sample data is predicted according to the extracted retrieval feature; and a loss value is determined from the difference between the predicted text and the covered content of the original text in the fourth sample data, with the first feature extraction network and the second feature extraction network corresponding to the text modality adjusted according to that loss value. Finally, retrieval is performed using the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
  • If the target retrieval task is a single-modal retrieval task for the image modality, it is only necessary to use training data of the image modality to fine-tune the first feature extraction network and the second feature extraction network corresponding to the image modality, for example the visual feature extraction network 11 and the second feature extraction network 21 in Fig. 3.
  • If the target retrieval task is a cross-modal retrieval task between the text modality and the image modality, the first feature extraction networks and the second feature extraction networks corresponding to both the text modality and the image modality are fine-tuned, for example the visual feature extraction network 11, the text feature extraction network 12, the second feature extraction network 21, and the second feature extraction network 22 in Fig. 3.
  • In this way, the relevant first feature extraction network and second feature extraction network are further adjusted according to the target retrieval task, so that the networks related to the target retrieval task perform better in that task, and the retrieval accuracy for the target retrieval task can be further improved.
  • the second feature extraction network is a Transformer model network.
  • In the second feature extraction network, the number of blocks of the input data can differ between modalities; for example, the data features corresponding to the image modality can comprise 256 blocks, while those of the text modality can comprise 30 blocks.
  • The data dimension output by the first feature extraction network corresponding to each modality can be adjusted to ensure that the dimensions of the target retrieval features finally output by the second feature extraction networks corresponding to the respective modalities are the same.
  • The visual feature extraction network 11 shown in Fig. 3 can be, for example, a CNN (convolutional neural network). Its input is an image, uniformly resized to 512*512 within the network; the visual feature map obtained after the network's feature extraction can be, for example, 16*16*2048, which is flattened to 256*2048 and finally mapped to 1024 dimensions through a fully connected layer as the output of this network module.
  • The text feature extraction network 12 shown in Fig. 3 can be, for example, an LSTM or a GRU based on a recurrent neural network. Its input is a piece of text encoded as 768-dimensional vectors; the 30*768-dimensional text features obtained after the network's feature extraction are likewise mapped to 1024 dimensions by a fully connected layer and then used as the output of this network module.
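  • Putting the example dimensions above into code, the following sketch shows two first feature extraction networks that both project their outputs to 1024 dimensions so the shared second network can consume either. The CNN backbone and GRU are only the examples named above, and every class and parameter name is hypothetical:

```python
import torch.nn as nn

class VisualFirstNet(nn.Module):
    """A CNN-style first network: a backbone producing a 16*16*2048
    feature map (for a 512*512 input), flattened to 256 blocks and
    projected to 1024 dimensions. The backbone is an assumption."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone            # outputs (batch, 2048, 16, 16)
        self.fc = nn.Linear(2048, 1024)

    def forward(self, images):              # images: (batch, 3, 512, 512)
        fmap = self.backbone(images)
        blocks = fmap.flatten(2).transpose(1, 2)   # (batch, 256, 2048)
        return self.fc(blocks)                     # (batch, 256, 1024)

class TextFirstNet(nn.Module):
    """A recurrent first network: 768-dimensional token embeddings fed
    through a GRU over 30 tokens, then projected to 1024 dimensions."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 768)
        self.gru = nn.GRU(768, 768, batch_first=True)
        self.fc = nn.Linear(768, 1024)

    def forward(self, token_ids):            # token_ids: (batch, 30)
        out, _ = self.gru(self.embed(token_ids))
        return self.fc(out)                   # (batch, 30, 1024)
```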
  • Fig. 7 is a structural block diagram of a multimodal data retrieval device according to an exemplary embodiment of the present disclosure.
  • As shown in Fig. 7, the device includes: a first processing module 10, configured to input the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data and obtain the data features of the target retrieval data; a second processing module 20, configured to input the data features into the second feature extraction network corresponding to the modality of the target retrieval data and obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and a retrieval module 30, configured to perform retrieval according to the target retrieval feature.
  • Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction network and the second feature extraction network respectively corresponding to data of different modalities. Since the second feature extraction networks of the various modalities share weights, the number of parameters used in the overall network model can be compressed, the structure of the network model is optimized, and its training efficiency is improved, while the retrieval accuracy is also improved in retrieval tasks of both single-modal retrieval and cross-modal retrieval in any modality.
  • the first feature extraction network and the second feature extraction network are obtained through pre-training.
  • The first feature extraction network and the second feature extraction network are pre-trained simultaneously, and the pre-training method includes: respectively inputting two or more pieces of first sample data with the same content but different modalities into the first feature extraction networks corresponding to the modalities of the first sample data to obtain the data features of the first sample data; respectively inputting the data features of the first sample data into the second feature extraction networks corresponding to the first sample data to obtain the retrieval features corresponding to the first sample data; and determining a first loss value according to the difference between the retrieval features corresponding to the first sample data, and adjusting the first feature extraction network and the second feature extraction network corresponding to each modality according to the first loss value.
  • The pre-training method further includes: performing image enhancement on second sample data belonging to an image modality or a video modality to obtain enhanced sample data corresponding to the second sample data; inputting the second sample data and the enhanced sample data into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain the data features of the second sample data and the enhanced sample data; inputting the data features of the second sample data and the enhanced sample data into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain the retrieval features corresponding to the second sample data and the enhanced sample data; and determining a second loss value according to the difference between the retrieval features respectively corresponding to the second sample data and the enhanced sample data, and adjusting the first feature extraction network and the second feature extraction network corresponding to the image modality or the video modality according to the second loss value.
  • The pre-training method further includes: randomly partially covering the original text content in third sample data belonging to the text modality to obtain masked sample data corresponding to the third sample data; extracting the retrieval feature corresponding to the masked sample data through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality; predicting, according to the retrieval feature corresponding to the masked sample data, the text covered by the random partial masking in the masked sample data; and determining the difference between the predicted text and the original text content as a third loss value, and adjusting the first feature extraction network and the second feature extraction network corresponding to the text modality according to the third loss value.
  • Fig. 8 is a structural block diagram of a multimodal data retrieval device according to yet another exemplary embodiment of the present disclosure.
  • Before the first processing module inputs the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data and obtains the data features of the target retrieval data, the device further includes: an acquisition module 40, configured to acquire a target retrieval task; a determination module 50, configured to determine, according to the target modality corresponding to the target retrieval task, the first feature extraction network and the second feature extraction network that need fine-tuning training; and a fine-tuning module 60, configured to perform fine-tuning training on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and to replace the first feature extraction network and the second feature extraction network with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
  • The retrieval module 30 is further configured to: retrieve for the target retrieval data in a retrieval database according to the target retrieval feature, the retrieval database including the data to be retrieved and/or the retrieval features corresponding to the data to be retrieved, wherein the retrieval features corresponding to the data to be retrieved are obtained through the first feature extraction network and the second feature extraction network corresponding to the data to be retrieved.
  • the second feature extraction network is a Transformer model network.
  • FIG. 9 shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure.
  • The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 9 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.
  • As shown in Fig. 9, an electronic device 900 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 901, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored.
  • the processing device 901, ROM 902, and RAM 903 are connected to each other through a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904 .
  • The following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 907 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 908 including, for example, a magnetic tape and a hard disk; and a communication device 909.
  • The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. While Fig. 9 shows an electronic device 900 having various devices, it should be understood that implementing or having all of the devices shown is not required; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • In such embodiments, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902.
  • When the computer program is executed by the processing device 901, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being incorporated into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data and obtain the data features of the target retrieval data; input the data features into a second feature extraction network corresponding to the modality of the target retrieval data and obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and perform retrieval according to the target retrieval feature.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware, and the name of a module does not, in some cases, constitute a limitation of the module itself.
  • For example, the first processing module can also be described as "a module for inputting the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data and obtaining the data features of the target retrieval data".
  • The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • Example 1 provides a multimodal data retrieval method, including: inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data; inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and performing retrieval according to the target retrieval feature.
  • Example 2 provides the method of Example 1, wherein the first feature extraction network and the second feature extraction network are obtained through pre-training.
  • Example 3 provides the method of Example 2, wherein the first feature extraction network and the second feature extraction network are pre-trained simultaneously, and the pre-training method includes: respectively inputting two or more pieces of first sample data with consistent content but different modalities into the first feature extraction networks corresponding to the modalities of the first sample data to obtain the data features of the first sample data; respectively inputting the data features of the first sample data into the second feature extraction networks corresponding to the first sample data to obtain the retrieval features corresponding to the first sample data; and determining a first loss value according to the difference between the retrieval features obtained for the first sample data of different modalities, and adjusting the first feature extraction network and the second feature extraction network corresponding to each modality according to the first loss value.
  • Example 4 provides the method of Example 3, wherein the pre-training method further includes: performing image enhancement on second sample data belonging to an image modality or a video modality to obtain enhanced sample data corresponding to the second sample data; inputting the second sample data and the enhanced sample data into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain the data features of the second sample data and the enhanced sample data; inputting the data features of the second sample data and the enhanced sample data into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain the retrieval features corresponding to the second sample data and the enhanced sample data; and determining a second loss value according to the difference between the retrieval features respectively corresponding to the second sample data and the enhanced sample data, and adjusting the first feature extraction network and the second feature extraction network corresponding to the image modality or the video modality according to the second loss value.
  • Example 5 provides the method of Example 3, wherein the pre-training method further includes: randomly partially covering the original text content in third sample data belonging to the text modality to obtain masked sample data corresponding to the third sample data; extracting the retrieval feature corresponding to the masked sample data through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality; predicting, according to the retrieval feature corresponding to the masked sample data, the text covered by the random partial masking in the masked sample data; and determining the difference between the predicted text and the original text content as a third loss value, and adjusting the first feature extraction network and the second feature extraction network corresponding to the text modality according to the third loss value.
  • Example 6 provides the method of Example 2, wherein before the target retrieval data is input into the first feature extraction network corresponding to the modality of the target retrieval data and the data features of the target retrieval data are obtained, the method further includes: acquiring a target retrieval task; determining, according to the target modality corresponding to the target retrieval task, the first feature extraction network and the second feature extraction network that need fine-tuning training; and performing fine-tuning training on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and replacing the first feature extraction network and the second feature extraction network with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
  • Example 7 provides the method of any one of Examples 1-6, wherein performing retrieval according to the target retrieval feature includes: retrieving for the target retrieval data in a retrieval database according to the target retrieval feature, the retrieval database including the data to be retrieved and/or the retrieval features corresponding to the data to be retrieved, wherein the retrieval features corresponding to the data to be retrieved are obtained through the first feature extraction network and the second feature extraction network corresponding to the data to be retrieved.
  • Example 8 provides the method of any one of Examples 1-6, and the second feature extraction network is a Transformer model network.
  • Example 9 provides a multimodal data retrieval device, the device including: a first processing module, configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data and obtain the data features of the target retrieval data; a second processing module, configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data and obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and a retrieval module, configured to perform retrieval according to the target retrieval feature.
  • Example 10 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method of any one of Examples 1-8 are implemented.
  • Example 11 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the steps of the method of any one of Examples 1-8.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a multimodal data retrieval method and apparatus, a medium, and an electronic device. The method includes: inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data; inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain the target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks respectively corresponding to the modalities; and performing retrieval according to the target retrieval feature. In this way, better-performing target retrieval features can be extracted, and since the second feature extraction networks of the various modalities share weights, the structure of the network model can be optimized and its training efficiency improved, while the retrieval accuracy is also improved in retrieval tasks of both single-modal retrieval and cross-modal retrieval in any modality.

Description

Multimodal data retrieval method, apparatus, medium, and electronic device
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202110573402.0, filed on May 25, 2021, entitled "Multimodal data retrieval method, apparatus, medium, and electronic device", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of data processing, and in particular, to a multimodal data retrieval method, apparatus, medium, and electronic device.
Background
Content-based multimodal matching has numerous application scenarios in Internet services, including but not limited to image retrieval (e.g., searching images with an image), cross-modal retrieval (e.g., searching images with text, searching text with an image, or searching videos with text), and text matching (searching text with text). To achieve better matching accuracy, the prior art handles cross-modal retrieval tasks by concatenating data of different modalities as the input of a network model and extracting data features of the concatenated data before performing cross-modal retrieval. This process is highly inefficient in practice and cannot meet the speed requirements of real-world scenarios.
Summary
This summary is provided to introduce concepts in a brief form that are described in detail in the detailed description that follows. This summary is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
In a first aspect, the present disclosure provides a multimodal data retrieval method, the method including: inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and performing retrieval according to the target retrieval feature.
In a second aspect, the present disclosure provides a multimodal data retrieval apparatus, the apparatus including: a first processing module configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; a second processing module configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and a retrieval module configured to perform retrieval according to the target retrieval feature.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the method described in the first aspect.
In a fourth aspect, the present disclosure provides an electronic device, including: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect.
Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction networks and second feature extraction networks corresponding to data of different modalities. Because the second feature extraction networks of the respective modalities share weights, the number of parameters used in the overall network model is compressed, the structure of the network model is streamlined, and its training efficiency is improved, while retrieval accuracy is increased in both single-modality and cross-modality retrieval tasks for any modality.
Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
Brief description of the drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart of a multimodal data retrieval method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure.
Fig. 3 shows a multimodal retrieval network model including the first feature extraction network and the second feature extraction network.
Fig. 4 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart of a multimodal data retrieval method according to another exemplary embodiment of the present disclosure.
Fig. 7 is a structural block diagram of a multimodal data retrieval apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a structural block diagram of a multimodal data retrieval apparatus according to another exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
Detailed description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. In addition, the method embodiments may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "include" and its variants as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
It should be noted that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Fig. 1 is a flowchart of a multimodal data retrieval method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101 to 103.
In step 101, target retrieval data is input into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data.
In step 102, the data features are input into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities.
In step 103, retrieval is performed according to the target retrieval feature.
The modality of the target retrieval data may be any modality, and the data to be retrieved may also be of any modality; modalities may include, for example, a text modality, an image modality, a video modality, and so on. For example, the target retrieval data may be image-modality data while the data to be retrieved is text-modality data, or the target retrieval data may be text-modality data while the data to be retrieved is image-modality data. In these cases the retrieval is cross-modal, that is, retrieving, from among data to be retrieved of one modality, the data closest to data of another modality. Alternatively, when the target retrieval data is image-modality data, the data to be retrieved may also be image-modality data, and when the target retrieval data is text-modality data, the data to be retrieved may also be text-modality data; in these cases the retrieval is single-modality retrieval. Whether the modalities of the target retrieval data and the data to be retrieved are the same or different, retrieval can be performed by means of the present disclosure. The actual content of the target retrieval data and the data to be retrieved can be determined in real time according to the specific retrieval task.
The first feature extraction network may be one of one or more feature extraction networks each capable of extracting data features from data of a different modality. For example, a text feature extraction network for extracting features from text-modality data may serve as the first feature extraction network, and a visual feature extraction network for extracting features from image-modality data may also serve as the first feature extraction network; any network capable of extracting features from the target retrieval data can be used as the first feature extraction network.
The specific form of the first feature extraction network depends on the modality of the target retrieval data. For example, if the modality of the target retrieval data is the text modality, the first feature extraction network may be chosen as a text feature extraction network that extracts text features from text-modality data; if the modality of the target retrieval data is the image modality, the first feature extraction network may be chosen as a visual feature extraction network that extracts image features from image-modality data.
The second feature extraction network is used to further extract the target retrieval feature from the data features obtained by the first feature extraction network. The input and output of the second feature extraction network have the same data dimensions, and the target retrieval feature is obtained by applying a pooling operation to the output matrix. Each modality not only has its own corresponding second feature extraction network, but these second feature extraction networks also share weights, that is, the representations learned when training the second feature extraction networks corresponding to the respective modalities are all taken into account. As a result, the final second feature extraction network for any modality can learn a modality-consistent common representation, which not only yields better target retrieval features for cross-modal retrieval but also provides better retrieval accuracy in single-modality retrieval tasks.
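As a minimal sketch of this weight-sharing arrangement (assuming PyTorch; the class and variable names are illustrative, and the disclosure does not prescribe this exact implementation), the second feature extraction network can simply be one module instance reused for every modality, so its weights are shared by construction; the input and output dimensions match, and pooling produces the retrieval feature:

```python
import torch
import torch.nn as nn

class SharedSecondNetwork(nn.Module):
    """Second feature extraction network: input and output have the same
    dimensions, and one instance is reused for every modality, so the
    weights are shared across modalities by construction."""
    def __init__(self, dim=1024, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, data_features):          # (batch, num_blocks, dim)
        out = self.encoder(data_features)      # same shape as the input
        return out.mean(dim=1)                 # pooling -> (batch, dim) retrieval feature

# One shared instance serves both modalities:
shared = SharedSecondNetwork()
visual_feature = shared(torch.randn(2, 256, 1024))  # image-modality data features, 256 blocks
text_feature = shared(torch.randn(2, 30, 1024))     # text-modality data features, 30 blocks
```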
After the target retrieval feature corresponding to the target retrieval data is obtained through the first feature extraction network and the second feature extraction network, when retrieving with the target retrieval data, the first feature extraction network and second feature extraction network corresponding to the modality of each piece of data to be retrieved may likewise be determined to obtain the retrieval feature of each piece of data to be retrieved. The target retrieval feature and the retrieval features of the data to be retrieved obtained in this way yield better retrieval accuracy.
In one possible implementation, the retrieval according to the target retrieval feature may be performed by searching for the target retrieval data in a retrieval database according to the target retrieval feature. The retrieval database may also be determined according to the actual retrieval task: if the actual retrieval task is single-modality retrieval over text, the data to be retrieved in the selected retrieval database may all be text-modality data; if the actual retrieval task is to search image-modality data using text-modality target retrieval data, the data to be retrieved in the retrieval database may all be image-modality data. In addition, the retrieval database may include the data to be retrieved, may directly include the retrieval features corresponding to the data to be retrieved, or may include both. When the retrieval database directly includes the retrieval features corresponding to the data to be retrieved, those retrieval features may likewise be obtained through the first feature extraction network and the second feature extraction network corresponding to the modality of the data to be retrieved.
During retrieval, the similarity between the target retrieval feature and the retrieval feature of each piece of data to be retrieved can be computed and the results ranked; the higher the similarity, the more semantically similar the corresponding data is to the target retrieval data.
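A minimal sketch of this similarity computation and ranking, assuming cosine similarity as the measure (the disclosure does not fix a particular similarity function):

```python
import torch
import torch.nn.functional as F

def rank_candidates(target_feature, candidate_features, top_k=5):
    """Rank data to be retrieved by similarity to the target retrieval feature.

    target_feature: (dim,) retrieval feature of the target retrieval data.
    candidate_features: (num_candidates, dim) retrieval features from the database.
    Returns (index, score) pairs, best match first.
    """
    sims = F.cosine_similarity(target_feature.unsqueeze(0), candidate_features, dim=1)
    scores, indices = sims.topk(min(top_k, sims.numel()))
    return list(zip(indices.tolist(), scores.tolist()))
```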
Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction networks and second feature extraction networks corresponding to data of different modalities. Because the second feature extraction networks of the respective modalities share weights, the number of parameters used in the overall network model is compressed, the structure of the network model is streamlined, and its training efficiency is improved, while retrieval accuracy is increased in both single-modality and cross-modality retrieval tasks for any modality.
Fig. 2 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure. The first feature extraction network and the second feature extraction network are obtained through pre-training, and the pre-training is performed on both networks simultaneously. As shown in Fig. 2, the pre-training method includes steps 201 to 203.
In step 201, two or more first sample data that are consistent in content but different in modality are respectively input into the first feature extraction networks corresponding to the modalities of the first sample data to obtain data features of the first sample data.
In step 202, the data features of the first sample data are respectively input into the second feature extraction networks corresponding to the first sample data to obtain retrieval features corresponding to the first sample data.
In step 203, a first loss value is determined according to the differences between the retrieval features corresponding to the first sample data of different modalities, and the first feature extraction network and second feature extraction network corresponding to each modality are adjusted according to the first loss value.
Fig. 3 shows a multimodal retrieval network model including the first feature extraction network and the second feature extraction network. As shown in Fig. 3, the overall network model includes the first feature extraction network 10 and the second feature extraction network 20. Fig. 3 also shows, by way of example, a visual feature extraction network 11 and a text feature extraction network 12 that may be included in the first feature extraction network 10, as well as the second feature extraction network 21 corresponding to the image modality and the second feature extraction network 22 corresponding to the text modality. For image-modality data such as image or video data 1, data features can be obtained through the visual feature extraction network 11 and input into the second feature extraction network 21 corresponding to the image modality for further feature extraction, yielding the final visual retrieval feature 3. Text data 2 can be input into the text feature extraction network 12 to obtain data features, which are then input into the second feature extraction network 22 corresponding to the text modality for further feature extraction, yielding the final text retrieval feature 4.
The pre-training method shown in Fig. 2 is described below using the exemplary network model shown in Fig. 3. Suppose the first sample data includes a piece of text data whose content is "puppy" and a piece of image data whose content is "puppy". The text data can be input into the text feature extraction network 12 to obtain data features, which are then input into the second feature extraction network 22 to obtain the text retrieval feature 4 corresponding to the text data; the image data can be input into the visual feature extraction network 11 to obtain data features, which are then input into the second feature extraction network 21 to obtain the visual retrieval feature 3 corresponding to the image data. Finally, a first loss value is determined according to the difference between the retrieval features corresponding to the two sample data, and the parameters of the text feature extraction network 12, the visual feature extraction network 11, the second feature extraction network 22, and the second feature extraction network 21 are adjusted according to the first loss value. In this way, the first feature extraction networks and second feature extraction networks corresponding to the different modalities are pre-trained by contrastive learning over data of different modalities.
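One way to realize this first loss is an InfoNCE-style contrastive objective over a batch of content-matched pairs; the following is a sketch under that assumption (the disclosure only requires that the first loss value reflect the difference between the retrieval features of the paired samples):

```python
import torch
import torch.nn.functional as F

def first_loss(text_features, visual_features, temperature=0.07):
    """Contrastive loss over a batch of content-matched (text, image) pairs.

    text_features, visual_features: (batch, dim) retrieval features; row i of
    each tensor comes from the same content (e.g., the text "puppy" and a puppy image).
    """
    text_features = F.normalize(text_features, dim=1)
    visual_features = F.normalize(visual_features, dim=1)
    logits = text_features @ visual_features.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(text_features.size(0))               # matched pairs on the diagonal
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```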
Through the above technical solution, the first feature extraction network and the second feature extraction network can be pre-trained simultaneously on first sample data of different modalities but consistent content, so that the corresponding first and second feature extraction networks learn representations of the related semantics of data in different modalities. The extracted retrieval features can thus focus more on the content of the data while reducing the influence of the data modality on the retrieval features, thereby improving the accuracy of cross-modal retrieval.
Fig. 4 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure. As shown in Fig. 4, the pre-training method further includes steps 401 to 404.
In step 401, image augmentation is performed on second sample data belonging to the image modality or the video modality to obtain augmented sample data corresponding to the second sample data. The image augmentation may be performed in any manner; the present disclosure places no restriction on the image augmentation method.
In step 402, the second sample data and the augmented sample data are input into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain data features of the second sample data and the augmented sample data.
In step 403, the data features of the second sample data and the augmented sample data are input into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain retrieval features corresponding to the second sample data and the augmented sample data.
In step 404, a second loss value is determined according to the difference between the retrieval features corresponding to the second sample data and the augmented sample data, and the first feature extraction network and second feature extraction network corresponding to the image modality or the video modality are adjusted according to the second loss value.
That is, the pre-training method also includes image self-supervised pre-training for the first and second feature extraction networks corresponding to the image modality or video modality, i.e., training aimed at single-modality retrieval, which to a certain extent guarantees the retrieval accuracy of the first and second feature extraction networks on single-modality data. Moreover, because the second feature extraction networks corresponding to the respective modalities share weights, the second feature extraction network can better learn representations of different semantics under each modality, which also improves the accuracy of cross-modal retrieval to a certain extent.
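A sketch of this image self-supervised step, assuming torchvision-style augmentations and reusing the contrastive objective sketched above (visual_net and shared_net are hypothetical names for the image-modality first and second feature extraction networks):

```python
import torch
from torchvision import transforms

# One possible augmentation pipeline; the disclosure allows any augmentation method.
augment = transforms.Compose([
    transforms.RandomResizedCrop(512),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
])

def second_loss(images, visual_net, shared_net):
    """Image self-supervised step: a second sample and its augmented view
    should map to nearby retrieval features."""
    views = torch.stack([augment(img) for img in images])  # augmented sample data
    feats_orig = shared_net(visual_net(images))            # retrieval features of the originals
    feats_aug = shared_net(visual_net(views))              # retrieval features of the augmented views
    return first_loss(feats_orig, feats_aug)               # reuse the contrastive objective above
```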
Fig. 5 is a flowchart of a method for pre-training the first feature extraction network and the second feature extraction network in a multimodal data retrieval method according to another exemplary embodiment of the present disclosure. As shown in Fig. 5, the pre-training method further includes steps 501 to 504.
In step 501, random partial masking is applied to the original text content in third sample data belonging to the text modality to obtain masked sample data corresponding to the third sample data.
In step 502, retrieval features corresponding to the masked sample data are extracted through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality.
In step 503, the text that was randomly masked in the masked sample data is predicted according to the retrieval features corresponding to the masked sample data.
In step 504, a third loss value is determined from the difference between the predicted text and the original text content, and the first feature extraction network and second feature extraction network corresponding to the text modality are adjusted according to the third loss value.
That is, the pre-training method also includes text self-supervised pre-training for the first and second feature extraction networks corresponding to the text modality, i.e., another training aimed at single-modality retrieval. Like the image self-supervised pre-training for the image or video modality, this to a certain extent guarantees the retrieval accuracy of the first and second feature extraction networks on single-modality data. Moreover, because the second feature extraction networks corresponding to the respective modalities share weights, the second feature extraction network can better learn representations of different semantics under each modality, which also improves the accuracy of cross-modal retrieval to a certain extent.
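A sketch of this text self-supervised step as token-level masked prediction (MASK_ID, vocab_head, and the module names are assumptions; the disclosure only specifies masking, prediction, and a loss on the difference between the predicted and original text):

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # hypothetical mask-token id

def third_loss(token_ids, text_net, shared_net, vocab_head, mask_prob=0.15):
    """Text self-supervised step: randomly mask part of the original text and
    train the text-modality networks to predict the masked tokens.

    token_ids: (batch, seq) original text; text_net is the text-modality first
    feature extraction network, shared_net the second, and vocab_head a linear
    layer mapping features to vocabulary logits (all hypothetical names).
    """
    mask = torch.rand(token_ids.shape) < mask_prob
    masked_ids = token_ids.masked_fill(mask, MASK_ID)       # masked sample data
    token_feats = shared_net.encoder(text_net(masked_ids))  # per-token features, (batch, seq, dim)
    logits = vocab_head(token_feats)                        # (batch, seq, vocab)
    # Third loss: difference between predicted and original text at masked positions.
    return F.cross_entropy(logits[mask], token_ids[mask])
```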
Fig. 6 is a flowchart of a multimodal data retrieval method according to another exemplary embodiment of the present disclosure. As shown in Fig. 6, before step 101, the method further includes steps 601 to 603.
In step 601, a target retrieval task is acquired.
In step 602, the first feature extraction network and the second feature extraction network that need to undergo fine-tuning training are determined according to the target modality corresponding to the target retrieval task.
In step 603, fine-tuning training is performed on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and the first feature extraction network and the second feature extraction network are replaced with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
That is, in addition to the above pre-training of the network model, after the target retrieval task is actually acquired, fine-tuning training can be performed on the first feature extraction network and second feature extraction network related to the modality corresponding to the target retrieval task, adjusting them into feature extraction networks better suited to the target retrieval task. For example, if the target retrieval task is a single-modality retrieval task over text, text-modality training data can be used to fine-tune the first feature extraction network and second feature extraction network corresponding to the text modality, e.g., the text feature extraction network 12 and the second feature extraction network 22 in Fig. 3. The fine-tuning method based on text-modality training data can be the same as the training of the text-modality first and second feature extraction networks in the pre-training process described above: apply random partial masking to the original text content in fourth sample data belonging to the text modality to obtain masked sample data corresponding to the fourth sample data; extract retrieval features corresponding to the masked sample data through the first and second feature extraction networks corresponding to the text modality; predict, from the retrieval features extracted in the previous step, the text randomly masked in the masked sample data; determine a loss value from the difference between the predicted text and the masked portion of the original text content in the fourth sample data, and adjust the first and second feature extraction networks corresponding to the text modality according to that loss value. Finally, retrieval is performed using the fine-tuned first and second feature extraction networks. The fourth sample data used for the fine-tuning training can be obtained independently for the target retrieval task, or can be the training data used during the pre-training.
If the target retrieval task is a single-modality retrieval task over images, only image-modality training data needs to be used to fine-tune the first feature extraction network and second feature extraction network corresponding to the image modality, e.g., the visual feature extraction network 11 and the second feature extraction network 21 in Fig. 3. If the target retrieval task is a cross-modal retrieval task between the text modality and the image modality, training data of both the image modality and the text modality are needed to simultaneously fine-tune the first and second feature extraction networks corresponding to the image modality and the text modality, e.g., the visual feature extraction network 11, the text feature extraction network 12, the second feature extraction network 21, and the second feature extraction network 22 in Fig. 3.
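A hypothetical sketch of this task-dependent selection of the networks to fine-tune (the mapping structure and names are illustrative, not taken from the disclosure):

```python
def networks_to_finetune(task_modalities, first_nets, second_nets):
    """Select the feature extraction networks involved in a target retrieval task.

    task_modalities: modalities of the task, e.g. {"text"} for single-modality
    text retrieval or {"text", "image"} for cross-modal retrieval.
    first_nets / second_nets: dicts mapping modality names to networks.
    """
    selected = []
    for modality in task_modalities:
        selected.append(first_nets[modality])   # first feature extraction network
        selected.append(second_nets[modality])  # weight-shared second network
    return selected
```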
Through the above technical solution, after the fine-tuning training, the related first and second feature extraction networks are further adjusted according to the target retrieval task, so that the first and second feature extraction networks related to the target retrieval task perform better in that task. Compared with directly using the pre-trained first and second feature extraction networks, the retrieval accuracy for the target retrieval task can be further improved.
In one possible implementation, the second feature extraction network is a Transformer model network. When the second feature extraction network is a Transformer model network, the number of blocks of the input data may differ across modalities; for example, the data features corresponding to the image modality may be 256 blocks, while the data size for the text modality may be 30 blocks. To ensure that data of different modalities ultimately yield the same data dimension, the data dimension output by the first feature extraction network of each modality can be adjusted so that the dimensions of the target retrieval features finally output by the second feature extraction networks of the respective modalities are the same. For example, the visual feature extraction network 11 shown in Fig. 3 may be, e.g., a CNN (convolutional neural network) whose input is an image uniformly resized to 512*512 within the network; the visual feature map obtained after feature extraction may be, e.g., 16*16*2048, which is flattened to 256*2048 and finally mapped to 1024 dimensions through a fully connected layer as the output of this network module. The text feature extraction network 12 shown in Fig. 3 may be, e.g., an LSTM or GRU based on a recurrent neural network, whose input is a piece of text encoded as 768-dimensional vectors; after feature extraction, a 30*768-dimensional text feature is obtained, which is likewise mapped to 1024 dimensions through a fully connected layer before serving as the output of this network module.
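The dimensional bookkeeping described above can be sketched as follows, assuming a ResNet-50 backbone for the CNN and a GRU for the text encoder (the text only says "a CNN" and "an LSTM or GRU", so these specific choices are assumptions):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VisualFirstNetwork(nn.Module):
    """512x512 image -> 16x16x2048 feature map -> 256x2048 -> 256 blocks of 1024 dims."""
    def __init__(self, out_dim=1024):
        super().__init__()
        backbone = resnet50()
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep only conv stages
        self.proj = nn.Linear(2048, out_dim)                       # fully connected mapping

    def forward(self, images):                    # (batch, 3, 512, 512)
        fmap = self.cnn(images)                   # (batch, 2048, 16, 16)
        blocks = fmap.flatten(2).transpose(1, 2)  # flatten to (batch, 256, 2048)
        return self.proj(blocks)                  # (batch, 256, 1024)

class TextFirstNetwork(nn.Module):
    """30 tokens encoded as 768-dim vectors -> GRU -> 30x768 -> 30 blocks of 1024 dims."""
    def __init__(self, vocab_size=30000, out_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 768)
        self.gru = nn.GRU(768, 768, batch_first=True)
        self.proj = nn.Linear(768, out_dim)

    def forward(self, token_ids):                    # (batch, 30)
        hidden, _ = self.gru(self.embed(token_ids))  # (batch, 30, 768)
        return self.proj(hidden)                     # (batch, 30, 1024)
```

Both modules end in a fully connected projection to the same 1024-dimensional space, which is what lets the weight-shared second feature extraction network consume blocks from either modality.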
Fig. 7 is a structural block diagram of a multimodal data retrieval apparatus according to an exemplary embodiment of the present disclosure. As shown in Fig. 7, the apparatus includes: a first processing module 10 configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; a second processing module 20 configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and a retrieval module 30 configured to perform retrieval according to the target retrieval feature.
Through the above technical solutions, target retrieval features better suited to multimodal retrieval can be extracted through the first feature extraction networks and second feature extraction networks corresponding to data of different modalities. Because the second feature extraction networks of the respective modalities share weights, the number of parameters used in the overall network model is compressed, the structure of the network model is streamlined, and its training efficiency is improved, while retrieval accuracy is increased in both single-modality and cross-modality retrieval tasks for any modality.
In one possible implementation, the first feature extraction network and the second feature extraction network are obtained through pre-training.
In one possible implementation, the first feature extraction network and the second feature extraction network undergo the pre-training simultaneously, and the pre-training method includes: inputting two or more first sample data that are consistent in content but different in modality respectively into the first feature extraction networks corresponding to the modalities of the first sample data to obtain data features of the first sample data; inputting the data features of the first sample data respectively into the second feature extraction networks corresponding to the first sample data to obtain retrieval features corresponding to the first sample data; and determining a first loss value according to the differences between the retrieval features corresponding to the first sample data of different modalities, and adjusting the first and second feature extraction networks corresponding to each modality according to the first loss value.
In one possible implementation, the pre-training method further includes: performing image augmentation on second sample data belonging to the image modality or the video modality to obtain augmented sample data corresponding to the second sample data; inputting the second sample data and the augmented sample data into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain data features of the second sample data and the augmented sample data; inputting the data features of the second sample data and the augmented sample data into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain retrieval features corresponding to the second sample data and the augmented sample data; and determining a second loss value according to the difference between the retrieval features corresponding to the second sample data and the augmented sample data, and adjusting the first and second feature extraction networks corresponding to the image modality or the video modality according to the second loss value.
In one possible implementation, the pre-training method further includes: applying random partial masking to the original text content in third sample data belonging to the text modality to obtain masked sample data corresponding to the third sample data; extracting retrieval features corresponding to the masked sample data through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality; predicting, according to the retrieval features corresponding to the masked sample data, the text randomly masked in the masked sample data; and determining a third loss value from the difference between the predicted text and the original text content, and adjusting the first and second feature extraction networks corresponding to the text modality according to the third loss value.
Fig. 8 is a structural block diagram of a multimodal data retrieval apparatus according to another exemplary embodiment of the present disclosure. As shown in Fig. 8, before the first processing module inputs the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data, the apparatus further includes: an acquisition module 40 configured to acquire a target retrieval task; a determination module 50 configured to determine, according to the target modality corresponding to the target retrieval task, the first feature extraction network and the second feature extraction network that need to undergo the fine-tuning training; and a fine-tuning module 60 configured to perform fine-tuning training on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and to replace the first feature extraction network and the second feature extraction network with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
In one possible implementation, the retrieval module 30 is further configured to: search for the target retrieval data in a retrieval database according to the target retrieval feature, where the retrieval database includes data to be retrieved and/or retrieval features corresponding to the data to be retrieved, and the retrieval features corresponding to the data to be retrieved are obtained through the first feature extraction network and the second feature extraction network corresponding to the data to be retrieved.
In one possible implementation, the second feature extraction network is a Transformer model network.
Referring now to Fig. 9, a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 9 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 9, the electronic device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. Various programs and data required for the operation of the electronic device 900 are also stored in the RAM 903. The processing device 901, the ROM 902, and the RAM 903 are connected to one another via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 907 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 908 including, for example, a magnetic tape and a hard disk; and a communication device 909. The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 9 shows the electronic device 900 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program that may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device; it may also exist independently without being assembled into the electronic device.
The above computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; input the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and perform retrieval according to the target retrieval feature.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the first processing module may also be described as "a module that inputs target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, Example 1 provides a multimodal data retrieval method, including: inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and performing retrieval according to the target retrieval feature.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, where the first feature extraction network and the second feature extraction network are obtained through pre-training.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, where the first feature extraction network and the second feature extraction network undergo pre-training simultaneously, and the pre-training method includes: inputting two or more first sample data that are consistent in content but different in modality respectively into the first feature extraction networks corresponding to the modalities of the first sample data to obtain data features of the first sample data; inputting the data features of the first sample data respectively into the second feature extraction networks corresponding to the first sample data to obtain retrieval features corresponding to the first sample data; and determining a first loss value according to the differences between the retrieval features corresponding to the first sample data of different modalities, and adjusting the first and second feature extraction networks corresponding to each modality according to the first loss value.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, where the pre-training method further includes: performing image augmentation on second sample data belonging to the image modality or the video modality to obtain augmented sample data corresponding to the second sample data; inputting the second sample data and the augmented sample data into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain data features of the second sample data and the augmented sample data; inputting the data features of the second sample data and the augmented sample data into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain retrieval features corresponding to the second sample data and the augmented sample data; and determining a second loss value according to the difference between the retrieval features corresponding to the second sample data and the augmented sample data, and adjusting the first and second feature extraction networks corresponding to the image modality or the video modality according to the second loss value.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 3, where the pre-training method further includes: applying random partial masking to the original text content in third sample data belonging to the text modality to obtain masked sample data corresponding to the third sample data; extracting retrieval features corresponding to the masked sample data through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality; predicting, according to the retrieval features corresponding to the masked sample data, the text randomly masked in the masked sample data; and determining a third loss value from the difference between the predicted text and the original text content, and adjusting the first and second feature extraction networks corresponding to the text modality according to the third loss value.
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 2, where before inputting the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data, the method further includes: acquiring a target retrieval task; determining, according to the target modality corresponding to the target retrieval task, the first feature extraction network and the second feature extraction network that need to undergo the fine-tuning training; and performing fine-tuning training on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and replacing the first feature extraction network and the second feature extraction network with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1-6, where the performing retrieval according to the target retrieval feature includes: searching for the target retrieval data in a retrieval database according to the target retrieval feature, where the retrieval database includes data to be retrieved and/or retrieval features corresponding to the data to be retrieved, and the retrieval features corresponding to the data to be retrieved are obtained through the first feature extraction network and the second feature extraction network corresponding to the data to be retrieved.
According to one or more embodiments of the present disclosure, Example 8 provides the method of any one of Examples 1-6, where the second feature extraction network is a Transformer model network.
According to one or more embodiments of the present disclosure, Example 9 provides a multimodal data retrieval apparatus, the apparatus including: a first processing module configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data; a second processing module configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, where weights are shared among the second feature extraction networks corresponding to the respective modalities; and a retrieval module configured to perform retrieval according to the target retrieval feature.
According to one or more embodiments of the present disclosure, Example 10 provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the method of any one of Examples 1-8.
According to one or more embodiments of the present disclosure, Example 11 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device to implement the steps of the method of any one of Examples 1-8.
The above description is merely a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims. As for the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments relating to the method, and will not be elaborated here.

Claims (11)

  1. A multimodal data retrieval method, characterized in that the method comprises:
    inputting target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data;
    inputting the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks corresponding to the respective modalities;
    performing retrieval according to the target retrieval feature.
  2. The method according to claim 1, characterized in that the first feature extraction network and the second feature extraction network are obtained through pre-training.
  3. The method according to claim 2, characterized in that the first feature extraction network and the second feature extraction network undergo the pre-training simultaneously, and the pre-training method comprises:
    inputting two or more first sample data that are consistent in content but different in modality respectively into the first feature extraction networks corresponding to the modalities of the first sample data to obtain data features of the first sample data;
    inputting the data features of the first sample data respectively into the second feature extraction networks corresponding to the first sample data to obtain retrieval features corresponding to the first sample data;
    determining a first loss value according to the differences between the retrieval features corresponding to the first sample data of different modalities, and adjusting the first feature extraction network and the second feature extraction network corresponding to each modality according to the first loss value.
  4. The method according to claim 3, characterized in that the pre-training method further comprises:
    performing image augmentation on second sample data belonging to an image modality or a video modality to obtain augmented sample data corresponding to the second sample data;
    inputting the second sample data and the augmented sample data into the first feature extraction network corresponding to the image modality or the video modality to respectively obtain data features of the second sample data and the augmented sample data;
    inputting the data features of the second sample data and the augmented sample data into the second feature extraction network corresponding to the image modality or the video modality to respectively obtain retrieval features corresponding to the second sample data and the augmented sample data;
    determining a second loss value according to the difference between the retrieval features respectively corresponding to the second sample data and the augmented sample data, and adjusting the first feature extraction network and the second feature extraction network corresponding to the image modality or the video modality according to the second loss value.
  5. The method according to claim 3, characterized in that the pre-training method further comprises:
    applying random partial masking to original text content in third sample data belonging to a text modality to obtain masked sample data corresponding to the third sample data;
    extracting retrieval features corresponding to the masked sample data through the first feature extraction network corresponding to the text modality and the second feature extraction network corresponding to the text modality;
    predicting, according to the retrieval features corresponding to the masked sample data, the text randomly masked in the masked sample data;
    determining a third loss value from the difference between the predicted text and the original text content, and adjusting the first feature extraction network and the second feature extraction network corresponding to the text modality according to the third loss value.
  6. The method according to claim 2, characterized in that before inputting the target retrieval data into the first feature extraction network corresponding to the modality of the target retrieval data to obtain the data features of the target retrieval data, the method further comprises:
    acquiring a target retrieval task;
    determining, according to the target modality corresponding to the target retrieval task, the first feature extraction network and the second feature extraction network that need to undergo the fine-tuning training;
    performing fine-tuning training on the first feature extraction network and the second feature extraction network according to fourth sample data corresponding to the target retrieval task, and replacing the first feature extraction network and the second feature extraction network with the fine-tuned first feature extraction network and the fine-tuned second feature extraction network.
  7. The method according to any one of claims 1-6, characterized in that the performing retrieval according to the target retrieval feature comprises:
    searching for the target retrieval data in a retrieval database according to the target retrieval feature, the retrieval database comprising data to be retrieved and/or retrieval features corresponding to the data to be retrieved, wherein the retrieval features corresponding to the data to be retrieved are obtained through the first feature extraction network and the second feature extraction network corresponding to the modality of the data to be retrieved.
  8. The method according to any one of claims 1-6, characterized in that the second feature extraction network is a Transformer model network.
  9. A multimodal data retrieval apparatus, characterized in that the apparatus comprises:
    a first processing module configured to input target retrieval data into a first feature extraction network corresponding to the modality of the target retrieval data to obtain data features of the target retrieval data;
    a second processing module configured to input the data features into a second feature extraction network corresponding to the modality of the target retrieval data to obtain a target retrieval feature corresponding to the target retrieval data, wherein weights are shared among the second feature extraction networks corresponding to the respective modalities;
    a retrieval module configured to perform retrieval according to the target retrieval feature.
  10. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, implements the steps of the method according to any one of claims 1-8.
  11. An electronic device, characterized by comprising:
    a storage device on which a computer program is stored;
    a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1-8.
PCT/CN2022/089241 2021-05-25 2022-04-26 Multimodal data retrieval method, apparatus, medium, and electronic device WO2022247562A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/563,222 US20240233334A1 (en) 2021-05-25 2022-04-26 Multi-modal data retrieval method and apparatus, medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110573402.0 2021-05-25
CN202110573402.0A CN113449070A (zh) Multimodal data retrieval method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2022247562A1 true WO2022247562A1 (zh) 2022-12-01

Family

ID=77810171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089241 WO2022247562A1 (zh) Multimodal data retrieval method, apparatus, medium, and electronic device

Country Status (3)

Country Link
US (1) US20240233334A1 (zh)
CN (1) CN113449070A (zh)
WO (1) WO2022247562A1 (zh)

Also Published As

Publication number Publication date
US20240233334A1 (en) 2024-07-11
CN113449070A (zh) 2021-09-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22810295; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 18563222; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22810295; Country of ref document: EP; Kind code of ref document: A1)