CN114861016A - Cross-modal retrieval method and device and storage medium - Google Patents


Info

Publication number
CN114861016A
Authority
CN
China
Prior art keywords
retrieval
data
feature extraction
feature
text
Prior art date
Legal status
Pending
Application number
CN202210781046.6A
Other languages
Chinese (zh)
Inventor
阮晓峰
王坚
李兵
余昊楠
胡卫明
Current Assignee
Renmin Zhongke Beijing Intelligent Technology Co ltd
Original Assignee
Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Renmin Zhongke Beijing Intelligent Technology Co ltd filed Critical Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority to CN202210781046.6A
Publication of CN114861016A
Legal status: Pending (Current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a cross-modal retrieval method, a cross-modal retrieval device, and a storage medium. The cross-modal retrieval method comprises the following steps: receiving retrieval data and determining the modality of the retrieval data; inputting the retrieval data into a feature extraction model having at least two feature extraction units, and extracting a feature representation vector of the retrieval data through the feature extraction unit corresponding to the modality of the retrieval data; traversing the index database according to the feature representation vector, and querying a plurality of candidate retrieval results related to the retrieval data; and inputting the retrieval data and the candidate retrieval results into a similarity calculation model having a multi-modal fusion feature extraction unit, calculating similarities, and ranking the candidate retrieval results according to the similarities.

Description

Cross-modal retrieval method and device and storage medium
Technical Field
The present application relates to the field of cross-modal retrieval technologies, and in particular, to a cross-modal retrieval method, apparatus, and storage medium.
Background
Currently, cross-modal retrieval is being applied more and more widely. Through a cross-modal retrieval model provided on a computing device, a user can search by inputting retrieval data of different modalities. For example, when the user wants to retrieve information related to "airplane", the text "airplane" may be input to the computing device, or a picture or video containing an airplane may be input instead. Accordingly, the computing device searches according to the text "airplane", the picture containing the airplane, or the video containing the airplane input by the user, and returns retrieval results related to the "airplane" topic.
The published invention patent application CN114048282A discloses an image-text cross-modal retrieval method and system based on text-tree local matching. The method comprises: acquiring a data set, preprocessing and dividing it to obtain a training set; inputting the pictures and texts of the training set into the corresponding networks for feature extraction to obtain picture features and text features; generating a text tree according to the text features; calculating image-text similarity from the text tree and the picture features and training the network by back propagation to obtain a cross-modal retrieval model; and inputting the data to be retrieved into the cross-modal retrieval model to obtain a retrieval result.
The published patent application CN114003753A discloses a picture retrieval method and device. The method comprises: extracting features from the picture to be retrieved to obtain a first feature vector; determining, from a feature library, the second feature vectors that meet a first similarity requirement with the first feature vector; clustering the second feature vectors and taking the resulting cluster centers as third feature vectors serving as retrieval samples; for each third feature vector, determining, from the feature library, the fourth feature vectors that meet a second similarity requirement with it; and determining the retrieval result corresponding to the picture to be retrieved through the fourth feature vectors. This method only covers retrieval with a picture as input.
Cross-modal retrieval models can be roughly divided into single-stream retrieval models and dual-stream retrieval models. A single-stream retrieval model uses one multi-modal fusion feature extraction unit to extract features from retrieval data of all modalities, so data of different modalities pass through the same feature extraction unit. A dual-stream retrieval model contains several independent feature extraction units that extract features from retrieval data of different modalities separately; for example, a dual-stream retrieval model extracts features from text retrieval data with a text feature extraction unit and from image or video retrieval data with an image feature extraction unit.
In the feature extraction approach of the single-stream retrieval model, the same feature extraction unit extracts features from retrieval data of different modalities, so data of different modalities can interact fully and feature extraction performance is good. However, the feature extraction unit of the single-stream retrieval model scales poorly and is time-consuming in practical retrieval tasks. In the feature extraction approach of the dual-stream retrieval model, different feature extraction units extract features from retrieval data of different modalities, so the model is highly extensible and retrieval is fast in practical retrieval tasks. However, because different feature extraction units are used for different modalities, the data of different modalities cannot interact as fully as in the single-stream model, feature extraction performance is weaker, and retrieval accuracy drops. At present, therefore, no cross-modal retrieval model achieves both retrieval accuracy and retrieval speed.
No effective solution has yet been proposed for the technical problem that existing cross-modal retrieval models cannot achieve both retrieval accuracy and retrieval speed.
Disclosure of Invention
The embodiments of the disclosure provide a cross-modal retrieval method, a cross-modal retrieval apparatus, and a storage medium, so as to solve at least the technical problem that cross-modal retrieval models in the prior art cannot simultaneously achieve retrieval precision, retrieval speed, and model extensibility.
According to an aspect of the embodiments of the present disclosure, there is provided a cross-modal retrieval method, including: receiving retrieval data and determining the modality of the retrieval data; inputting the retrieval data into a feature extraction model having at least two feature extraction units, and extracting a feature representation vector of the retrieval data through the feature extraction unit corresponding to the modality of the retrieval data; traversing the index database according to the feature representation vector, and querying a plurality of candidate retrieval results related to the retrieval data; and inputting the retrieval data and the candidate retrieval results into a similarity calculation model having a multi-modal fusion feature extraction unit, calculating similarities, and ranking the candidate retrieval results according to the similarities.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein the above method is performed by a processor when the program is executed.
According to another aspect of the embodiments of the present disclosure, there is also provided a cross-modal retrieval apparatus, including: a retrieval data receiving module, configured to receive retrieval data and determine the modality of the retrieval data; a feature extraction module, configured to input the retrieval data into a feature extraction model having at least two feature extraction units and extract a feature representation vector of the retrieval data through the feature extraction unit corresponding to the modality of the retrieval data; a query module, configured to traverse the index database according to the feature representation vector and query a plurality of candidate retrieval results related to the retrieval data; and a ranking display module, configured to input the retrieval data and the candidate retrieval results into a similarity calculation model having a multi-modal fusion feature extraction unit, calculate similarities, and rank the candidate retrieval results according to the similarities.
According to another aspect of the embodiments of the present disclosure, there is also provided a cross-modal retrieval apparatus, including: a processor; and a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps: receiving retrieval data and determining the modality of the retrieval data; inputting the retrieval data into a feature extraction model having at least two feature extraction units, and extracting a feature representation vector of the retrieval data through the feature extraction unit corresponding to the modality of the retrieval data; traversing the index database according to the feature representation vector, and querying a plurality of candidate retrieval results related to the retrieval data; and inputting the retrieval data and the candidate retrieval results into a similarity calculation model having a multi-modal fusion feature extraction unit, calculating similarities, and ranking the candidate retrieval results according to the similarities.
Thus, the technical solution of this embodiment fuses feature information of different layers through the multi-semantic fusion operation, makes full use of the information of each feature extraction layer, and fully mines and represents the semantic information of the multimedia information, thereby alleviating the semantic gap between information of different modalities and improving the accuracy of cross-modal retrieval. This solves the technical problem in the prior art that, due to the semantic gap between data of different modalities, existing feature extraction models cannot accurately retrieve data information of different modalities during cross-modal retrieval.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware block diagram of a computing device for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic diagram of a cross-modal retrieval system according to embodiment 1 of the present disclosure;
fig. 3 is a block schematic diagram of a cross-modal retrieval system according to embodiment 1 of the present disclosure;
fig. 4 is a schematic flowchart of a cross-modal retrieval method according to the first aspect of embodiment 1 of the present disclosure;
fig. 5 is a schematic diagram of a feature extraction module of the search server according to embodiment 1 of the present disclosure;
fig. 6 is a schematic diagram of a ranking module of a retrieval server according to embodiment 1 of the present disclosure;
fig. 7 is a schematic diagram of training a feature extraction model of a feature extraction module according to embodiment 1 of the present disclosure;
fig. 8 is a schematic structural diagram of each feature extraction unit in the feature extraction model and the similarity calculation model according to embodiment 1 of the present disclosure;
FIG. 9 is a schematic diagram of a feature extraction model according to the prior art;
fig. 10 is a schematic diagram of a cross-modal retrieval apparatus according to embodiment 2 of the present disclosure; and
fig. 11 is a schematic diagram of a cross-modal retrieval apparatus according to embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, there is provided a method embodiment of a semantic-based data processing method, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Fig. 1 shows a hardware block diagram of a computing device for implementing the semantic-based data processing method. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the semantic-based data processing method in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the semantic-based data processing method of the application software. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in a computing device as described above.
Fig. 2 is a schematic diagram of a cross-modal retrieval system according to the present embodiment. Referring to fig. 2, the system includes: a terminal device 100 and a search server 200 communicatively connected to the terminal device 100 via a network. The user may input, for example, search data containing semantic information at the terminal device 100, and the search data may be, for example, text information, an image, audio, or video input by the user. Thus, the terminal device 100 transmits the retrieval data input by the user to the retrieval server 200. After receiving the search data, the search server 200 performs a cross-modality search based on the search data, thereby returning search results of different modalities related to the search data to the terminal device 100.
Fig. 3 shows a schematic block diagram of the search server 200. Referring to fig. 3, the retrieval server 200 includes a feature extraction module 210, a multimedia index repository 220, a retrieval engine 230, a ranking module 240, and a crawler module 250.
The crawler module 250 is used for crawling multimedia information from the internet, such as text data, audio data, image data, video data, and the like.
The feature extraction module 210 is configured to receive the crawled multimedia information from the crawler module 250, extract feature representation vectors of the multimedia information from the received multimedia information, and save the extracted feature representation vectors in the multimedia index repository 220. And the feature extraction module 210 is further configured to receive search data input by a user from the terminal device 100, extract a feature representation vector of the search data, and transmit the extracted feature representation vector to the search engine 230.
The multimedia index library 220 is used for storing the feature representation vectors extracted by the feature extraction module 210 for the multimedia information crawled by the crawler module 250 and corresponding indexes.
The search engine 230 is configured to receive the feature expression vector of the search data sent by the terminal device 100 from the feature extraction module 210, traverse multimedia information related to the search data (i.e., candidate search results) in the multimedia index database 220 based on the feature expression vector, and send the searched multimedia information to the ranking module 240.
The ranking module 240 receives the retrieved multimedia information from the retrieval engine 230 and ranks the retrieved multimedia information.
It should be noted that the above-described hardware configuration can be applied to both the terminal device 100 and the search server 200 in the system.
In the above operating environment, according to the first aspect of the present embodiment, a cross-modal search method is provided, which is implemented by the search server 200 shown in fig. 2. Fig. 4 shows a flow diagram of the method, which, with reference to fig. 4, comprises:
s102: receiving retrieval data and determining the modality of the retrieval data;
s104: inputting the retrieval data into a feature extraction model with at least two feature extraction units, and extracting feature expression vectors of the retrieval data through the feature extraction units corresponding to the modalities of the retrieval data;
s106: traversing the index database according to the feature expression vectors, and querying a plurality of candidate retrieval results related to the retrieval data; and
s108: inputting the retrieval data and the candidate retrieval results into a similarity calculation model having a multi-modal fusion feature extraction unit, calculating similarities, and ranking the candidate retrieval results according to the similarities.
Specifically, referring to fig. 2 and 3, the user may input search data through the terminal device 100 in order to search a search result related to the search data. The feature extraction module 210 of the server 200 thus receives the retrieval data and determines the modality of the retrieval data (S102). For example, when the user inputs text search data, the feature extraction module 210 determines that the modality of the search data is text; when the user inputs the image retrieval data, the feature extraction module 210 determines the modality of the retrieval data as an image; and when the user inputs video retrieval data, the feature extraction module 210 determines the modality of the retrieval data as a video.
Fig. 5 further illustrates a schematic diagram of the feature extraction module 210 shown in fig. 3. Referring to fig. 5, the feature extraction module 210 includes a feature extraction model 211. Wherein the feature extraction model 211 comprises at least two feature extraction units. For example, the feature extraction model 211 may include a text feature extraction unit 211a for performing feature extraction on text search data and an image feature extraction unit 211b for performing feature extraction on image search data or video search data. That is, the feature extraction model 211 may employ the structure of a dual stream (or multi-stream) search model. Thus, the feature extraction module 210 may input the search data to the feature extraction unit corresponding to the modality of the search data according to the modality of the received search data (S104). For example, when the retrieval data is text retrieval data, the retrieval data may be input to the text feature extraction unit 211a so that the feature representation vector corresponding to the retrieval data is extracted by the text feature extraction unit 211 a; alternatively, when the search data is image search data or video search data, the search data may be input to the image feature extraction unit 211b so that the feature representation vector corresponding to the search data is extracted by the image feature extraction unit 211 b. In this way, the feature extraction module 210 thus extracts a feature representation vector corresponding to the retrieval data and inputs the extracted feature representation vector to the retrieval engine 230. Further, although fig. 5 exemplarily shows a structure in which the feature extraction model 211 is a two-stream search model, the feature extraction model 211 may include a larger number of feature extraction units. For example, the feature extraction model 211 may further include an audio feature extraction unit for performing feature extraction on the audio retrieval data.
The search engine 230 performs traversal in the multimedia information index repository 220 according to the feature expression vector corresponding to the search data, thereby querying a plurality of candidate search results related to the search data (S106). Specifically, the search engine 230 may match the feature expression vector of the search data with the feature expression vector of each multimedia information in the multimedia information index library 220 by means of feature matching, for example, to obtain a candidate search result. The search engine 230 may then input the candidate search results and the search data to the ranking module 240 for ranking.
Fig. 6 shows a schematic diagram of the sorting module 240. Referring to fig. 6, the ranking module 240 includes a similarity calculation model 241 and a ranking unit 242. And a multi-modal fusion feature extraction unit 241a and a similarity calculation unit 241b are provided in the similarity calculation model 241. The multi-modal fusion feature extraction unit 241a is a feature extraction unit capable of extracting feature expression vectors for data of different modalities. The similarity calculation unit 241b performs similarity calculation using the feature expression vector extracted by the multimodal fusion feature extraction unit 241 a. That is, the multi-modal fusion feature extraction unit 241a of the similarity calculation model 241 adopts a single-stream search model structure, and the single multi-modal fusion feature extraction unit 241a extracts feature expression vectors for data of different modalities such as text, images, or video.
Thus, after receiving the candidate search results including different modalities (i.e., return data including different modalities such as text, image, or video), the ranking module 240 inputs the candidate search results into the similarity calculation model 241, so as to obtain the feature expression vectors of the candidate search results through the multi-modal fusion feature extraction unit 241 a. Furthermore, the sorting module 240 may further acquire the retrieval data from the feature extraction module 210, thereby determining a feature representation vector associated with the retrieval data by the multi-modal fusion feature extraction unit 241a of the similarity calculation model 241.
The ranking module 240 then calculates the similarity between each candidate search result and the search data by the similarity calculation unit 241b of the similarity calculation model 241 based on the feature expression vector of each candidate search result and the feature expression vector of the search data determined by the multi-modal fusion feature extraction unit 241 a. The ranking module 240 then ranks the respective candidate search results according to the calculated similarity using the ranking unit 242. Specifically, the ranking module 240 calculates the similarity between the feature representation vector of each candidate search result and the feature representation vector of the search data using the similarity calculation unit 241b, thereby ranking each candidate search result according to the magnitude of the similarity between the feature representation vector of each candidate search result and the feature representation vector of the search data using the ranking unit 242 (S108). Thus, unlike the way in which the feature extraction module 210 extracts features using a dual-stream (or multi-stream) structure, the sorting module 240 extracts the search data and the feature expression vectors of the candidate search results using a single-stream structure.
As described in the background art, in the feature extraction approach of the single-stream retrieval model, the same feature extraction unit extracts features from retrieval data of different modalities, so data of different modalities can interact fully and feature extraction performance is good; however, the feature extraction unit of the single-stream retrieval model scales poorly and is time-consuming in practical retrieval tasks. In the feature extraction approach of the dual-stream retrieval model, different feature extraction units extract features from retrieval data of different modalities, so the model is highly extensible and retrieval is fast in practical retrieval tasks; however, because different feature extraction units are used for different modalities, the data of different modalities cannot interact as fully as in the single-stream model, feature extraction performance is weaker, and retrieval accuracy drops. At present, therefore, no cross-modal retrieval model achieves both retrieval accuracy and retrieval speed.
In view of this, the technical solution of this embodiment uses a feature extraction model that selects, from at least two feature extraction units corresponding to different modalities, the feature extraction unit matching the modality of the retrieval data to extract its features, obtains the corresponding feature representation vector, and performs retrieval with it. After a plurality of candidate retrieval results related to the retrieval data are queried, the retrieval data and the candidate retrieval results are input into a similarity calculation model having a multi-modal fusion feature extraction unit for similarity calculation, and the candidate retrieval results are ranked according to the similarity. The two feature extraction structures are thereby combined to advantage.
Specifically, in the technical solution of this embodiment, retrieval in the index library is first performed with the feature extraction model having multiple feature extraction units corresponding to different modalities, which avoids the time cost of the single-stream model and speeds up retrieval. Then, a similarity calculation model with a single-stream structure calculates the similarity of the candidate retrieval results, and the candidate retrieval results are ranked according to that similarity. Because the single-stream similarity calculation model extracts features with higher precision, the ranking of the candidate retrieval results becomes more accurate, the better-matching candidates are ranked first, and the user's needs are met. Moreover, since the single-stream ranking module only needs to perform feature extraction and similarity calculation on the candidate retrieval results rather than on all data in the index library, its computation range is reduced, the time cost of the single-stream structure is overcome, and the total retrieval time is reduced. Therefore, this technical solution combines the precision of the single-stream retrieval model with the speed of the dual-stream (or multi-stream) retrieval model, solving the technical problem that existing cross-modal retrieval models cannot achieve both retrieval accuracy and retrieval speed.
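To make this two-stage flow concrete, the following is a minimal Python sketch of the retrieval pipeline. The encoder, index, and scorer objects are hypothetical placeholders for the feature extraction units, the index library, and the similarity calculation model described above, not the patented implementation itself.

```python
def cross_modal_search(query, modality, text_encoder, image_encoder, index, fusion_scorer, k=100):
    """Two-stage cross-modal retrieval: dual-stream recall, then single-stream re-ranking."""
    # Stage 1 (recall): embed the query with the feature extraction unit that
    # matches its modality, then traverse the index library for top-k candidates.
    if modality == "text":
        query_vec = text_encoder.encode(query)        # text feature extraction unit
    else:                                             # "image" or "video"
        query_vec = image_encoder.encode(query)       # image feature extraction unit
    candidates = index.search(query_vec, k)           # candidate retrieval results

    # Stage 2 (re-ranking): score each (query, candidate) pair with the
    # multi-modal fusion similarity model and sort by similarity.
    scored = [(fusion_scorer.similarity(query, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]
```

Only the shortlist of candidates reaches the more expensive fusion scorer, which is what keeps the total retrieval time low.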
Further, optionally, the retrieval engine 230 may perform retrieval matching with different metrics according to the retrieval task or retrieval scenario. For example, the retrieval engine 230 may use Euclidean distance for picture retrieval matching, cosine distance for face recognition, inner product for multimodal data recommendation, and Hamming distance for large-scale video retrieval scenarios. In addition, the retrieval engine 230 of this embodiment may also map the features of an image or video into a binary space through a hashing algorithm and compare similarity using the Hamming distance. The retrieval engine 230 may employ commonly used vector retrieval tools such as Faiss, SPTAG, Milvus, Proxima, and Vearch.
Further, referring to fig. 3, the multimedia information index library 220 may be created as follows. The retrieval server 200 first crawls multimedia source data from data sources on the internet (e.g., various multimedia platforms) through the crawler module 250. The retrieval server 200 then extracts, through the feature extraction module 210, the feature representation vectors associated with the multimedia source data. Finally, the feature extraction module 210 saves the feature representation vectors corresponding to the multimedia source data into the multimedia information index library 220 (i.e., the index library).
Specifically, for example, the multimedia information index library 220 may maintain a separate information table Table_i (i = 0 or 1) for each data modality, where Table_0 is the index information table of text source data and Table_1 is the index information table of image source data. The index information table Table_0 of text source data stores the feature representation vectors associated with the text source data and marks the corresponding indexes; the index information table Table_1 of image source data stores the feature representation vectors associated with the image source data and marks the corresponding indexes. The data records of the index information tables of the text source data and the image source data have dimension d, corresponding to the feature representation vector (i.e., the CLS vector) associated with the multimedia source data. Thus, in this embodiment, the multimedia information ids of different modalities correspond one to one with the extracted feature representation vectors, and the multimedia information index library 220 is constructed in this way. As the feature extraction module 210 continuously adds feature representation vectors for newly crawled multimedia information, the data in the multimedia information index library 220 is updated accordingly. The feature extraction module 210 thus facilitates retrieval by the retrieval engine 230 by storing the feature representation vectors of multimedia information in the multimedia information index library 220.
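As an illustration of how such an index library could be built and queried with one of the vector-retrieval tools named above, here is a minimal Faiss sketch; the dimension, the flat inner-product index, and the random stand-in vectors are assumptions chosen only for demonstration.

```python
import faiss
import numpy as np

d = 768                                                  # assumed CLS-vector dimension
index = faiss.IndexIDMap(faiss.IndexFlatIP(d))           # flat inner-product index with ids attached

# Index construction: store the feature representation vectors of crawled
# multimedia items together with their multimedia-information ids.
item_vectors = np.random.rand(10000, d).astype("float32")   # stand-ins for extracted features
item_ids = np.arange(10000, dtype="int64")
faiss.normalize_L2(item_vectors)                         # inner product on unit vectors equals cosine
index.add_with_ids(item_vectors, item_ids)

# Retrieval: embed the query with the modality-specific unit (stand-in vector here),
# then query the top-100 most similar entries as candidate retrieval results.
query_vec = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 100)
```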
Optionally, the feature extraction model includes a text feature extraction unit and an image feature extraction unit, and the operation of extracting the feature representation vector of the retrieval data by the feature extraction unit corresponding to the modality of the retrieval data includes: determining a feature expression vector corresponding to the text retrieval data by using a text feature extraction unit under the condition that the retrieval data is text retrieval data; and determining, by the image feature extraction unit, a feature representation vector corresponding to the image retrieval data or the video retrieval data in a case where the retrieval data is the image retrieval data or the video retrieval data.
Specifically, referring to FIG. 5, the feature extraction module 210 includes a feature extraction model 211. Wherein the feature extraction model 211 comprises at least two feature extraction units. For example, the feature extraction model 211 may include a text feature extraction unit 211a for performing feature extraction on text search data and an image feature extraction unit 211b for performing feature extraction on image search data or video search data, that is, the feature extraction model 211 may adopt a structure of a two-stream (or multi-stream) search model. Thus, the feature extraction module 210 may input the search data to the feature extraction unit corresponding to the modality of the search data according to the modality of the received search data. For example, when the retrieval data is text retrieval data, the retrieval data may be input to the text feature extraction unit 211a so that the feature representation vector corresponding to the retrieval data is extracted by the text feature extraction unit 211 a; alternatively, when the search data is image search data or video search data, the search data may be input to the image feature extraction unit 211b so that the feature representation vector corresponding to the search data is extracted by the image feature extraction unit 211 b. In this way, the feature extraction module 210 thus extracts a feature representation vector corresponding to the retrieval data and inputs the extracted feature representation vector to the retrieval engine 230. Further, although fig. 5 exemplarily shows a structure in which the feature extraction model 211 is a two-stream search model, the feature extraction model 211 may include a larger number of feature extraction units. For example, the feature extraction model 211 may further include an audio feature extraction unit for performing feature extraction on the audio retrieval data.
Further, preferably, the features extracted by the text feature extraction unit 211a and the image feature extraction unit 211b are semantic features in vector form. For example, the text feature extraction unit 211a and the image feature extraction unit 211b each include a plurality of attention-based feature extraction layers, such as Transformer layers, for extracting semantic features of the retrieval data. Further, the text feature extraction unit 211a may be, for example, a BERT-based feature extraction unit, and the image feature extraction unit 211b may be, for example, a ViT-based feature extraction unit, so that the text feature extraction unit 211a and the image feature extraction unit 211b each include a plurality of Transformer layers.
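For instance, the two units could be instantiated from public BERT and ViT checkpoints; the sketch below uses the Hugging Face transformers library with assumed checkpoint names and an assumed local image file, purely as an illustration rather than the patented configuration.

```python
import torch
from PIL import Image
from transformers import BertModel, BertTokenizer, ViTImageProcessor, ViTModel

# Text feature extraction unit (BERT-based) over token embeddings.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Image feature extraction unit (ViT-based) over patch embeddings.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

with torch.no_grad():
    text_inputs = tokenizer("an airplane on the runway", return_tensors="pt")
    text_cls = text_encoder(**text_inputs).last_hidden_state[:, 0]        # text CLS vector

    image_inputs = processor(images=Image.open("airplane.jpg"), return_tensors="pt")
    image_cls = image_encoder(**image_inputs).last_hidden_state[:, 0]     # image CLS vector
```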
Therefore, the feature extraction module 210 with a double-flow (or multi-flow) structure is adopted in the embodiment to extract the feature expression vector of the retrieval data, so that the data of different modes can be expressed in a similar semantic space, and measurement and retrieval by a retrieval engine are facilitated.
Optionally, the method further comprises training the feature extraction model by: creating a training sample set for training the feature extraction model, wherein each training sample of the training sample set comprises paired text data and image data; inputting the text data of the training sample into a text feature extraction unit, and inputting the image data of the training sample into an image feature extraction unit; and training the feature extraction model according to mutual information between the output result of the text feature extraction unit and the output result of the image feature extraction unit.
Specifically, fig. 7 shows a schematic diagram of training the text feature extraction unit 211a and the image feature extraction unit 211 b. Referring to fig. 7, the text feature extraction unit 211a and the image feature extraction unit 211b are first initialized with a priori information.
Then, the pairs of training samples (text + image/video) are input to the text feature extraction unit 211a and the image feature extraction unit 211b, respectively. In one batch training, the feature corresponding to the semantically related text-image is used as a positive sample, and the feature corresponding to the semantically unrelated text-image is used as a negative sample. Thereby obtaining text semantic features associated with the text of the training sample from the text feature extraction unit 211a and image semantic features associated with the image/video of the training sample from the image feature extraction unit 211 b.
Then, the text feature extraction unit 211a and the image feature extraction unit 211b are trained on the text semantic features and image semantic features associated with the training samples using an InfoNCE loss function (i.e., a mutual information loss function). Training of the text feature extraction unit 211a and the image feature extraction unit 211b is complete when, for the same training sample, the mutual information between the text semantic features output by the text feature extraction unit 211a and the image semantic features output by the image feature extraction unit 211b is maximized.
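A compact sketch of such a contrastive objective is given below (a standard symmetric InfoNCE loss over one batch of paired text/image features; the temperature value is an assumption, not taken from the patent).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/image features.

    text_feats, image_feats: tensors of shape (batch, d) from the two feature
    extraction units; row i of each tensor comes from the same semantically
    related text-image pair (positive), and all other rows act as negatives.
    """
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)

    logits = text_feats @ image_feats.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Text-to-image and image-to-text directions, averaged.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2
```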
Optionally, the operation of determining, by the text feature extraction unit, a feature representation vector corresponding to the text retrieval data includes: inputting the text retrieval data into the text feature extraction unit; acquiring output characteristics of a plurality of characteristic extraction layers of the text characteristic extraction unit; and performing weighted summation on the output features of the feature extraction layers to obtain feature expression vectors corresponding to the text retrieval data.
Specifically, fig. 8 shows a schematic structure applicable to the text feature extraction unit 211a or the image feature extraction unit 211b. Referring to fig. 8, the feature extraction unit includes an embedding layer, Transformer layers 0 to L (i.e., attention-based feature extraction layers), and a multi-semantic fusion module. The number of Transformer layers in the text feature extraction unit 211a and in the image feature extraction unit 211b may be the same or different; that is, the value of L may differ between the two units.
Therefore, when the feature extraction model 211 extracts features from text retrieval data, the text retrieval data is first input into the embedding layer of the text feature extraction unit, and semantic features of the text retrieval data are then extracted by Transformer layers 0 to L.
Specifically, the semantic feature c_0 generated by Transformer layer 0 is output to Transformer layer 1. Transformer layer 1 generates the semantic feature c_1 from c_0 and outputs it to Transformer layer 2. By analogy, the semantic feature c_{L-1} generated by Transformer layer L-1 is output to Transformer layer L, and Transformer layer L generates and outputs the semantic feature c_L from c_{L-1}. The feature extraction model 211 thus obtains the semantic feature c_L output by Transformer layer L together with the semantic features c_0 to c_{L-1} output by Transformer layers 0 to L-1, where each semantic feature c_l is, for example, the d-dimensional CLS vector generated by the corresponding Transformer layer.
Then, referring to fig. 8, the feature extraction model 211 inputs the semantic features c_0 to c_L into the multi-semantic fusion module. The multi-semantic fusion module fuses c_0 to c_L to generate the semantic feature c associated with the input text retrieval data, i.e., the feature representation vector of the text retrieval data.
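Continuing the Hugging Face sketch above, the per-layer CLS vectors c_0 to c_L can be collected from the encoder's hidden states; uniform weights are used here only as a placeholder for the multi-semantic fusion, whose learned weighting is sketched after the fusion formula further below.

```python
import torch

# text_encoder and text_inputs come from the earlier BERT/ViT sketch; rerun the
# forward pass with hidden states exposed so every layer's CLS vector is available.
with torch.no_grad():
    outputs = text_encoder(**text_inputs, output_hidden_states=True)

# hidden_states[0] is the embedding-layer output; hidden_states[1:] are the
# Transformer layer outputs, from which the per-layer CLS vectors are taken.
per_layer_cls = torch.stack([h[:, 0] for h in outputs.hidden_states[1:]], dim=1)  # (batch, L+1, d)

# Multi-semantic fusion as a weighted summation over layers; uniform weights here
# stand in for the Linear/sigmoid-derived weights described later.
num_layers = per_layer_cls.size(1)
weights = torch.full((num_layers,), 1.0 / num_layers)
fused_representation = (per_layer_cls * weights[None, :, None]).sum(dim=1)        # (batch, d)
```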
In the actual retrieval process, data of different modalities are heterogeneous in nature, and a large semantic gap exists. Therefore, the purpose of cross-modal retrieval is to find a uniform measurement space, measure the distance between different modalities through a measurement mode, eliminate semantic gaps between different modalities and realize mutual retrieval between different modalities. However, since the semantic information of the extracted features of the conventional feature extraction model (for example, BERT or ViT) is single and the features of different modalities cannot be aligned sufficiently, there is a case where data information of different modalities cannot be accurately retrieved when the conventional feature extraction model is applied to cross-modality retrieval.
Specifically, fig. 9 shows a schematic diagram of an existing feature extraction unit. Referring to fig. 9, the existing feature extraction unit includes only an embedding layer and Transformer layers 0 to L, and it outputs the final-layer semantic feature c_L of the last Transformer layer L as the feature representation vector of the input text retrieval data.
Because the existing feature extraction model uses only the semantic feature c_L output by the last Transformer layer L as the feature representation vector of the input text retrieval data, the semantic information it outputs is limited and features of different modalities cannot be sufficiently aligned. Therefore, when the existing feature extraction model is applied to cross-modal retrieval, data information of different modalities still cannot be accurately retrieved.
In view of this, the technical solution of this embodiment does not take the semantic feature c_L output by the last Transformer layer L alone as the feature representation vector of the input text retrieval data. Instead, the intermediate-layer semantic features c_0 to c_{L-1} output by Transformer layers 0 to L-1 are fused with the final-layer semantic feature c_L output by the last Transformer layer L to generate the feature representation vector c associated with the input text retrieval data.
Therefore, the technical solution of this embodiment fuses feature information of different layers through the multi-semantic fusion operation, makes full use of the information of every Transformer layer, and fully mines and represents the semantic information of the text retrieval data, thereby alleviating the semantic gap between information of different modalities and improving the accuracy of cross-modal retrieval. This solves the technical problem in the prior art that, due to the semantic gap between data of different modalities, existing feature extraction models cannot accurately retrieve data information of different modalities during cross-modal retrieval.
Optionally, the operation of determining a feature representation vector corresponding to the image retrieval data or the video retrieval data by using the image feature extraction unit includes: inputting the image retrieval data or the video retrieval data into an image feature extraction unit; acquiring output characteristics of a plurality of characteristic extraction layers of an image characteristic extraction unit; and carrying out weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the image retrieval data or the video retrieval data.
Specifically, the structure of the image feature extraction unit may also be as shown in fig. 8, except that its number of Transformer layers L may be the same as or different from that of the text feature extraction unit. The image feature extraction unit 211b includes an embedding layer, Transformer layers 0 to L (i.e., attention-based feature extraction layers), and a multi-semantic fusion module. Therefore, when extracting features from image retrieval data or video retrieval data, the retrieval server 200 first inputs the image retrieval data or video retrieval data into the embedding layer of the feature extraction model, and semantic features of the image retrieval data or video retrieval data are then extracted by Transformer layers 0 to L.
As with the text feature extraction unit, the semantic feature c_0 generated by Transformer layer 0 is output to Transformer layer 1, Transformer layer 1 generates the semantic feature c_1 from c_0 and outputs it to Transformer layer 2, and so on, until Transformer layer L generates and outputs the semantic feature c_L from c_{L-1}. The feature extraction model 211 thus obtains the semantic feature c_L output by Transformer layer L together with the semantic features c_0 to c_{L-1} output by Transformer layers 0 to L-1, where each semantic feature c_l may be, for example, the d-dimensional CLS vector generated by the corresponding Transformer layer.
Then, referring to fig. 8, the feature extraction model 211 inputs the semantic features c_0 to c_L into the multi-semantic fusion module. The multi-semantic fusion module fuses c_0 to c_L to generate the semantic feature c associated with the input image retrieval data or video retrieval data, i.e., the feature representation vector of the image retrieval data or video retrieval data.
As described above, data of different modalities is heterogeneous in nature and there is a large semantic gap. Therefore, the purpose of cross-modal retrieval is to find a uniform measurement space, measure the distance between different modalities through a measurement mode, eliminate semantic gaps between different modalities and realize mutual retrieval between different modalities. However, since the semantic information of the extracted features of the conventional feature extraction model (for example, BERT or ViT) is single and the features of different modalities cannot be aligned sufficiently, there is a case where data information of different modalities cannot be accurately retrieved when the conventional feature extraction model is applied to cross-modality retrieval.
In view of this, the technical solution of this embodiment does not take the semantic feature c_L output by the last Transformer layer L alone as the feature representation vector of the input image retrieval data or video retrieval data. Instead, the intermediate-layer semantic features c_0 to c_{L-1} output by Transformer layers 0 to L-1 are fused with the final-layer semantic feature c_L output by the last Transformer layer L to generate the feature representation vector c associated with the input image retrieval data or video retrieval data.
Therefore, the technical solution of this embodiment fuses feature information of different layers through the multi-semantic fusion operation, makes full use of the information of every Transformer layer, and fully mines and represents the semantic information of the image retrieval data or video retrieval data, thereby alleviating the semantic gap between information of different modalities and improving the accuracy of cross-modal retrieval. This solves the technical problem in the prior art that, due to the semantic gap between data of different modalities, existing feature extraction models cannot accurately retrieve data information of different modalities during cross-modal retrieval.
Further, the text feature extraction unit 211a and the image feature extraction unit 211b of the feature extraction model 211 each determine the feature representation vector associated with the retrieval data by combining the final-layer semantic features with the intermediate-layer semantic features. The feature extraction model 211 therefore fuses feature information of different layers through the multi-semantic fusion operation, makes full use of the information of the Transformer layers, and fully mines and represents the semantic information of the retrieval data. In this way, the embodiment alleviates the semantic gap between different modalities within a dual-stream (or multi-stream) structure, thereby improving the accuracy of cross-modal retrieval.
In addition, the semantic fusion module shown in FIG. 8 may fuse the semantic features output by the respective Transformer layers into a d-dimensional vector through the Linear layer and/or the sigmoid layer. For example, the semantic fusion module may determine, through the Linear layer and the sigmoid layer, a weight for the semantic feature output by each Transformer layer, and obtain the feature representation vector by weighted summation of these semantic features.
The parameters of the Linear layer can be learned jointly with the feature extraction unit (such as the text feature extraction unit and the image feature extraction unit) during the training of the whole feature extraction model.
Therefore, through the Linear layer and/or the sigmoid layer, the semantic features output by the respective Transformer layers can be fully fused, so that the semantic information of different layers of the multimedia information is comprehensively mined and expressed, which improves the accuracy of subsequent cross-modal retrieval.
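As a minimal illustrative sketch only (not the exact structure of FIG. 8), the multi-semantic fusion described above can be written as follows in Python/PyTorch; the module name MultiSemanticFusion, the single shared Linear layer, and the choice of a sigmoid gate followed by weighted summation are assumptions made for illustration.

import torch
import torch.nn as nn

class MultiSemanticFusion(nn.Module):
    # Fuses the semantic features output by Transformer layers 0..L into one
    # d-dimensional feature representation vector via a learned sigmoid gate.
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(d, 1)  # parameters learned jointly with the feature extraction unit

    def forward(self, layer_feats):
        # layer_feats: (L+1, d), one pooled semantic feature per Transformer layer
        weights = torch.sigmoid(self.gate(layer_feats))  # (L+1, 1) per-layer weights
        return (weights * layer_feats).sum(dim=0)        # weighted summation -> (d,)

# Example: fuse 13 layer features (layers 0..12) of dimension 768.
fused = MultiSemanticFusion(768)(torch.randn(13, 768))
print(fused.shape)  # torch.Size([768])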
In addition, as an example, the text feature extraction unit and the image feature extraction unit in this embodiment fuse the semantic features output by all of Transformer layer 0 to Transformer layer L. However, among Transformer layer 0 to Transformer layer L-1, the intermediate-layer semantic features output by only some of the Transformer layers may instead be selected and fused with the final-layer semantic features output by the last Transformer layer L, and such a scheme is also applicable to the technical solution of the present disclosure. For example, the semantic features output by Transformer layer 2 to Transformer layer L-1 may be fused with the final-layer semantic features output by the last Transformer layer L. Details are not repeated here.
Optionally, the operation of traversing the index library according to the feature expression vector and querying a plurality of candidate retrieval results related to the retrieval data includes: traversing the index database according to the feature expression vectors, inquiring a predetermined number of candidate feature expression vectors with the correlation degree of the feature expression vectors being arranged from high to low, and acquiring candidate retrieval results corresponding to the candidate feature expression vectors.
Specifically, referring to fig. 3, after receiving the feature representation vector of the retrieval data from the feature extraction module 210, the search engine 230 traverses the multimedia information index database 220 using that vector. For example, the search engine 230 may perform a similarity calculation between the feature representation vector of the retrieval data and the feature representation vector of each piece of multimedia information stored in the multimedia information index database 220, thereby obtaining the similarity of every stored vector to the query. The search engine 230 then selects, in descending order of similarity, a predetermined number of candidate feature representation vectors from the feature representation vectors stored in the multimedia information index database 220, and takes the multimedia information corresponding to these candidate feature representation vectors as the candidate retrieval results. As an example, the predetermined number may be 100, but other values may be used according to actual needs.
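For illustration only, the traversal and candidate selection just described can be sketched with NumPy as below; the brute-force scan, the use of cosine similarity, and the function and variable names are assumptions, and in practice an approximate nearest-neighbour index could replace the full scan.

import numpy as np

def query_candidates(query_vec, index_vectors, index_items, top_n=100):
    # index_vectors: (N, d) feature representation vectors stored in the index database
    # index_items:   the N multimedia items corresponding to those vectors
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    sims = m @ q                          # similarity of every stored vector to the query
    order = np.argsort(-sims)[:top_n]     # predetermined number, highest similarity first
    return [(index_items[i], float(sims[i])) for i in order]

# Toy example: an index of 1000 items with 768-dimensional vectors.
candidates = query_candidates(np.random.randn(768), np.random.randn(1000, 768),
                              [f"item_{i}" for i in range(1000)], top_n=100)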
Optionally, the similarity calculation model is a similarity calculation model using multi-modal fusion, and further includes training the similarity calculation model by: creating a training sample set for training the similarity calculation model, wherein each training sample of the training sample set comprises paired text data and image data; labeling the similarity between the paired text data and image data; inputting the paired text data and image data into a similarity calculation model; and training the similarity calculation model according to the labeled similarity and the similarity calculated by the similarity calculation model.
Specifically, as shown in fig. 6, a similarity calculation model 241 is provided in the ranking module 240.
To train the similarity calculation model 241, a training sample set for training the similarity calculation model 241 is first created. The training sample set includes a plurality of training samples, and each training sample includes a pair of text data and image data. Annotators then manually label each training sample, the label being the similarity between the text data and the image data of that sample. Then, each training sample in the training sample set is input to the similarity calculation model 241 in turn, and the following operations are performed:
inputting the text data and the image data of the training sample into a similarity calculation model 241, respectively, extracting feature expression vectors of the text data and the image data by using a multi-modal fusion feature extraction unit 241a of the similarity calculation model 241, and calculating the similarity between the text data and the image data according to the feature expression vectors of the text data and the feature expression vectors of the image data by using a similarity calculation unit 241b of the similarity calculation model 241; and
for the training sample, the similarity calculation model 241 is trained by using the similarity calculated by the similarity calculation model 241 and the manually labeled similarity.
Thus, in the above manner, the training of the similarity calculation model 241 can be completed.
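A compressed sketch of this training procedure is given below; the placeholder SimilarityModel (two projection layers plus a scoring head) only stands in for the single-flow model 241 with units 241a and 241b, and the mean-squared-error objective against the manually labeled similarity is an assumption chosen to show the flow of data rather than the exact loss of this application.

import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    # Stand-in for similarity calculation model 241 (multi-modal fusion feature
    # extraction unit 241a followed by similarity calculation unit 241b).
    def __init__(self, d=768):
        super().__init__()
        self.text_proj = nn.Linear(d, d)
        self.image_proj = nn.Linear(d, d)
        self.score = nn.Linear(2 * d, 1)

    def forward(self, text_feat, image_feat):
        fused = torch.cat([self.text_proj(text_feat), self.image_proj(image_feat)], dim=-1)
        return torch.sigmoid(self.score(fused)).squeeze(-1)  # similarity in [0, 1]

model = SimilarityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One training step on a toy batch of paired (text, image) features with labeled similarity.
text_batch, image_batch = torch.randn(8, 768), torch.randn(8, 768)
labeled_similarity = torch.rand(8)  # manually labeled similarity of each pair
optimizer.zero_grad()
loss = loss_fn(model(text_batch, image_batch), labeled_similarity)
loss.backward()
optimizer.step()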
In addition, the structure of the multi-modal fusion feature extraction unit 241a in the similarity calculation model 241 can also refer to the structure shown in fig. 8, except that the number L of Transformer layers and the dimension d of the feature representation vector may differ from those of the text feature extraction unit and the image feature extraction unit; this is not described again here.
Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium comprises a stored program, wherein the method of any of the above is performed by a processor when the program is run.
Therefore, in the technical scheme of this embodiment, retrieval in the index library is first performed using the feature extraction model having a plurality of feature extraction units corresponding to different modalities, so as to obtain the candidate retrieval results; this avoids the time cost of a single-flow model and accelerates retrieval. Then, the similarity calculation model with a single-flow structure is used to calculate the similarity of the candidate retrieval results, and the candidate retrieval results are sorted according to the similarity. Because the single-flow similarity calculation model can extract features with higher precision, the sorting of the candidate retrieval results becomes more accurate, better-matched candidate retrieval results are ranked first, and the requirements of users are better satisfied. Moreover, since the sorting module with the single-flow structure only performs feature extraction and similarity calculation on the candidate retrieval results rather than on all data in the index database, the calculation scope of the single-flow similarity calculation model is reduced, its time cost is overcome, and the total retrieval time is reduced. Therefore, the technical scheme takes into account both the precision of a single-flow retrieval model and the speed of a double-flow (or multi-flow) retrieval model, and solves the technical problem that existing cross-modal retrieval models cannot achieve both retrieval precision and retrieval speed at the same time.
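As a compact illustration of this coarse-to-fine flow (not an API defined by this application), the following Python sketch strings the two stages together; the names feature_extractor, index.query and similarity_model are hypothetical stand-ins for the feature extraction model, the search engine over the index database, and the single-flow similarity calculation model.

def cross_modal_search(retrieval_data, modality, feature_extractor, index, similarity_model, top_n=100):
    # Stage 1: double-flow (or multi-flow) coarse retrieval -- fast, one unit per modality.
    query_vec = feature_extractor[modality](retrieval_data)
    candidates = index.query(query_vec, top_n)           # predetermined number of candidates
    # Stage 2: single-flow re-ranking -- precise, but only over the candidates.
    scored = [(item, similarity_model(retrieval_data, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored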
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 10 shows a cross-modality retrieval apparatus 1000 according to the present embodiment, the apparatus 1000 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 10, the apparatus 1000 includes: a retrieval data receiving module 1010, configured to receive retrieval data and determine a modality of the retrieval data; a feature extraction module 1020 for inputting the search data to a feature extraction model having at least two feature extraction units, and extracting a feature representation vector of the search data by a feature extraction unit corresponding to a modality of the search data; the query module 1030 is configured to traverse the index database according to the feature expression vectors, and query a plurality of candidate retrieval results related to the retrieval data; and a sorting module 1040, configured to input the search data and the candidate search result into a similarity calculation model with the multi-modal fusion feature extraction unit, perform similarity calculation, and sort the candidate search result according to the similarity.
Optionally, the feature extraction model includes a text feature extraction unit and an image feature extraction unit, and the feature extraction module 1020 includes: the text feature extraction submodule is used for determining a feature expression vector corresponding to the text retrieval data by using a text feature extraction unit under the condition that the retrieval data is the text retrieval data; and an image feature extraction sub-module for determining, by the image feature extraction unit, a feature representation vector corresponding to the image retrieval data or the video retrieval data, in a case where the retrieval data is the image retrieval data or the video retrieval data.
Optionally, the feature extraction model is trained by: creating a training sample set for training the feature extraction model, wherein each training sample of the training sample set comprises paired text data and image data; inputting the text data of the training sample into a text feature extraction unit, and inputting the image data of the training sample into an image feature extraction unit; and training the feature extraction model according to mutual information between the output result of the text feature extraction unit and the output result of the image feature extraction unit.
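The mutual-information criterion mentioned above is not given in closed form here. One widely used way to maximise mutual information between two paired views is an InfoNCE-style contrastive loss over a batch of paired samples; the sketch below adopts that as an assumption and is not necessarily the objective used in this application.

import torch
import torch.nn.functional as F

def info_nce_loss(text_feats, image_feats, temperature=0.07):
    # text_feats, image_feats: (B, d) outputs of the text and image feature extraction
    # units for B paired training samples; matching pairs share the same row index.
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    logits = t @ v.t() / temperature       # (B, B) pairwise similarities
    targets = torch.arange(t.size(0))      # positives lie on the diagonal
    # Symmetric text-to-image and image-to-text cross-entropy; minimising this loss
    # tightens a lower bound on the mutual information between the two outputs.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce_loss(torch.randn(32, 768), torch.randn(32, 768))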
Optionally, the text feature extraction sub-module includes: a text retrieval data input unit for inputting the text retrieval data into the text feature extraction unit; the text feature acquisition unit is used for acquiring output features of a plurality of feature extraction layers of the text feature extraction unit; and the text feature fusion unit is used for carrying out weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the text retrieval data.
Optionally, the image feature extraction sub-module includes: an image retrieval data input unit for inputting the image retrieval data or the video retrieval data to the image feature extraction unit; an image feature acquisition unit for acquiring output features of a plurality of feature extraction layers of the image feature extraction unit; and an image feature fusion unit for performing weighted summation on the output features of the plurality of feature extraction layers to obtain a feature representation vector corresponding to the image retrieval data or the video retrieval data.
Optionally, the query module 1030 includes a traversal submodule, configured to traverse the index library according to the feature expression vectors, query a predetermined number of candidate feature expression vectors that are arranged in a descending order of the degree of correlation between the candidate feature expression vectors and the feature expression vectors, and obtain candidate retrieval results corresponding to the candidate feature expression vectors.
Optionally, the similarity calculation model is trained by: creating a training sample set for training the similarity calculation model, wherein each training sample of the training sample set comprises paired text data and image data; labeling the similarity between the paired text data and image data; inputting the paired text data and image data into a similarity calculation model; and training the similarity calculation model according to the labeled similarity and the similarity calculated by the similarity calculation model.
Therefore, in the technical scheme of this embodiment, retrieval in the index library is first performed using the feature extraction model having a plurality of feature extraction units corresponding to different modalities, so as to obtain the candidate retrieval results; this avoids the time cost of a single-flow model and accelerates retrieval. Then, the similarity calculation model with a single-flow structure is used to calculate the similarity of the candidate retrieval results, and the candidate retrieval results are sorted according to the similarity. Because the single-flow similarity calculation model can extract features with higher precision, the sorting of the candidate retrieval results becomes more accurate, better-matched candidate retrieval results are ranked first, and the requirements of users are better satisfied. Moreover, since the sorting module with the single-flow structure only performs feature extraction and similarity calculation on the candidate retrieval results rather than on all data in the index database, the calculation scope of the single-flow similarity calculation model is reduced, its time cost is overcome, and the total retrieval time is reduced. Therefore, the technical scheme takes into account both the precision of a single-flow retrieval model and the speed of a double-flow (or multi-flow) retrieval model, and solves the technical problem that existing cross-modal retrieval models cannot achieve both retrieval precision and retrieval speed at the same time.
Example 3
Fig. 11 shows a cross-modality retrieval apparatus 1100 according to the present embodiment, the apparatus 1100 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 11, the apparatus 1100 includes: a processor 1110; and a memory 1120, coupled to the processor 1110, for providing instructions to the processor 1110 to process the following processing steps: receiving retrieval data and determining the modality of the retrieval data; inputting the retrieval data into a feature extraction model with at least two feature extraction units, and extracting feature expression vectors of the retrieval data through the feature extraction units corresponding to the modalities of the retrieval data; traversing the index database according to the feature expression vectors, and inquiring a plurality of candidate retrieval results related to the retrieval data; and inputting the retrieval data and the candidate retrieval results into a similarity calculation model with a multi-mode fusion feature extraction unit, calculating the similarity, and sequencing the candidate retrieval results according to the similarity.
Optionally, the feature extraction model includes a text feature extraction unit and an image feature extraction unit, and the operation of extracting the feature representation vector of the search data by the feature extraction unit corresponding to the modality of the search data includes: determining a feature representation vector corresponding to the text retrieval data by using the text feature extraction unit when the retrieval data is text retrieval data; and determining a feature representation vector corresponding to the image retrieval data or the video retrieval data by using an image feature extraction unit in the case that the retrieval data is the image retrieval data or the video retrieval data.
Optionally, the feature extraction model is trained by: creating a training sample set for training the feature extraction model, wherein each training sample of the training sample set comprises paired text data and image data; inputting the text data of the training sample into the text feature extraction unit, and inputting the image data of the training sample into the image feature extraction unit; and training the feature extraction model according to mutual information between the output result of the text feature extraction unit and the output result of the image feature extraction unit.
Optionally, the operation of determining, by the text feature extraction unit, a feature representation vector corresponding to the text retrieval data includes: inputting the text retrieval data into the text feature extraction unit; acquiring output characteristics of a plurality of characteristic extraction layers of the text characteristic extraction unit; and carrying out weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the text retrieval data.
Optionally, the operation of determining a feature representation vector corresponding to the image retrieval data or the video retrieval data by using an image feature extraction unit includes: inputting the image retrieval data or the video retrieval data to the image feature extraction unit; acquiring output characteristics of a plurality of characteristic extraction layers of the image characteristic extraction unit; and performing weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the image retrieval data or the video retrieval data.
Optionally, the operation of traversing the index library according to the feature expression vector and querying a plurality of candidate retrieval results related to the retrieval data includes: traversing an index library according to the feature expression vectors, inquiring a predetermined number of candidate feature expression vectors with the correlation degree of the feature expression vectors being arranged from high to low, and obtaining candidate retrieval results corresponding to the candidate feature expression vectors.
Optionally, the similarity calculation model is trained by: creating a training sample set for training the similarity calculation model, each training sample of the training sample set including paired text data and image data; labeling the similarity between the paired text data and image data; inputting the paired text data and image data into the similarity calculation model; and training the similarity calculation model according to the labeled similarity and the similarity calculated by the similarity calculation model.
Therefore, in the technical scheme of this embodiment, retrieval in the index library is first performed using the feature extraction model having a plurality of feature extraction units corresponding to different modalities, so as to obtain the candidate retrieval results; this avoids the time cost of a single-flow model and accelerates retrieval. Then, the similarity calculation model with a single-flow structure is used to calculate the similarity of the candidate retrieval results, and the candidate retrieval results are sorted according to the similarity. Because the single-flow similarity calculation model can extract features with higher precision, the sorting of the candidate retrieval results becomes more accurate, better-matched candidate retrieval results are ranked first, and the requirements of users are better satisfied. Moreover, since the sorting module with the single-flow structure only performs feature extraction and similarity calculation on the candidate retrieval results rather than on all data in the index database, the calculation scope of the single-flow similarity calculation model is reduced, its time cost is overcome, and the total retrieval time is reduced. Therefore, the technical scheme takes into account both the precision of a single-flow retrieval model and the speed of a double-flow (or multi-flow) retrieval model, and solves the technical problem that existing cross-modal retrieval models cannot achieve both retrieval precision and retrieval speed at the same time.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.

Claims (10)

1. A cross-modal search method, comprising:
receiving retrieval data and determining the modality of the retrieval data;
inputting the retrieval data into a feature extraction model with at least two feature extraction units, and extracting feature representation vectors of the retrieval data through the feature extraction units corresponding to the modalities of the retrieval data;
traversing an index database according to the feature expression vector, and inquiring a plurality of candidate retrieval results related to the retrieval data; and
and inputting the retrieval data and the candidate retrieval results into a similarity calculation model with a multi-mode fusion feature extraction unit, calculating the similarity, and sequencing the candidate retrieval results according to the similarity.
2. The method according to claim 1, wherein the feature extraction model includes a text feature extraction unit and an image feature extraction unit, and the operation of extracting the feature representation vector of the search data by the feature extraction unit corresponding to the modality of the search data includes:
determining a feature representation vector corresponding to the text retrieval data by using the text feature extraction unit when the retrieval data is text retrieval data; and
and in the case that the retrieval data is image retrieval data or video retrieval data, determining a feature representation vector corresponding to the image retrieval data or the video retrieval data by using an image feature extraction unit.
3. The method of claim 2, further comprising training the feature extraction model by:
creating a training sample set for training the feature extraction model, wherein each training sample of the training sample set comprises paired text data and image data;
inputting the text data of the training sample into the text feature extraction unit, and inputting the image data of the training sample into the image feature extraction unit; and
and training the feature extraction model according to mutual information between the output result of the text feature extraction unit and the output result of the image feature extraction unit.
4. The method of claim 2, wherein the operation of determining, by the text feature extraction unit, a feature representation vector corresponding to the text search data comprises:
inputting the text retrieval data into the text feature extraction unit;
acquiring output characteristics of a plurality of characteristic extraction layers of the text characteristic extraction unit; and
and performing weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the text retrieval data.
5. The method of claim 2, wherein the operation of determining, with the image feature extraction unit, a feature representation vector corresponding to the image retrieval data or the video retrieval data comprises:
inputting the image retrieval data or the video retrieval data to the image feature extraction unit;
acquiring output characteristics of a plurality of characteristic extraction layers of the image characteristic extraction unit; and
and performing weighted summation on the output features of the feature extraction layers to obtain a feature representation vector corresponding to the image retrieval data or the video retrieval data.
6. The method of claim 1, wherein the operation of traversing the index database according to the feature representation vector to query a plurality of candidate search results related to the search data comprises:
traversing an index library according to the feature expression vectors, inquiring a predetermined number of candidate feature expression vectors with the correlation degree of the feature expression vectors being arranged from high to low, and obtaining candidate retrieval results corresponding to the candidate feature expression vectors.
7. The method of claim 6, further comprising training the similarity computation model by:
creating a training sample set for training the similarity calculation model, each training sample of the training sample set including paired text data and image data;
labeling the similarity between the paired text data and image data;
inputting the paired text data and image data into the similarity calculation model; and
and training the similarity calculation model according to the marked similarity and the similarity calculated by the similarity calculation model.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. A cross-modality retrieval apparatus, comprising:
the retrieval data receiving module is used for receiving retrieval data and determining the modality of the retrieval data;
the characteristic extraction module is used for inputting the retrieval data into a characteristic extraction model with at least two characteristic extraction units and extracting a characteristic representation vector of the retrieval data through the characteristic extraction unit corresponding to the modality of the retrieval data;
the query module is used for traversing an index database according to the feature expression vector and querying a plurality of candidate retrieval results related to the retrieval data; and
and the sequencing display module is used for inputting the retrieval data and the candidate retrieval results into a similarity calculation model with a multi-mode fusion feature extraction unit, calculating the similarity and sequencing the candidate retrieval results according to the similarity.
10. A cross-modality retrieval apparatus, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
receiving retrieval data and determining the modality of the retrieval data;
inputting the retrieval data into a feature extraction model with at least two feature extraction units, and extracting feature representation vectors of the retrieval data through the feature extraction units corresponding to the modalities of the retrieval data;
traversing an index database according to the feature expression vector, and inquiring a plurality of candidate retrieval results related to the retrieval data; and
and inputting the retrieval data and the candidate retrieval results into a similarity calculation model with a multi-mode fusion feature extraction unit, calculating the similarity, and sequencing the candidate retrieval results according to the similarity.
CN202210781046.6A 2022-07-05 2022-07-05 Cross-modal retrieval method and device and storage medium Pending CN114861016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210781046.6A CN114861016A (en) 2022-07-05 2022-07-05 Cross-modal retrieval method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114861016A true CN114861016A (en) 2022-08-05

Family

ID=82625764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210781046.6A Pending CN114861016A (en) 2022-07-05 2022-07-05 Cross-modal retrieval method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114861016A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126581A (en) * 2016-06-20 2016-11-16 复旦大学 Cartographical sketching image search method based on degree of depth study
CN108446312A (en) * 2018-02-06 2018-08-24 西安电子科技大学 Remote sensing image search method based on depth convolution semantic net
US20210191990A1 (en) * 2019-12-20 2021-06-24 Rakuten, Inc. Efficient cross-modal retrieval via deep binary hashing and quantization
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111324765A (en) * 2020-02-07 2020-06-23 复旦大学 Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN112015923A (en) * 2020-09-04 2020-12-01 平安科技(深圳)有限公司 Multi-mode data retrieval method, system, terminal and storage medium
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN114003753A (en) * 2021-09-30 2022-02-01 浙江大华技术股份有限公司 Picture retrieval method and device
CN114048282A (en) * 2021-11-16 2022-02-15 中山大学 Text tree local matching-based image-text cross-modal retrieval method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392389A (en) * 2022-09-01 2022-11-25 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN115392389B (en) * 2022-09-01 2023-08-29 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN115687676A (en) * 2022-12-29 2023-02-03 浙江大华技术股份有限公司 Information retrieval method, terminal and computer-readable storage medium
CN115687676B (en) * 2022-12-29 2023-03-31 浙江大华技术股份有限公司 Information retrieval method, terminal and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220805)