CN115885274A - Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment

Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment

Info

Publication number
CN115885274A
Authority
CN
China
Prior art keywords
vector
unit
pooling
sample
network
Prior art date
Legal status
Pending
Application number
CN202180050439.3A
Other languages
Chinese (zh)
Inventor
萧人豪
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN115885274A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application discloses a cross-modal retrieval method, a training method of a cross-modal retrieval model, an electronic device, and a computer-readable storage medium. The cross-modal retrieval method first extracts feature vectors of a first modality sample serving as the retrieval target to generate a first unit vector set, and feature vectors of a plurality of second modality samples serving as candidate objects to generate second unit vector sets. It then pools the first unit vector set and each second unit vector set, projects the pooled first global vector and the pooled second global vectors into the same joint space, and finally obtains a cross-modal retrieval result based on the similarity between the projected first joint vector and each second joint vector. The first global vector contains the position information of the first units included in the first modality sample, and each second global vector contains the position information of the second units included in the corresponding second modality sample, so the accuracy of the retrieval result can be improved.

Description

Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment
Technical Field
The present application relates to the field of retrieval technologies, and in particular, to a cross-modal retrieval method, a training method for a cross-modal retrieval model, an electronic device, and a computer-readable storage medium.
Background
In conventional cross-modal retrieval technology, the features of samples of different modalities generally need to be projected into the same embedding space to obtain retrieval results. To obtain these features, a sample is usually split into a plurality of units, the features of each unit are extracted separately, and the features of all units are then pooled to obtain the features of the sample. A common choice is average pooling, but average pooling easily loses important information of the sample, thereby reducing the accuracy of the retrieval results.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a cross-modal retrieval method, a training method of a cross-modal retrieval model, an electronic device, and a computer-readable storage medium that can improve the accuracy of a retrieval result.
In order to solve the technical problem, the application adopts a technical scheme that: a cross-modal retrieval method is provided, which comprises the following steps:
inputting a first modality sample as a retrieval target and a plurality of second modality samples as candidate objects into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network;
extracting a first unit vector set of the first modality samples by using a first feature extraction network, and respectively extracting a second unit vector set of each second modality sample by using a second feature extraction network, wherein the first modality samples comprise a plurality of first units, the first unit vector set consists of feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set consists of feature vectors of all the second units in the corresponding second modality sample;
pooling the first unit vector set by using a pooling network to obtain a first global vector, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein the first global vector comprises position information of the first unit in a first modal sample, and each second global vector comprises corresponding position information of the second unit in a corresponding second modal sample;
projecting the first global vector to a joint space by using a first embedded network to obtain a first joint vector, and projecting each second global vector to the joint space by using a second embedded network to obtain a plurality of second joint vectors;
and acquiring a cross-modal retrieval result of the first modal sample based on the similarity between the first joint vector and each second joint vector.
In order to solve the above technical problem, another technical solution adopted by the present application is: a training method of a cross-modal search model is provided, which comprises the following steps:
inputting a plurality of first modality samples and a plurality of second modality samples into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedding network and a second embedding network, and each first modality sample and/or each second modality sample is/are provided with tag data;
respectively extracting a first unit vector set of each first modality sample by using a first feature extraction network, and respectively extracting a second unit vector set of each second modality sample by using a second feature extraction network, wherein each first modality sample comprises a plurality of first units, each first unit vector set consists of the feature vectors of all first units in the corresponding first modality sample, each second modality sample comprises a plurality of second units, and each second unit vector set consists of the feature vectors of all second units in the corresponding second modality sample;
pooling each first unit vector set by using a pooling network to obtain a plurality of first global vectors, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein each first global vector comprises position information of the first unit in the corresponding first modal sample, and each second global vector comprises position information of the second unit in the corresponding second modal sample;
projecting each first global vector to a joint space by using a first embedded network to obtain a plurality of first joint vectors, and projecting each second global vector to the joint space by using a second embedded network to obtain a plurality of second joint vectors;
parameters of the search model are adjusted based on the first joint vector and the second joint vector.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: there is provided an electronic device, including a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is capable of executing the program instructions to implement the cross-modal search method according to the above technical solution, or the training method of the cross-modal search model according to the above technical solution.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: there is provided a computer readable storage medium having stored thereon program instructions executable by a processor to implement the cross-modal search method of the above technical solution, or the training method of the cross-modal search model of the above technical solution.
In order to solve the technical problem, the other technical scheme adopted by the application is as follows: there is provided an electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions and the processor is capable of executing the program instructions to implement:
inputting a first modality sample as a retrieval target and a plurality of second modality samples as candidate objects into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network;
extracting a first unit vector set of the first modality samples by using a first feature extraction network, and respectively extracting a second unit vector set of each second modality sample by using a second feature extraction network, wherein the first modality samples comprise a plurality of first units, the first unit vector set consists of feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set consists of feature vectors of all the second units in the corresponding second modality sample;
pooling the first unit vector set by using a pooling network to obtain a first global vector, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein the first global vector comprises position information of the first unit in a first modal sample, and each second global vector comprises corresponding position information of the second unit in a corresponding second modal sample;
projecting the first global vector to a joint space by using a first embedded network to obtain a first joint vector, and projecting each second global vector to the joint space by using a second embedded network to obtain a plurality of second joint vectors;
and acquiring a cross-modal retrieval result of the first modal sample based on the similarity between the first joint vector and each second joint vector.
The beneficial effect of this application is: in the cross-modal retrieval method provided by the application, the first feature extraction network extracts a first unit vector set of the first modality sample, and the second feature extraction network extracts a second unit vector set of each second modality sample, wherein the first modality sample comprises a plurality of first units, the first unit vector set is composed of the feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set is composed of the feature vectors of all the second units of the corresponding second modality sample. The pooling network pools the first unit vector set to obtain a first global vector and pools each second unit vector set to obtain a plurality of second global vectors, wherein the first global vector contains the position information of the first units in the first modality sample, and each second global vector contains the position information of the second units in the corresponding second modality sample. The first embedding network projects the first global vector into a joint space to obtain a first joint vector, the second embedding network projects each second global vector into the same joint space to obtain a plurality of second joint vectors, and the cross-modal retrieval result of the first modality sample is obtained based on the similarity between the first joint vector and each second joint vector. That is to say, the present application retains the position information in the samples of different modalities during the pooling process, so that the accuracy of the retrieval result can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive effort, wherein:
FIG. 1 is a flowchart illustrating an embodiment of a cross-modal search method of the present application;
FIG. 2a is an example of a search result of image matching;
FIG. 2b is a second exemplary search result of image matching;
FIG. 2c is a third example of a search result of image matching;
FIG. 3 is a schematic flowchart illustrating an embodiment of step S13 in FIG. 1;
FIG. 4 is a flowchart illustrating an embodiment of step S21 in FIG. 3;
FIG. 5 is a flowchart illustrating an embodiment of step S22 in FIG. 3;
FIG. 6 is a schematic flow chart illustrating another embodiment of step S13 in FIG. 1;
FIG. 7 is a flowchart illustrating an embodiment of step S51 in FIG. 6;
FIG. 8 is a flowchart illustrating an embodiment of step S52 in FIG. 6;
FIG. 9 is a schematic structural diagram of an embodiment of a cross-modal search apparatus according to the present application;
FIG. 10 is a flowchart illustrating an embodiment of a cross-modal search model training method according to the present application;
FIG. 11 is a flowchart illustrating an embodiment of step S85 in FIG. 10;
FIG. 12 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a cross-modal search method according to the present application, the search method includes the following steps:
and step S11, inputting a first modality sample as a retrieval target and a plurality of second modality samples as candidate objects into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network.
In the cross-modal retrieval process, a first-modal sample is used as a retrieval target, a plurality of second-modal samples are used as candidate objects, such as text samples, image samples, audio samples and other samples of different modalities, and the purpose of retrieval is to obtain a plurality of second-modal samples matched with the first-modal sample. For convenience of understanding, the embodiment uses the first modality sample as a text sample, and the second modality sample as an image sample for illustration. For example, the present embodiment first inputs one text sample and a plurality of image samples into a retrieval model to obtain one or more image samples matching the text sample as an image matching retrieval result. The retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network, the retrieval model is trained in advance, and a specific training method is described later.
Step S12, a first unit vector set of the first modality samples is extracted by using the first feature extraction network, and a second unit vector set of each second modality sample is respectively extracted by using the second feature extraction network, wherein the first modality samples comprise a plurality of first units, the first unit vector set comprises feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set comprises feature vectors of all the second units in the corresponding second modality sample.
The first modality sample includes a plurality of first units, and the second modality sample includes a plurality of second units, in this embodiment, the first feature extraction network may be used to extract a feature vector of each first unit to obtain a first unit vector set, and the second feature extraction network may be used to extract a feature vector of each second unit to obtain a second unit vector set, where the second unit vector set corresponds to the second modality sample one to one.
Taking a text sample as the first modality sample and an image sample as the second modality sample for example, the text sample includes a plurality of words, and the image sample includes a plurality of salient regions, each salient region containing an object of interest. The first feature extraction network may extract a plurality of word vectors from the text sample, and the second feature extraction network may extract a plurality of region vectors from the image sample, thereby obtaining a word vector set corresponding to the text sample and a region vector set corresponding to each image sample. The first feature extraction network includes, but is not limited to, a skip-gram model, which can vectorize all the words of the text sample, and the second feature extraction network includes, but is not limited to, a Faster R-CNN model and a ResNet model, which can detect objects of interest in the image sample and vectorize the salient regions containing them.
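As an illustration of how the unit vector sets might look in practice, the following minimal sketch builds a word-vector set for a text sample and a region-vector set for an image sample. The tiny vocabulary, the random embedding table, and the random region features are stand-ins (assumptions) for the outputs of the skip-gram and Faster R-CNN/ResNet extractors named above; the sketch only shows the data layout, not the patent's actual networks.

```python
# Minimal sketch: a text sample becomes a set of per-word feature vectors (first
# units) and an image sample becomes a set of per-region feature vectors (second
# units). The embedding table and region features are random stand-ins for the
# outputs of a skip-gram model and a Faster R-CNN / ResNet detector.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# Hypothetical skip-gram word embeddings for a tiny vocabulary.
vocab = {"a": 0, "dog": 1, "playing": 2, "on": 3, "the": 4, "grass": 5}
word_embeddings = rng.normal(size=(len(vocab), D))

def extract_word_vector_set(text):
    """First unit vector set: one feature vector per word of the text sample."""
    return np.stack([word_embeddings[vocab[w]] for w in text.lower().split()])

def extract_region_vector_set(num_regions):
    """Second unit vector set: one (stand-in) feature vector per salient region."""
    return rng.normal(size=(num_regions, D))

first_unit_vectors = extract_word_vector_set("a dog playing on the grass")  # (6, D)
second_unit_vectors = extract_region_vector_set(num_regions=4)              # (4, D)
print(first_unit_vectors.shape, second_unit_vectors.shape)
```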
Step S13, pooling the first unit vector set by using a pooling network to obtain a first global vector, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein the first global vector contains the position information of the first unit in the first modal sample, and each second global vector contains the corresponding position information of the second unit in the corresponding second modal sample.
Each first unit in the first modality sample has its own absolute position information and relative position information with respect to the other first units, and each second unit in the second modality sample likewise has its own absolute position information and relative position information with respect to the other second units. In this embodiment, the first global vector and the second global vectors are obtained by pooling the first unit vector set and each second unit vector set respectively while retaining this position information during the pooling process, where the first global vector corresponds to the first modality sample serving as the retrieval target and each second global vector corresponds to one of the plurality of second modality samples serving as candidate objects.
Still taking a text sample as the first modality sample and an image sample as the second modality sample, each word in the text sample has absolute position information in the text sample and relative position information with respect to other words, and each salient region in the image sample has absolute position information in the image sample and relative position information with respect to other salient regions. This embodiment separately vectorizes the position information of each word and of each salient region and adds the vectorized position information to the corresponding feature vectors, thereby preserving the position information during the pooling process. That is, the pooling in this embodiment is "sequential code mixing pooling". After pooling, a first global vector of the text sample is obtained, as well as a second global vector for each image sample. The specific pooling process will be described later.
Step S14, projecting the first global vector to a joint space by using a first embedded network to obtain a first joint vector, and projecting each second global vector to the joint space by using a second embedded network to obtain a plurality of second joint vectors.
After the first global vector and the plurality of second global vectors are obtained through pooling, they need to be projected into the same joint space by the embedding networks before retrieval can be carried out. Specifically, in this embodiment the first embedded network is used to project the first global vector into the joint space to obtain a first joint vector, and the second embedded network is used to project each second global vector into the joint space to obtain a plurality of second joint vectors. That is, the first joint vector corresponds to the first modality sample serving as the retrieval target, and each second joint vector corresponds to one of the plurality of second modality samples serving as candidate objects, so that the relationship between the retrieval target and each candidate object can subsequently be obtained by comparing the first joint vector with each second joint vector.
For example, the first global vector corresponding to the text sample and the second global vector corresponding to each of the plurality of image samples are projected into the same image-semantic joint space, so that the relationship between the text sample and each image sample can be compared.
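The embedding networks are not specified in detail here; the sketch below treats each of them as a single linear projection followed by L2 normalization, which is an illustrative assumption rather than the patent's exact architecture.

```python
# Sketch of the two embedding networks as single linear projections into a shared
# joint space, followed by L2 normalization. The weight matrices are hypothetical;
# the real embedding networks may be deeper.
import numpy as np

rng = np.random.default_rng(1)
D_GLOBAL, D_JOINT = 8, 4
W_first = rng.normal(size=(D_GLOBAL, D_JOINT))   # first embedding network (assumed linear)
W_second = rng.normal(size=(D_GLOBAL, D_JOINT))  # second embedding network (assumed linear)

def project(global_vector, W):
    """Project a global vector into the joint space and normalize to unit length."""
    v = global_vector @ W
    return v / (np.linalg.norm(v) + 1e-12)

first_joint = project(rng.normal(size=D_GLOBAL), W_first)                    # text side
second_joints = np.stack([project(rng.normal(size=D_GLOBAL), W_second)
                          for _ in range(3)])                                # image side
```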
Step S15, acquiring a cross-modal retrieval result of the first modality sample based on the similarity between the first joint vector and each second joint vector.
After the first joint vector and the plurality of second joint vectors are obtained, this embodiment obtains the cross-modal retrieval result by computing the similarity between the first joint vector and each second joint vector; that is, the similarity between the first joint vector and each second joint vector can be used as a quantitative measure of the relationship between the first modality sample and the corresponding second modality sample, so as to obtain the retrieval result.
Specifically, the second modality samples corresponding to a specified number of second joint vectors with the highest similarity may be selected as the cross-modal retrieval result of the first modality sample. That is, the distances between the first joint vector and each second joint vector in the joint space may be calculated and sorted, so as to output, as the retrieval result, the second modality samples corresponding to the specified number (e.g., 8, 10, or 15) of second joint vectors closest to the first joint vector.
For example, the similarity between the text sample and each image sample may be obtained and ranked, so that the image sample with the highest similarity is output as the retrieval result of image matching. Please refer to fig. 2a, fig. 2b and fig. 2c, which are examples of search results of image matching respectively, where fig. 2a is a search result of image matching of a text sample "a dog playing on the grass", fig. 2b is a search result of image matching of a text sample "training a mobile", and fig. 2c is a search result of image matching of a text sample "cosmetic scope", and all 10 image samples with the highest similarity are output. Therefore, the embodiment improves the accuracy of the retrieval result by retaining the position information in the samples in different modes in the pooling process.
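A minimal sketch of the ranking in step S15 is given below. Cosine similarity (a dot product between unit-length joint vectors) is used as one reasonable similarity measure; the description only requires some similarity, so this choice and the random vectors are assumptions for illustration.

```python
# Sketch of step S15: rank the candidate second-modality samples by the similarity
# between the first joint vector and each second joint vector, then keep the top k.
import numpy as np

def retrieve_top_k(first_joint, second_joints, k=10):
    # With unit-length joint vectors, cosine similarity is a plain dot product.
    sims = second_joints @ first_joint
    order = np.argsort(-sims)            # candidate indices, most similar first
    return order[:k], sims[order[:k]]

rng = np.random.default_rng(2)
query = rng.normal(size=4)
query /= np.linalg.norm(query)                                   # first joint vector
candidates = rng.normal(size=(100, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)  # second joint vectors
top_indices, top_scores = retrieve_top_k(query, candidates, k=10)
print(top_indices)
```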
In an embodiment, referring to fig. 3, fig. 3 is a flowchart illustrating an embodiment of step S13 in fig. 1, where the first global vector is obtained through the following steps.
Step S21, embedding the position information of each first unit in the first modality sample into the corresponding feature vector to obtain a first sequential encoding vector set.
In this embodiment, in the process of obtaining the first global vector, the position information of each first unit in the first modality sample may be retained, specifically, the position information is embedded in the corresponding feature vector, so as to obtain the first sequential encoding vector set.
In an embodiment, referring to fig. 4, fig. 4 is a flowchart illustrating an embodiment of step S21 in fig. 3, and the position information may be embedded into the corresponding feature vector through the following steps.
Step S31, generating a first position vector of each first unit based on the position information of each first unit in the first modality sample, the first position vector having the same dimension as the corresponding feature vector.
Taking a text sample as the first modality sample for example, a first position vector of each word is generated based on the position information of each word in the text sample, that is, the position information is vectorized, wherein the first position vector has the same dimension as the corresponding feature vector so that it can be conveniently embedded into the feature vector.
Specifically, the first position vector may be obtained by the following formula (1):
PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D)) ……(1);
where PE denotes the first position vector, pos denotes the position of the first unit in the first modality sample, D denotes the dimension of the first position vector PE, and i is an integer between 0 and (D-1)/2, such as 0, 1, 2, …. Here 2i denotes an even dimension of the first position vector PE (e.g., the zeroth, second, or fourth dimension), and 2i+1 denotes an odd dimension of the first position vector PE (e.g., the first, third, or fifth dimension); PE(pos, 2i) is the value at position pos and dimension 2i, calculated by the sine function, and PE(pos, 2i+1) is the value at position pos and dimension 2i+1, calculated by the cosine function.
Step S32, replacing the feature vector of each first unit in the first unit vector set with a corresponding first sequential encoding vector to obtain a first sequential encoding vector set, where the first sequential encoding vector of each first unit is the sum of the first position vector and the feature vector of the first unit.
After the first position vector is obtained, it may be added to the corresponding feature vector to obtain a first sequential encoding vector, thereby obtaining a first set of sequential encoding vectors. The first set of sequential encoding vectors may be obtained by equation (2) as follows:
X′=X+PE……(2);
wherein X is the feature vector of the first unit, and X′ is the corresponding first sequential encoding vector. For example, in a text sample, X is the feature vector corresponding to a word in the text sample, and X′ is the first sequential encoding vector of that word.
This embodiment obtains the first position vector by vectorizing the position information of the first unit, and uses the sum of the first position vector and the corresponding feature vector as the first sequential encoding vector, so that the position information of the first unit in the first modality sample is retained and the accuracy of the retrieval result can be improved.
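The following sketch implements formulas (1) and (2) as described above. The base 10000 in the denominator follows the standard sinusoidal positional encoding and is an assumption, since the original formula image is not reproduced here; D is assumed even for simplicity.

```python
# Sketch of formulas (1) and (2): build a sinusoidal first position vector for each
# first unit and add it to the unit's feature vector to obtain the first sequential
# encoding vector set X' = X + PE.
import numpy as np

def position_vector(pos, D):
    """Formula (1): sine on even dimensions, cosine on odd dimensions (base 10000 assumed)."""
    pe = np.zeros(D)
    i = np.arange(D // 2)
    angle = pos / np.power(10000.0, 2 * i / D)
    pe[0::2] = np.sin(angle)   # even dimensions
    pe[1::2] = np.cos(angle)   # odd dimensions
    return pe

def order_encode(unit_vectors):
    """Formula (2): X' = X + PE for every unit of the sample."""
    D = unit_vectors.shape[1]
    pes = np.stack([position_vector(pos, D) for pos in range(len(unit_vectors))])
    return unit_vectors + pes

X = np.random.default_rng(3).normal(size=(6, 8))   # six words, 8-dimensional features
X_prime = order_encode(X)                          # first sequential encoding vector set
```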
Step S22, pooling the first sequential encoding vector set by using the pooling network to obtain the first global vector.
Specifically, referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of step S22 in fig. 3, and the pooling network may pool the first sequential encoding vector set through the following steps.
Step S41, clustering the first sequential encoding vector set based on the soft distribution function to obtain a plurality of classes.
The clustering operation places the first sequential encoding vectors that are closest to one another into the same class. The index s_k(x′_i) relating each first sequential encoding vector to each class can be obtained through the soft distribution function shown in the following formula (3), thereby realizing the clustering operation:
s_k(x′_i) = exp(w_k·x′_i + b_k) / Σ_{k′=1}^{K} exp(w_{k′}·x′_i + b_{k′}) ……(3);
where K is the number of classes, x′_i is the i-th first sequential encoding vector, w_k and b_k are learnable parameters associated with the k-th class, k is an integer between 1 and K, and s_k(x′_i) represents the probability that the first sequential encoding vector x′_i belongs to the k-th class.
Step S42, acquiring a first pooling feature of each class based on the first fully connected layer, and acquiring a second pooling feature of each class based on the second fully connected layer.
After the clustering operation, a first pooling feature G_1(X′, k) of each class may be obtained based on the first fully connected layer, and a second pooling feature G_2(X′, k) of each class may be obtained based on the second fully connected layer. Specifically, the first pooling feature and the second pooling feature of each class are obtained by formula (4), which computes them from the soft distribution values s_k(x′_i), the cluster center c_k of the k-th class, and the normalization parameter σ_k of the k-th class, where N is the number of first sequential encoding vectors x′_i in the k-th class.
Step S43, inputting the sum of the first pooling features and the second pooling features of all classes into the third fully connected layer to obtain the first global vector.
After the first pooling feature G_1(X′, k) and the second pooling feature G_2(X′, k) of each class are obtained, the sum of the first pooling features of all classes and the sum of the second pooling features of all classes are calculated, and the sum of the two is input into the third fully connected layer to obtain the first global vector P_OEM(X′). This process can be expressed by the following formula (5):
P_OEM(X′) = FC_3( Σ_{k=1}^{K} G_1(X′, k) + Σ_{k=1}^{K} G_2(X′, k) ) ……(5);
where FC_3(·) denotes the third fully connected layer.
in this embodiment, the first order encoding vector set including the position information is pooled through the multiple fully-connected layers, so that the first global vector of the first modality sample is obtained, and the accuracy of the search result can be improved.
In an embodiment, please refer to fig. 6, wherein fig. 6 is a flowchart illustrating another embodiment of step S13 in fig. 1, and the second global vector is obtained through the following steps.
Step S51, embedding the position information of each second unit in each second modality sample into the corresponding feature vector to obtain a plurality of second sequential encoding vector sets.
In this embodiment, in the process of obtaining the second global vector, the position information of the second unit in the second modality sample may be retained, specifically, the position information of each second unit in each second modality sample is embedded into the corresponding feature vector, so as to obtain a plurality of second sequential encoding vector sets.
In an embodiment, referring to fig. 7, fig. 7 is a flowchart illustrating an embodiment of step S51 in fig. 6, where the position information can be embedded into the corresponding feature vector through the following steps.
Step S61, generating a second position vector of each second unit based on the position information of each second unit in the corresponding second modality sample, the second position vector having the same dimension as the corresponding feature vector.
Taking an image sample as the second modality sample for example, a second position vector of each salient region is generated based on the position information of each salient region in the image sample, that is, the position information is vectorized, wherein the second position vector has the same dimension as the corresponding feature vector so that it can be conveniently embedded into the feature vector.
Since the second position vector contains the row parameter and the column parameter of the salient region, half of the elements of the second position vector may be used to encode the row parameter and the other half to encode the column parameter. Among them, the elements with even indices in the second position vector follow the sine rule, and the elements with odd indices follow the cosine rule. The specific calculation of the second position vector is the same as that of the first position vector; see formula (1).
Step S62, replacing the feature vector of each second unit in each second unit vector set with a corresponding second sequential encoding vector to obtain a plurality of second sequential encoding vector sets, where each second sequential encoding vector is the sum of the second position vector and the feature vector of the corresponding second unit.
After the second position vectors of the second cells in each second cell vector set are obtained, the second position vectors and the corresponding feature vectors may be added to obtain second order encoding vectors, so as to obtain a plurality of second order encoding vector sets, and the specific process may refer to the above equation (2).
That is, when PE in the above formula (1) is taken as the second position vector, pos as the position of the second unit in the corresponding second modality sample, and (D + 1) as the dimension of the second position vector PE, and X in the above formula (2) is taken as the feature vector of the second unit and X′ as the corresponding second sequential encoding vector, formula (1) and formula (2) are used to obtain the plurality of second sequential encoding vector sets.
This embodiment obtains the second position vector by vectorizing the position information of the second unit, and uses the sum of the second position vector and the corresponding feature vector as the second sequential encoding vector, so that the position information of the second unit in the second modality sample is retained and the accuracy of the retrieval result can be improved.
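The sketch below illustrates the row/column split of the second position vector: half of the elements encode the row parameter and half the column parameter, each with the sine/cosine rule of formula (1). How the row and column indices are derived from a salient region's bounding box is an assumption here.

```python
# Sketch of the second position vector for an image region: the first half of the
# elements encodes the row parameter and the second half encodes the column
# parameter, each with the sine/cosine rule of formula (1).
import numpy as np

def position_vector(pos, D):
    pe = np.zeros(D)
    i = np.arange(D // 2)
    angle = pos / np.power(10000.0, 2 * i / D)
    pe[0::2] = np.sin(angle)
    pe[1::2] = np.cos(angle)
    return pe

def region_position_vector(row, col, D):
    """Second position vector: half of the dimensions for the row, half for the column."""
    half = D // 2
    return np.concatenate([position_vector(row, half),   # row parameter
                           position_vector(col, half)])  # column parameter

pe_region = region_position_vector(row=2, col=5, D=8)  # e.g. grid cell of a salient region
```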
Step S52, pooling each second sequential encoding vector set by using the pooling network to obtain a plurality of second global vectors.
Specifically, referring to fig. 8, fig. 8 is a flowchart illustrating an embodiment of step S52 in fig. 6, and the pooling network may pool each second sequential encoding vector set through the following steps.
Step S71, clustering each second sequential encoding vector set based on the soft distribution function to obtain a plurality of classes.
The clustering operation places the second sequential encoding vectors that are closest to one another into the same class; for the specific process, refer to formula (3) above. It should be noted that there are a plurality of second sequential encoding vector sets, and each of them is clustered to obtain a plurality of classes corresponding to that set.
Step S72, acquiring the first pooling feature of each class corresponding to each second sequential encoding vector set based on the first fully connected layer, and acquiring the second pooling feature of each class corresponding to each second sequential encoding vector set based on the second fully connected layer.
After the clustering operation, for each second sequential encoding vector set, the first pooling feature of each class may be obtained based on the first fully connected layer, and the second pooling feature of each class may be obtained based on the second fully connected layer; refer to formula (4) above.
Step S73, inputting, for each second sequential encoding vector set, the sum of the first pooling features and the second pooling features of all corresponding classes into the third fully connected layer to obtain a plurality of second global vectors.
For each second sequential encoding vector set, after the first pooling feature and the second pooling feature of each class are obtained, the sum of the first pooling features of all classes and the sum of the second pooling features of all classes are calculated, and the sum of the two is input into the third fully connected layer to obtain the second global vector corresponding to that second sequential encoding vector set; for the specific process, refer to formula (5) above.
That is, when K in the above formula (3) is taken as the number of classes corresponding to a second sequential encoding vector set, x′_i as a second sequential encoding vector in the k-th class corresponding to that set, N in the above formula (4) as the number of second sequential encoding vectors x′_i in the k-th class corresponding to that set, and P_OEM(X′) in the above formula (5) as the second global vector corresponding to that set, formula (3), formula (4), and formula (5) are used to obtain the plurality of second global vectors.
In this embodiment, the plurality of second sequential encoding vector sets containing the position information are pooled through multiple fully connected layers to obtain a second global vector corresponding to each second modality sample, so that the accuracy of the retrieval result can be improved.
Based on the same inventive concept, the present application further provides a cross-modal retrieval apparatus. Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of the cross-modal retrieval apparatus of the present application. The retrieval apparatus includes a first feature extraction module 110, a second feature extraction module 120, a pooling module 130, a first embedding module 140, a second embedding module 150, and an obtaining module 160. A first modality sample serving as the retrieval target is the input of the first feature extraction module 110, and a plurality of second modality samples serving as candidate objects may be the inputs of the second feature extraction module 120. The outputs of the first feature extraction module 110 and the second feature extraction module 120 may be the inputs of the pooling module 130; the output of the pooling module 130 corresponding to the first feature extraction module 110 may be the input of the first embedding module 140, and the output of the pooling module 130 corresponding to the second feature extraction module 120 may be the input of the second embedding module 150. The outputs of the first embedding module 140 and the second embedding module 150 may be the inputs of the obtaining module 160, and the output of the obtaining module 160 may be the cross-modal retrieval result.
The first feature extraction module 110 is configured to extract a first unit vector set of the first modality sample, and the second feature extraction module 120 is configured to extract a second unit vector set of each second modality sample, respectively, where the first modality sample includes a plurality of first units, the first unit vector set is composed of feature vectors of all the first units, the second modality sample includes a plurality of second units, and the second unit vector set is composed of feature vectors of all the second units.
The pooling module 130 is configured to pool the first unit vector set to obtain a first global vector, and separately pool each second unit vector set to obtain a plurality of second global vectors, where the first global vector includes location information of the first unit in the first modality sample, and each second global vector includes location information of the corresponding second unit in the second modality sample.
The pooling module 130 includes a sequential encoding module 131 and a full-connect module 132.
When the first unit vector set is input to the pooling module 130, the sequential encoding module 131 is configured to embed the position information of each first unit in the first modality sample into the corresponding feature vector, and output a first sequential encoding vector set, and the full-connection module 132 is configured to pool the first sequential encoding vector set, and output a first global vector. Specifically, the sequential encoding module 131 generates first position vectors of the first units based on the position information of the first units in the first mode sample, and then uses the sum of each first position vector and the corresponding feature vector as a first sequential encoding vector, thereby obtaining a first sequential encoding vector set. Specifically, the full-connection module 132 clusters the first sequential coding vector set based on the soft-distribution function to obtain a plurality of classes, then obtains the first pooling characteristic and the second pooling characteristic of each class, and finally further pools the sum of the first pooling characteristic and the second pooling characteristic of all the classes to obtain the first global vector.
When the plurality of second unit vector sets are input to the pooling module 130, the sequential encoding module 131 is configured to embed the position information of each second unit in the second modality sample into the corresponding feature vector, and output a plurality of second sequential encoding vector sets, and the full-connection module 132 is configured to pool each second sequential encoding vector set, and output a plurality of second global vectors. Specifically, the sequential encoding module 131 generates a second position vector of each second unit based on the position information of each second unit in the corresponding second modality sample, and then uses the sum of each second position vector corresponding to each second modality sample and the corresponding feature vector as a second sequential encoding vector, thereby obtaining a plurality of second sequential encoding vector sets. Specifically, the full-connection module 132 clusters each second sequential coding vector set based on the soft distribution function to obtain a plurality of classes, then obtains the first pooling characteristic and the second pooling characteristic of each class corresponding to each second sequential coding vector set, and finally further pools the sum of the first pooling characteristic and the second pooling characteristic of all classes corresponding to each second sequential coding vector set to obtain a plurality of second global vectors.
The first embedding module 140 is configured to project the first global vector to a joint space to obtain a first joint vector, and the second embedding module 150 is configured to project each of the second global vectors to the same joint space to obtain a plurality of second joint vectors.
The obtaining module 160 is configured to obtain a cross-modal search result of the first-modal sample based on the similarity between the first joint vector and each second joint vector, for example, select a second-modal sample corresponding to a specified number of second joint vectors with the highest similarity as the cross-modal search result of the first-modal sample.
According to the embodiment, the position information in the samples in different modes is reserved in the pooling process, so that the accuracy of the retrieval result can be improved.
Based on the same inventive concept, the present application further provides a training method for a cross-modal search model, please refer to fig. 10, where fig. 10 is a schematic flowchart of an embodiment of the training method for the cross-modal search model of the present application, and the training method includes the following steps.
Step S81, respectively inputting a plurality of first modality samples and a plurality of second modality samples into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network, and each first modality sample and/or each second modality sample comprises tag data.
The first modality sample and the second modality sample may be text samples, image samples, audio samples, and the like, and the embodiment takes the first modality sample as a text sample and the second modality sample as an image sample as an example for explanation.
Step S82, a first unit vector set of each first modality sample is respectively extracted by using the first feature extraction network, and a second unit vector set of each second modality sample is respectively extracted by using the second feature extraction network, each first modality sample includes a plurality of first units, each first unit vector set is composed of feature vectors of all first units in the corresponding first modality sample, each second modality sample includes a plurality of second units, and each second unit vector set is composed of feature vectors of all second units in the corresponding second modality sample.
In this embodiment, the first feature extraction network may be used to extract a feature vector of each first unit to obtain a first unit vector set, and the second feature extraction network may be used to extract a feature vector of each second unit to obtain a second unit vector set, where the first unit vector set corresponds to the first modality samples one to one, and the second unit vector set corresponds to the second modality samples one to one.
For example, a text sample includes a plurality of words and an image sample includes a plurality of salient regions. The first feature extraction network includes, but is not limited to, a skip-gram model, which can vectorize all the words of the text sample, and the second feature extraction network includes, but is not limited to, a Faster R-CNN model and a ResNet model, which can detect objects of interest in the image sample and vectorize the salient regions containing them.
Step S83, pooling each first unit vector set by using a pooling network to obtain a plurality of first global vectors, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein each first global vector contains the position information of the first units in the corresponding first modality sample, and each second global vector contains the position information of the second units in the corresponding second modality sample.
In this embodiment, after obtaining the first unit vector set and the second unit vector set, the pooling network is used to pool the first unit vector set and the second unit vector set, and the position information related to the first unit and the second unit is retained in the pooling process, so as to obtain a first global vector and a second global vector, where the first global vector corresponds to the first modal sample one to one, and the second global vector corresponds to the second modal sample one to one. For a specific pooling process, reference may be made to the above embodiments related to the cross-modal search method, which is not described herein again.
Step S84, projecting each first global vector to the joint space by using the first embedded network to obtain a plurality of first joint vectors, and projecting each second global vector to the joint space by using the second embedded network to obtain a plurality of second joint vectors.
After the first global vectors and the second global vectors are obtained through pooling, they need to be projected into the same joint space through the embedding networks before the parameters of the retrieval model can be adjusted and optimized. In this embodiment, each first global vector may be projected by using the first embedded network, and the obtained first joint vectors correspond to the first modality samples one to one; each second global vector may be projected by using the second embedded network, and the obtained second joint vectors correspond to the second modality samples one to one.
In step S85, parameters of the search model are adjusted based on the first joint vector and the second joint vector.
Wherein the first modality sample and/or the second modality sample carry tag data that is strongly correlated with the corresponding sample. In an embodiment, please refer to fig. 11, fig. 11 is a flowchart illustrating an embodiment of step S85 in fig. 10, in which the parameters of the search model can be adjusted through the following steps.
Step S91, determining a first degree of association between the first modality sample and the second modality sample by using a similarity between the first joint vector and the second joint vector.
After obtaining the plurality of first joint vectors and the plurality of second joint vectors, the present embodiment may obtain the similarity between every pairwise combination of a first joint vector and a second joint vector, so as to determine the first degree of association between the first modality samples and the second modality samples.
Step S92, determining a second degree of association based on the tag data, and calculating a loss function based on the first degree of association and the second degree of association.
After determining the first degree of association between the first modality samples and the second modality samples, the present embodiment may determine the second degree of association according to the tag data. For example, if the second modality samples carry tag data, the tag data is vectorized, and the similarity between every pairwise combination of a first joint vector and a tag vector is obtained to get the second degree of association between the first joint vectors and the tag data. A loss function is then calculated based on the first degree of association and the second degree of association, so that the retrieval model can be adjusted and optimized according to the loss function. The loss function includes, but is not limited to, a triplet loss.
In step S93, parameters of the search model are adjusted based on the loss function.
After the loss function is calculated, the parameters of the retrieval model can be adjusted based on the loss function. Through multiple adjustments, the retrieval model continuously reduces the gap between the first degree of association and the second degree of association, so that the first modality sample and the second modality sample with the highest degree of association can be screened out, and the training of the retrieval model is completed.
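A hedged sketch of a single triplet-loss evaluation is shown below, since the triplet loss is one of the losses the description mentions. The margin value, the unit-length joint vectors, and the use of one matched and one mismatched image for a single text anchor are illustrative assumptions, not the patent's exact training procedure.

```python
# Hedged sketch of one triplet-loss evaluation: the similarity of a text sample's
# first joint vector to the matching image's second joint vector should exceed its
# similarity to a mismatched image by a margin. Margin and vectors are illustrative.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, margin - sim(anchor, positive) + sim(anchor, negative))."""
    sim_pos = float(anchor @ positive)
    sim_neg = float(anchor @ negative)
    return max(0.0, margin - sim_pos + sim_neg)

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(5)
text_joint = unit(rng.normal(size=4))   # first joint vector (anchor)
image_match = unit(rng.normal(size=4))  # second joint vector of the matching sample
image_other = unit(rng.normal(size=4))  # second joint vector of a mismatched sample
print(triplet_loss(text_joint, image_match, image_other))
```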
In the process of training the retrieval model, this embodiment retains the position information of the units in the first modality samples and the second modality samples, which can improve the accuracy of the retrieval model.
Based on the same inventive concept, the present application further provides an electronic device, please refer to fig. 12, where fig. 12 is a schematic structural diagram of an embodiment of the electronic device, the electronic device includes a memory 210 and a processor 220 coupled to each other, the memory 210 stores program instructions, and the processor 220 can execute the program instructions to implement the cross-modal search method described in any of the embodiments above, or the training method of the cross-modal search model described in any of the embodiments above. For details, reference may be made to the above embodiments, which are not described herein again.
Based on the same inventive concept, the present application further provides a computer-readable storage medium, please refer to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application, and the storage medium 300 stores program instructions 310, and the program instructions 310 can be executed by a processor to implement the cross-modal search method according to any of the above embodiments, or the training method of the cross-modal search model according to any of the above embodiments. For details, reference may be made to the above embodiments, which are not described herein again.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (20)

1. A cross-modal retrieval method, comprising:
inputting a first modality sample as a retrieval target and a plurality of second modality samples as candidate objects into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedded network and a second embedded network;
extracting a first unit vector set of the first modality sample by using the first feature extraction network, and extracting a second unit vector set of each second modality sample by using the second feature extraction network, wherein the first modality sample comprises a plurality of first units, the first unit vector set is composed of the feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set is composed of the feature vectors of all the second units in the corresponding second modality sample;
pooling the first unit vector set by using the pooling network to obtain a first global vector, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein the first global vector contains position information of the first unit in the first modal sample, and each second global vector contains corresponding position information of the second unit in the corresponding second modal sample;
projecting the first global vector to a joint space by using the first embedded network to obtain a first joint vector, and projecting each second global vector to the joint space by using the second embedded network to obtain a plurality of second joint vectors;
and acquiring a cross-modal retrieval result of the first modal sample based on the similarity between the first joint vector and each second joint vector.
2. The method according to claim 1, wherein said pooling the first unit vector set with the pooling network to obtain a first global vector comprises:
embedding the position information of each first unit in the first modal sample into a corresponding feature vector to obtain a first sequence coding vector set;
pooling the first sequential coding vector set by using the pooling network to obtain the first global vector;
and the pooling each second unit vector set respectively by using the pooling network to obtain a plurality of second global vectors comprises:
respectively embedding the position information of each second unit in each second modal sample into corresponding feature vectors to obtain a plurality of second sequence coding vector sets;
and pooling each second sequence encoding vector set by using the pooling network to obtain a plurality of second global vectors.
3. The method according to claim 2, wherein the embedding position information of each first unit in the first modality sample into a corresponding feature vector to obtain a first sequential encoding vector set includes:
generating a first position vector of each first unit based on position information of each first unit in the first modal sample, wherein the first position vector has the same dimension as the corresponding feature vector;
replacing the feature vector of each first unit in the first unit vector set with a corresponding first sequence coding vector to obtain the first sequence coding vector set, wherein the first sequence coding vector of each first unit is the sum of the first position vector and the feature vector of the first unit.
4. The retrieval method according to claim 2, wherein the pooling network comprises a first fully-connected layer, a second fully-connected layer and a third fully-connected layer, and wherein pooling the first sequence coding vector set with the pooling network to obtain the first global vector comprises:
clustering the first sequence coding vector set based on a soft distribution function to obtain a plurality of classes;
obtaining a first pooling feature of each of the classes based on the first fully-connected layer, and obtaining a second pooling feature of each of the classes based on the second fully-connected layer;
and inputting the sum of the first pooled feature and the second pooled feature of all the classes into the third fully-connected layer to obtain the first global vector.
5. The method according to claim 2, wherein the embedding the position information of each second unit in each second modality sample into the corresponding feature vector to obtain a plurality of second sequential encoding vector sets comprises:
generating a second position vector of each second unit based on the position information of each second unit in the corresponding second modality sample, wherein the dimension of the second position vector is the same as that of the corresponding feature vector;
replacing the feature vector of each second unit in each second unit vector set with a corresponding second sequence coding vector to obtain a plurality of second sequence coding vector sets, wherein each second sequence coding vector is the sum of the second position vector of the corresponding second unit and the feature vector.
6. The retrieval method according to claim 2, wherein the pooling network comprises a first fully-connected layer, a second fully-connected layer and a third fully-connected layer, and wherein pooling each second sequence coding vector set with the pooling network to obtain a plurality of the second global vectors comprises:
clustering each second sequence coding vector set based on a soft distribution function to obtain a plurality of classes;
respectively acquiring first pooling features of each class corresponding to each second sequence coding vector set based on the first fully-connected layer, and respectively acquiring second pooling features of each class corresponding to each second sequence coding vector set based on the second fully-connected layer;
and inputting the sum of the first pooling features and the second pooling features of all the classes corresponding to each second sequence coding vector set into the third fully-connected layer respectively to obtain a plurality of second global vectors.
7. The retrieval method according to claim 1, wherein the obtaining the cross-modal retrieval result of the first modal sample based on the similarity between the first joint vector and each of the second joint vectors comprises:
and selecting the second modality samples corresponding to a specified number of second joint vectors with the highest similarity as the cross-modal retrieval result of the first modality sample.
8. A training method of a cross-modal search model is characterized by comprising the following steps:
inputting a plurality of first modality samples and a plurality of second modality samples into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedding network and a second embedding network, and each of the first modality samples and/or each of the second modality samples carries tag data;
respectively extracting a first unit vector set of each first modality sample by using the first feature extraction network, and respectively extracting a second unit vector set of each second modality sample by using the second feature extraction network, wherein each first modality sample comprises a plurality of first units, each first unit vector set consists of the feature vectors of all the first units in the corresponding first modality sample, each second modality sample comprises a plurality of second units, and each second unit vector set consists of the feature vectors of all the second units in the corresponding second modality sample;
pooling each first unit vector set by using the pooling network to obtain a plurality of first global vectors, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein each first global vector contains the position information of the first unit in the corresponding first modal sample, and each second global vector contains the position information of the second unit in the corresponding second modal sample;
projecting each first global vector to a joint space by using the first embedded network to obtain a plurality of first joint vectors, and projecting each second global vector to the joint space by using the second embedded network to obtain a plurality of second joint vectors;
adjusting parameters of the search model based on the first joint vector and the second joint vector.
9. The training method of claim 8, wherein the adjusting parameters of the search model based on the first joint vector and the second joint vector comprises:
determining a first degree of association between the first modality sample and the second modality sample using a degree of similarity between the first joint vector and the second joint vector;
calculating a loss function based on the first degree of association and a second degree of association, wherein the second degree of association is determined from the tag data of the first modality sample and/or the second modality sample;
adjusting parameters of the search model based on the loss function.
10. An electronic device comprising a memory and a processor coupled to each other, the memory storing program instructions, the processor being capable of executing the program instructions to implement the cross-modal search method of any of claims 1-7, or the training method of the cross-modal search model of any of claims 8-9.
11. A computer-readable storage medium, characterized in that the storage medium has stored thereon program instructions executable by a processor to implement the cross-modal search method of any of claims 1-7, or the training method of the cross-modal search model of any of claims 8-9.
12. An electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is capable of executing the program instructions to implement:
inputting a first modality sample as a retrieval target and a plurality of second modality samples as candidate objects into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedding network and a second embedding network;
extracting a first unit vector set of the first modality sample by using the first feature extraction network, and extracting a second unit vector set of each second modality sample by using the second feature extraction network, wherein the first modality sample comprises a plurality of first units, the first unit vector set is composed of the feature vectors of all the first units, each second modality sample comprises a plurality of second units, and each second unit vector set is composed of the feature vectors of all the second units in the corresponding second modality sample;
pooling the first unit vector set by using the pooling network to obtain a first global vector, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein the first global vector contains position information of the first unit in the first modal sample, and each second global vector contains corresponding position information of the second unit in the corresponding second modal sample;
projecting the first global vector to a joint space by using the first embedded network to obtain a first joint vector, and projecting each second global vector to the joint space by using the second embedded network to obtain a plurality of second joint vectors;
and acquiring a cross-modal retrieval result of the first modal sample based on the similarity between the first joint vector and each second joint vector.
13. The electronic device of claim 12, wherein pooling the first set of unit vectors with the pooling network to obtain a first global vector comprises:
embedding the position information of each first unit in the first modal sample into a corresponding feature vector to obtain a first sequence coding vector set;
pooling the first sequential encoding vector set by using the pooling network to obtain the first global vector;
and the pooling each second unit vector set respectively by using the pooling network to obtain a plurality of second global vectors comprises:
respectively embedding the position information of each second unit in each second modal sample into corresponding feature vectors to obtain a plurality of second sequence coding vector sets;
and pooling each second sequence encoding vector set by using the pooling network to obtain a plurality of second global vectors.
14. The electronic device of claim 13, wherein the embedding the position information of each first cell in the first modality sample into a corresponding feature vector results in a first set of sequential encoding vectors, comprising:
generating a first position vector of each first unit based on position information of each first unit in the first modality sample, wherein the first position vector has the same dimension as the corresponding feature vector;
replacing the feature vector of each first unit in the first unit vector set with a corresponding first sequence coding vector to obtain the first sequence coding vector set, wherein the first sequence coding vector of each first unit is the sum of the first position vector and the feature vector of the first unit.
15. The electronic device of claim 13, wherein the pooling network comprises a first fully-connected layer, a second fully-connected layer, and a third fully-connected layer, and wherein pooling the first sequence coding vector set with the pooling network to obtain the first global vector comprises:
clustering the first sequence coding vector set based on a soft distribution function to obtain a plurality of classes;
obtaining a first pooling feature of each of the classes based on the first fully-connected layer and a second pooling feature of each of the classes based on the second fully-connected layer;
and inputting the sum of the first pooled feature and the second pooled feature of all the classes into the third fully-connected layer to obtain the first global vector.
16. The electronic device according to claim 13, wherein the embedding the position information of each second unit in each second modality sample into the corresponding feature vector to obtain a plurality of second sequential encoding vector sets comprises:
generating a second position vector of each second unit based on the position information of each second unit in the corresponding second modality sample, wherein the dimension of the second position vector is the same as that of the corresponding feature vector;
replacing the feature vector of each second unit in each second unit vector set with a corresponding second sequence coding vector to obtain a plurality of second sequence coding vector sets, wherein each second sequence coding vector is the sum of the second position vector and the feature vector of a corresponding second unit.
17. The electronic device of claim 13, wherein the pooling network comprises a first fully-connected layer, a second fully-connected layer, and a third fully-connected layer, and wherein pooling each second sequence coding vector set with the pooling network to obtain a plurality of second global vectors comprises:
clustering each second sequence coding vector set based on a soft distribution function to obtain a plurality of classes;
respectively acquiring first pooling features of each class corresponding to each second sequence coding vector set based on the first fully-connected layer, and respectively acquiring second pooling features of each class corresponding to each second sequence coding vector set based on the second fully-connected layer;
and respectively inputting the sum of the first pooling features and the second pooling features of all the classes corresponding to each second sequence coding vector set into the third fully-connected layer to obtain a plurality of second global vectors.
18. The electronic device of claim 12, wherein the obtaining cross-modal search results for the first modal sample based on the similarity between the first joint vector and each of the second joint vectors comprises:
and selecting the second modality samples corresponding to a specified number of second joint vectors with the highest similarity as the cross-modal retrieval result of the first modality sample.
19. An electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is capable of executing the program instructions to implement:
inputting a plurality of first modality samples and a plurality of second modality samples into a retrieval model, wherein the retrieval model comprises a first feature extraction network, a second feature extraction network, a pooling network, a first embedding network and a second embedding network, and each of the first modality samples and/or each of the second modality samples carries tag data;
respectively extracting a first unit vector set of each first modality sample by using the first feature extraction network, and respectively extracting a second unit vector set of each second modality sample by using the second feature extraction network, wherein each first modality sample comprises a plurality of first units, each first unit vector set consists of the feature vectors of all the first units in the corresponding first modality sample, each second modality sample comprises a plurality of second units, and each second unit vector set consists of the feature vectors of all the second units in the corresponding second modality sample;
pooling each first unit vector set by using the pooling network to obtain a plurality of first global vectors, and pooling each second unit vector set by using the pooling network to obtain a plurality of second global vectors, wherein each first global vector contains the position information of the first unit in the corresponding first modal sample, and each second global vector contains the position information of the second unit in the corresponding second modal sample;
projecting each first global vector to a joint space by using the first embedded network to obtain a plurality of first joint vectors, and projecting each second global vector to the joint space by using the second embedded network to obtain a plurality of second joint vectors;
adjusting parameters of the search model based on the first joint vector and the second joint vector.
20. The electronic device of claim 19, wherein the adjusting parameters of the search model based on the first joint vector and the second joint vector comprises:
determining a first degree of association between the first modality sample and the second modality sample using a degree of similarity between the first joint vector and the second joint vector;
calculating a loss function based on the first degree of association and a second degree of association, wherein the second degree of association is determined from the tag data of the first modality sample and/or the second modality sample;
adjusting parameters of the search model based on the loss function.
CN202180050439.3A 2020-08-31 2021-06-11 Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment Pending CN115885274A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063072549P 2020-08-31 2020-08-31
US63/072,549 2020-08-31
PCT/CN2021/099721 WO2022041940A1 (en) 2020-08-31 2021-06-11 Cross-modal retrieval method, training method for cross-modal retrieval model, and related device

Publications (1)

Publication Number Publication Date
CN115885274A true CN115885274A (en) 2023-03-31

Family

ID=80354146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180050439.3A Pending CN115885274A (en) 2020-08-31 2021-06-11 Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment

Country Status (2)

Country Link
CN (1) CN115885274A (en)
WO (1) WO2022041940A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391578A (en) * 2022-08-03 2022-11-25 北京乾图科技有限公司 Cross-modal image-text retrieval model training method and system
CN115858839B (en) * 2023-02-16 2023-05-30 上海蜜度信息技术有限公司 Cross-modal LOGO retrieval method, system, terminal and storage medium
CN116431788B (en) * 2023-04-14 2024-03-29 中电科大数据研究院有限公司 Cross-modal data-oriented semantic retrieval method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8571850B2 (en) * 2007-09-13 2013-10-29 Microsoft Corporation Dual cross-media relevance model for image annotation
CN106202413B (en) * 2016-07-11 2018-11-20 北京大学深圳研究生院 A kind of cross-media retrieval method
CN107273517B (en) * 2017-06-21 2021-07-23 复旦大学 Graph-text cross-modal retrieval method based on graph embedding learning
CN107832351A (en) * 2017-10-21 2018-03-23 桂林电子科技大学 Cross-module state search method based on depth related network
CN110851641B (en) * 2018-08-01 2022-09-16 杭州海康威视数字技术股份有限公司 Cross-modal retrieval method and device and readable storage medium
CN109783655B (en) * 2018-12-07 2022-12-30 西安电子科技大学 Cross-modal retrieval method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2022041940A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
US11062179B2 (en) Method and device for generative adversarial network training
CN115885274A (en) Cross-modal retrieval method, training method of cross-modal retrieval model and related equipment
CN112633419B (en) Small sample learning method and device, electronic equipment and storage medium
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
CN112069319B (en) Text extraction method, text extraction device, computer equipment and readable storage medium
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
Pan et al. Product quantization with dual codebooks for approximate nearest neighbor search
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN114519120A (en) Image searching method and device based on multi-modal algorithm
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN115062134A (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN113254687B (en) Image retrieval and image quantification model training method, device and storage medium
CN117056474A (en) Session response method and device, electronic equipment and storage medium
Rad et al. A multi-view-group non-negative matrix factorization approach for automatic image annotation
Lotfi et al. Automatic image annotation using quantization reweighting function and graph neural networks
CN113157891B (en) Knowledge graph path ordering method, system, equipment and storage medium
CN117473071B (en) Data retrieval method, device, equipment and computer readable medium
CN117009532B (en) Semantic type recognition method and device, computer readable medium and electronic equipment
CN115146143A (en) Retrieval method, device, equipment and computer readable storage medium
CN116842967A (en) Sentence generation method, device, equipment and storage medium based on cluster search
CN117786214A (en) Data processing method, device, equipment and storage medium
CN114494774A (en) Image classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination