CN115309930A

CN115309930A - Cross-modal retrieval method and system based on semantic identification

Info

Publication number: CN115309930A
Application number: CN202210875146.5A
Authority: CN
Inventors: 董西伟; 潘方; 严军荣
Original assignee: Sunwave Communications Co Ltd
Current assignee: Sunwave Communications Co Ltd
Priority date: 2022-07-25
Filing date: 2022-07-25
Publication date: 2022-11-08

Abstract

The invention discloses a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification, wherein the method comprises the following steps: acquiring an image feature space, a text feature space and a semantic category label; establishing modality specific feature representation and shared feature representation of an image modality and a text modality; establishing a generation model according to the semantic similarity between the modalities and in the modalities; constructing an identification model of an image mode and a text mode; training a network according to a countermeasure mechanism between the generation model and the identification model and solving network parameters; acquiring the characteristics of a query sample and the characteristics of a sample in a retrieval sample set according to network parameters; calculating the Euclidean distance from the query sample to each sample in the retrieval sample set; a query sample is retrieved using a cross-modality retriever. The invention solves the problem of reduced cross-modal retrieval efficiency caused by the failure of effectively mining complementary information of multi-modal data in the related technology.

Description

Cross-modal retrieval method and system based on semantic identification

Technical Field

The invention belongs to the technical field of multimedia, and particularly relates to a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification.

Background

In the internet and daily life, people are faced with large data in multimodal forms such as images, text, video, audio, and the like. The existence of huge multimodal databases has greatly stimulated the need for cross-modal retrieval by search engines or digital libraries, such as searching for relevant images with text queries, or relevant videos with audio queries. Unlike conventional single-modality retrieval tasks (e.g., image retrieval) which require both the query sample and the retrieval result to belong to the same modality, cross-modality retrieval is a more flexible application which can provide queries to any modality to find relevant information in different modalities.

Since data of different modalities usually have inconsistent distribution and representation, similarity of data of different modalities cannot be directly measured. To address this problem, many cross-modal retrieval methods have emerged. The traditional method mainly mines the correlation of different modality data by learning linear projection, for example, a method based on typical correlation analysis. With the rapid development of Deep learning technology, a method based on Deep Neural Network (DNN) has become a mainstream method for closing modal gap. Most of the existing methods focus on mining of modality shared information, data of different modalities are mapped to a public space to obtain a public representation, mining and utilization of modality specific information are not considered, and therefore multi-modality data complementary information cannot be effectively mined, and cross-modality retrieval efficiency is reduced.

In order to effectively mine multi-modal data complementary information and solve the problem of modal gap of paired modal characteristics of different modes, a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification are provided.

Disclosure of Invention

The embodiment of the invention provides a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification, which at least solve the problem of cross-modal retrieval efficiency reduction caused by the fact that complementary information of multi-modal data cannot be effectively mined in the related technology.

According to an embodiment of the present invention, there is provided a cross-modal retrieval method based on semantic identification, including:

acquiring an image feature space, a text feature space and a semantic category label;

establishing modality specific feature representation and shared feature representation of an image modality and a text modality;

establishing a generation model according to the semantic similarity between the modalities and in the modalities;

constructing an identification model of an image mode and a text mode;

training a network according to a countermeasure mechanism between the generation model and the identification model and solving network parameters;

acquiring the characteristics of the query sample and the characteristics of the sample in the retrieval sample set according to the network parameters;

calculating the Euclidean distance from the query sample to each sample in the retrieval sample set;

query samples are retrieved using a cross-modality retriever.

In an exemplary embodiment, the obtaining of the image feature space, the text feature space and the semantic class labels includes the steps of:

the N image mode sample characteristics for obtaining the training data set are I = [ I = ₁ ,...,i _n ,...,i _N ]N text mode samples are characterized by T = [ T ] ₁ ,...,t _n ,...,t _N ]Wherein i _n Is the nth sample feature, t, in the image modality sample feature data set I _n Is the nth sample feature in the text modality sample feature dataset T;

determining a set of image modality features and text modality feature instances

Each instance of which o _n ＝(i _n ,t _n ) Includes an image mode feature vector

And one isText modal feature vector

d _i And d _t Respectively representing characteristic dimensions of an image modality and a text modality, and d _i ≠d _t ；

Determining an image modality feature dataset

And text modal feature data set

Defining semantic class labels l _n ＝[l _n1 ,...,l _nc ,...,l _nC ] ^T Represents the class of the nth sample, where C represents the total number of sample classes, and if the nth sample belongs to class C, then l _nc =1, otherwise _nc ＝0。

In an exemplary embodiment, the establishing modality-specific feature representations and the shared feature representations of the image modality and the text modality includes the steps of:

learning mode-specific feature representations of image and text modalities using two feed-forward sub-networks, i.e., learning the mode-specific feature representations of the image and text modalities using sub-networks having three layers in the image and text modalities, respectively

And

wherein f is _I (I；θ _I ) And f _T (T；θ _T ) Mapping function, theta, for image modality and text modality, respectively _I And theta _T Parameters of three-layer sub-networks of an image mode and a text mode are respectively set;

using a common sub-network to learn a mode-sharing feature representation for each mode, i.e. using a shared two-layer sub-network _f And T _f Feature representation mapping to a shared space for learning modality sharing

And

wherein

And

in order to share the mapping function(s),

parameters for two-layer subnetworks to share, d _f ＝d _s 。

In an exemplary embodiment, the creating a generative model based on semantic similarity between modalities and within modalities includes the steps of:

using a forward single-layer subnetwork as a classifier to predict labels;

performing inter-modal and intra-modal semantic similarity modeling based on depth metric learning;

the modality-specific feature representations and the corresponding modality-sharing feature representations are distinguished and an overall loss function of the generative model is calculated.

In an exemplary embodiment, the label prediction using a forward single-layer subnetwork as a classifier comprises the steps of:

using a forward single-layer subnetwork activated by Softmax as a classifier, such that the inputs

Or

Time, output the corresponding probability distribution of the semantic class

Or

Based on the probability distribution, a semantic discrimination loss function is constructed as

Where upsilon represents the parameter of the classifier and U represents the number of instances in each mini-batch.

In an exemplary embodiment, the depth metric learning-based inter-modal and intra-modal semantic similarity modeling includes the steps of:

calculating Euclidean distances among the features to measure the similarity of the features, and requiring enhancing the similarity of the features with the same semantic category and reducing the similarity of the features with different semantic categories;

for each pair of examples o _u And o _v The distance between the features is defined as

Establishing a contrast loss function

Where h (x) = max (0, x) is a change loss function, E = { (u, v) } is a set of paired indices of feature pairs in each mini-batch having the same semantic label, D = { (u, v) } is a set of paired indices of feature pairs in each mini-batch having different semantic labels, | E | and | D | represent the sizes of the set E and the set D, respectively, and τ is a positive threshold.

In an exemplary embodiment, the distinguishing modality-specific feature representations and corresponding modality-sharing feature representations and calculating a total loss function of the generative model comprises the steps of:

using large interval loss functions

To distinguish between modality-specific feature representations and corresponding modality-shared feature representations, so as to haveEfficiently learning complementary information in different modality data, where h (x) = max (0, x), ζ is a positive threshold;

large space loss function

Using threshold value ζ vs. distance

And

applying constraints, requiring distance

And

greater than ζ, the modality-specific feature representation is distinguished from the modality-shared feature representation.

Combining the semantic identification loss function, the contrast loss function and the large interval loss function to obtain a total loss function of the generated model

Wherein alpha and beta are balance factors for balancing

L in (1) _sd 、L _c And L _lm 。

In an exemplary embodiment, the constructing the authentication model of the image modality and the text modality includes the steps of:

constructing a mode classifier by using a sub-network with two layers and using the mode classifier as a competitor;

assigning an independent modality label vector to each item in each instance;

the penalty-fighting function was constructed as:

wherein the content of the first and second substances,

is a generated feature representation

The probability of the mode shape of (a),

is a generated feature representation

Modal probability of (a), [ theta ] _A Parameters representing a modality classifier, g _u Is example o _u Each item in _u Or t _u The true modality tag of (1).

In an exemplary embodiment, the training network to obtain the features of the query sample and to retrieve the features of the samples in the sample set according to the countermeasure between the generative model and the discriminative model comprises the steps of:

obtaining optimal feature representation by minimizing loss functions of generative model and discriminant model jointly, i.e. adopting min-max game to perform two concurrent sub-processes

And

optimizing:

assume that a query sample of an image modality has a feature vector of

The feature vector of a query sample of the text modality is

The image mode searches the sample in the sample set and has the characteristics of

The text modal search sample set is characterized by

Wherein the content of the first and second substances,

representing the number of samples in the search sample set;

the modality specific feature representation and the modality sharing feature representation of the image modality query sample are respectively:

and

the modal-specific feature representation and the modal-shared feature representation of the text modal query sample are respectively as follows:

and

the modality specific feature representation and modality shared feature representation of the samples in the image modality retrieval sample set are respectively:

and

wherein the content of the first and second substances,

and

the modal specific feature representation and the modal sharing feature representation of the samples in the text modal retrieval sample set are respectively as follows:

and

wherein the content of the first and second substances,

and

in one exemplary embodiment, the calculating the euclidean distance between the query sample and each sample in the retrieved sample set comprises the steps of:

query sample for image modalities

Computing query samples for image modalities using distance calculation formulas

Retrieving each sample in a sample set to a text modality

Of (2) is

Wherein the content of the first and second substances,

representing query samples

The combined modality-specific feature representation and modality-shared feature representation of

The euclidean distance of the combined modality-specific feature representation and modality-sharing feature representation of (a);

query sample for text modality

Computing query samples for text modalities using distance computation formulas

Retrieving each sample in a set of samples to an image modality

Is a distance of

Wherein the content of the first and second substances,

representing query samples

The modality-specific feature representation and the modality-sharing feature representation of (2) are combined to form a euclidean distance of the feature.

In one exemplary embodiment, the retrieving a query sample using a cross-modality retriever includes the steps of:

for the calculated distance

Sequencing according to the sequence from small to large, and taking samples corresponding to the first K minimum distances in the text retrieval sample set as retrieval results, wherein K is a set query parameter;

for the calculated distance

And sequencing according to the sequence from small to large, and taking samples corresponding to the first K minimum distances in the image retrieval sample set as retrieval results.

According to another embodiment of the present invention, there is provided a cross-modal search system based on semantic discrimination, including:

a processor;

a memory;

and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.

The cross-modal retrieval method and the cross-modal retrieval system based on semantic identification have the advantages that:

(1) The mode gap of paired mode sharing features from different modes can be reduced by first learning the mode specific feature representation of each mode using two feed-forward subnetworks, then learning the mode sharing feature representation of each mode using a common subnetwork, and combining the learned mode specific feature representations with the shared feature representation, which facilitates efficient cross-mode retrieval.

(2) The countermeasures are adopted for network training, a generation model in the network is used for learning and predicting semantic labels of the mode specific feature representation and the mode sharing feature representation, modeling similarity between modes and similarity in the modes based on label information, and difference between the mode specific feature representation and the mode sharing feature representation is ensured, so that the learned features have semantic distinctiveness between the modes and in the modes, and complementary information of multi-mode data can be effectively mined.

(3) And performing network training by adopting a countermeasure mechanism, wherein the authentication model in the network is used for learning the modal information of the shared characteristic of the authentication modes so as to improve the modal invariance.

(4) The mode classifier of the sub-network with two layers is used as a countermeasure establishing mode classifier, the mode information in the unknown mode sharing feature representation is identified, and the difference between the modes can be effectively reduced.

Drawings

FIG. 1 is a flow chart of a cross-modal retrieval method based on semantic identification according to an embodiment of the present invention;

FIG. 2 is a flow chart of substep S01 of an embodiment of the present invention;

FIG. 3 is a flow chart of substep S02 of an embodiment of the present invention;

FIG. 4 is a flow chart of substep S03 of an embodiment of the present invention;

FIG. 5 is a flow chart of sub-step S031 of an embodiment of the present invention;

FIG. 6 is a flowchart of sub-step S032 according to an embodiment of the present invention;

fig. 7 is a flow chart of sub-step S033 of an embodiment of the present invention;

FIG. 8 is a flow chart of substep S04 of an embodiment of the present invention;

FIG. 9 is a flow chart of substep S05 of an embodiment of the present invention;

FIG. 10 is a flowchart of substep S06 of an embodiment of the present invention;

FIG. 11 is a flow chart of substep S07 of an embodiment of the present invention;

FIG. 12 is a structural diagram of a cross-modal search system based on semantic identification according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art to further understand the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by one skilled in the art without departing from the spirit of the invention. All falling within the scope of the present invention.

The invention provides a cross-modal retrieval method based on semantic identification, which is shown in a flow chart of figure 1 and comprises the following steps:

s01, acquiring an image feature space, a text feature space and a semantic category label;

s02, establishing modality specific feature representation and shared feature representation of an image modality and a text modality;

s03, establishing a generation model according to the semantic similarity between the modalities and in the modalities;

s04, constructing an identification model of an image mode and a text mode;

s05, training a network to obtain the characteristics of the query sample and the characteristics of the sample in the retrieval sample set according to a countermeasure mechanism between the generation model and the identification model;

s06, calculating Euclidean distances from the query sample to each sample in the retrieval sample set;

and S07, retrieving the query sample by using a cross-modal retriever.

In an exemplary embodiment, the step S01, shown in fig. 2, includes:

step S011, acquiring N image mode sample characteristics of a training data set as I = [ I = ₁ ,...,i _n ,...,i _N ]N text mode samples are characterized by T = [ T ] ₁ ,...,t _n ,...,t _N ]Wherein i _n Is the nth sample feature, t, in the image modality sample feature data set I _n Is the nth sample feature in the text modality sample feature data set T;

step S012, determining image mode characteristic and text mode characteristic example set

And a text modal feature vector

Step S013, determining image mode characteristic data set

And text modal feature data set

Step S014, defining semantic category label l _n ＝[l _n1 ,...,l _nc ,...,l _nC ] ^T Represents the class of the nth sample, wherein C represents the total number of sample classes, and if the nth sample belongs to class C, then l _nc =1, otherwise l _nc ＝0。

In an exemplary embodiment, the step S02, as shown in fig. 3, includes:

step S021, adopting two feedforward sub-networks to learn the mode specific feature representation of the image mode and the text mode, namely, respectively using the sub-networks with three layers to learn the mode specific feature representation of the image mode and the text mode in the image mode and the text mode

And

step S022, using a common sub-network to learn the mode sharing feature representation of each mode, i.e. using a shared two-layer sub-network _f And T _f Feature representation mapping to a shared space for learning modality sharing

And

wherein

And

in order to share the mapping function(s),

for parameters of a shared two-layer subnetwork, d _f ＝d _s 。

In the embodiment, the three fully-connected layers with the dimensionality [1024,512,128 ] are respectively used in the image modality and the text modality]Three-layer sub-network learning of modality-specific feature representations of image modality and text modality

And

wherein f is _I (I；θ _I ) And f _T (T；θ _T ) Mapping functions, theta, for image modality and text modality, respectively _I And theta _T Parameters of three-layer sub-networks of an image mode and a text mode are respectively.

With shared layers made up of two fully-connected layers and having dimensions [128,128 ]]Two-layer sub-network of I _f And T _f Feature representation mapping to a shared space for learning modality sharing

And

wherein

And

in order to share the mapping function(s),

for parameters of a shared two-layer subnetwork, d _f ＝d _s 。

In an exemplary embodiment, the step S03, shown in fig. 4, includes:

step S031, label prediction is carried out by using a forward single-layer sub-network as a classifier;

step S032, performing inter-modal and intra-modal semantic similarity modeling based on depth measurement learning;

step S033, distinguishing the modality specific feature representation and the corresponding modality shared feature representation, and calculating a total loss function of the generative model.

In an exemplary embodiment, the sub-step S031, whose flow chart is shown in fig. 5, includes:

step S0311, using the forward single-layer subnetwork activated by Softmax as a classifier, such that the input is

Or

Time, output the corresponding probability distribution of the semantic class

Or

Step S0312, according to the probability distribution, construct the semantic discrimination loss function as

In this embodiment, in order to make the generated features semantically discriminative, a forward single-layer subnetwork activated by Softmax is used as a classifier, so that the input is

Or

When it is, canOutputting corresponding probability distributions for semantic categories

Or

From the probability distribution, the present embodiment defines the semantic discrimination loss as shown in equation (1):

where υ represents the parameters of the classifier and U represents the number of instances in each small batch. In the embodiment of the present invention, the dimension of the single-layer network is equal to the number of semantic categories.

In an exemplary embodiment, the step of judging whether the sample semantics are related between the modalities is to calculate a semantic relevance value according to the number of semantic relevant words and/or the similarity degree of semantic keywords and/or the similarity degree of semantic objects and judge the semantic relevance of the samples according to the fact that the semantic relevance value is larger than a preset semantic relevance threshold.

The number of semantic associated words refers to any item of the number or the proportion of semantically associated texts, and is represented by a variable p.

The similarity degree of the semantic keywords is expressed by a variable q according to the ratio of the number of the same semantic keywords to the total number of the semantic keywords or any item of the influence coefficient of the same semantic keywords on the semantics.

The semantic object similarity is expressed by a variable w according to a ratio of the number of the same semantic objects to the total number of the semantic objects or any item of influence coefficients of the same semantic objects on semantics.

The semantic relevance value is calculated according to the number of semantic associated words and/or the similarity degree of semantic keywords and/or the similarity degree of semantic objects, and the method comprises the following steps: calculating a semantic association value according to a positive correlation of the number of semantic associated words and the semantic association value, calculating a semantic association value according to a positive correlation of the similarity of semantic keywords and the semantic association value, calculating a semantic association value according to a positive correlation of the similarity of semantic objects and the semantic association value, calculating a semantic association value according to the number of semantic associated words and the positive correlation of the similarity of semantic objects and the semantic association value, calculating a semantic association value according to the similarity of semantic keywords and the positive correlation of the similarity of semantic objects and the semantic association value, calculating any one of the semantic association values according to the number of semantic associated words and the positive correlation of the similarity of semantic keywords and the similarity of semantic objects and the semantic association value, and expressing the semantic association value by using a variable z.

A1-A7 in the table A represent different implementation modes for calculating the semantic relevance value, wherein the number p of semantic relevant words, the similarity degree q of semantic keywords and the similarity degree w of semantic objects related in the table A are obtained by adopting formulas in the implementation modes.

Table a different embodiment for calculating a semantic relevance value

In this embodiment, a preset semantic correlation threshold Z =0.7, and a semantic relevance value Z (e.g., A7) > Z of a modal sample is calculated according to any one of table a, so that the semantic correlation of the sample between modalities is determined.

In an exemplary embodiment, the sub-step S032, shown in fig. 6, includes the following steps:

step S0321, calculate the Euclidean distance between features, i.e. for each pair of instances o _u And o _v The distance between its features is defined as

Step S03222, establishing a contrast loss function

Where h (x) = max (0, x) is a change loss function, E = { (u, v) } is a pair-wise index set of feature pairs in each minibatch having the same semantic label, D = { (u, v) } is a pair-wise index set of feature pairs in each minibatch having different semantic labels, | E | and D denote the sizes of the set E and the set D, respectively, and τ is a positive threshold.

In this embodiment, based on the idea of depth metric learning, the similarity of features is measured by calculating the euclidean distance between the features, and it is required to enhance the similarity of features having the same semantic category and reduce the similarity of features having different semantic categories. For each pair of examples o _u And o _v The distance between features is defined as:

d _c (u, v) not only describes intra-modal distances between modality-specific feature representations, but also characterizes modalities between modality-shared feature representationsDistance between each other

Establishing a contrast loss function as shown in equation (3):

where h (x) = max (0, x) is a change loss function, E = { (u, v) } is a pair-wise index set of feature pairs in each mini-batch having the same semantic label, D = { (u, v) } is a pair-wise index set of feature pairs in each mini-batch having different semantic labels, | E | and | D | represent the sizes of the set E and the set D, respectively, and τ is a positive threshold. In the embodiment of the present invention, a grid search strategy is used to adjust the hyper-parameter τ to have a search range of [1,10] and a step length of 1, and specifically, it is set to τ =6.

Contrast loss function of equation (3)

For distance d _c (u, v) constraints are imposed such that the feature distance of pairs of features within a class is less than a threshold τ and the feature distance of pairs of features between classes is greater than τ, thereby facilitating the learning of discriminative features.

In an exemplary embodiment, the step S033 is a flowchart as shown in fig. 7, including:

step S0331, using a large interval loss function

Distinguishing modality-specific feature representations from corresponding modality-shared feature representations in order to efficiently learn complementary information in different modality data, wherein h (x) = max (0, x), ζ is a positive threshold;

step S0332, large interval loss function

Using threshold ζ versus distance

And

applying constraints, requiring distances

And

greater than ζ, distinguishing modality-specific feature representations from modality-shared feature representations;

step S0333, obtaining a total loss function of the generated model by combining the semantic discrimination loss function, the contrast loss function and the large-interval loss function

Wherein alpha and beta are balance factors for balancing

L in (1) _sd 、L _c And L _lm 。

In this embodiment, there should be a difference between the modality-specific feature representation and the corresponding modality-shared feature representation for a particular image or text. The embodiment of the invention is designed to use the large interval loss shown as the formula (4) to distinguish the mode specific feature representation and the corresponding mode sharing feature representation so as to effectively learn complementary information in different mode data:

where h (x) = max (0, x), ζ is a positive threshold. In the embodiment of the invention, the grid search strategy is used for adjusting the searching range of the over-parameter zeta to be 1,10]Step size is 1, specifically it is set to ζ =5; distance between two adjacent plates

And

are respectively defined as:

and

large gap loss of the above formula (4)

Using threshold value ζ vs. distance

And

applying constraints, requiring distances

And

larger than ζ, act as a feature component discriminator, i.e. distinguish a modality-specific feature representation from a modality-shared feature representation;

combining the semantic discrimination loss (equation 1), the contrast loss (equation 3) and the large interval loss (equation 4) to obtain a total loss function of the generated model:

wherein alpha and beta are balance factors for balancing

L in (1) _sd 、L _c And L _lm These three terms. In the embodiment of the invention, a grid search strategy is used for adjusting balance factors alpha and beta, and the search range of alpha and beta is [0.01,100%]The step size is a multiple of 10, specifically, α =0.1 and β =0.1 are set.

In an exemplary embodiment, the step S04, shown in fig. 8, includes:

step S041, constructing a mode classifier by using a sub-network with two layers and using the mode classifier as a competitor;

step S042, distributing an independent modal label vector for each item in each instance;

step S043, constructing a resistance loss function as follows:

wherein the content of the first and second substances,

is a generated feature representation

The probability of the mode shape of (a),

is a generated feature representation

Modal probability of θ _A Parameter, g, representing a modality classifier _u Is example o _u Each item i in _u Or t _u The true modality tag of (1).

In this embodiment, to reduce the difference between modalities, the present invention uses a sub-network with two layers to construct a modality classifier as a competitor, and the goal of the constructed modality classifier is to identify modality information in the unknown modality sharing feature representation. In the embodiment of the invention, a modal classifier is constructed by using a two-layer network which has the dimension of [64,2] and is activated by a ReLU function and is used as a competitor, and a Softmax activation function is used behind the last layer;

assigning an independent modality label vector to each item in each instance to indicate whether it belongs to an image modality or a text modality;

constructing the resistance loss function is shown in equation (6):

wherein, the first and the second end of the pipe are connected with each other,

is a generated feature representation

The probability of the mode shape of (a),

is the generated feature representation

Modal probability of (a), theta _A Parameters representing a modality classifier, g _u Is example o _u Each item (i) in (b) _u Or t _u ) The real modality tag of (1).

Using the penalty function L of equation (6) above _adv (θ _A ) The heterogeneous difference between the modes can be effectively reduced.

In an exemplary embodiment, the step S05, as shown in fig. 9, includes:

step S051, obtaining the optimal feature representation through the loss function of the combined minimization generation model and the identification model, namely, adopting the min-max game to carry out the following two concurrent sub-processes

And

optimizing;

step S052, recording the feature vector of a query sample of the image mode as

The feature vector of a query sample of the text modality is noted as

The text modal search sample set is characterized by

representing the number of samples in the search sample set;

step S053, the modality specific feature representation and the modality shared feature representation of the image modality query sample are respectively:

and

and

step S054, the modality specific feature representation and the modality sharing feature representation of the image modality retrieval sample set are respectively:

and

and

the modal specific feature representation and modal shared feature representation of the samples in the text modal search sample set are respectively as follows:

and

and

in this embodiment, the optimal feature representation is obtained by jointly minimizing the loss functions of the generative model and the identification model, and since the optimization objectives of the generative model and the identification model are opposite, the method adopts the min-max game to optimize the following two concurrent sub-processes, which are respectively expressed by the following equations (7) and (8):

the min-max game described above is implemented using a random gradient descent algorithm. For better min-max optimization, the embodiment of the present invention adds a Gradient inversion Layer (GRL) before the first Layer of the modality classifier, and the embodiment of the present invention sets the Batch Size (Batch Size) of the data set to 128 during the training process.

Assume that a query sample of an image modality has a feature vector of

Image modality retrieval sample setIs characterized by

Wherein the content of the first and second substances,

representing the number of samples in a search sample set

Assume that a query sample of text modalities has a feature vector of

The text modal search sample set is characterized by

representing the number of samples in the search sample set;

the modality-specific feature representation and the modality-shared feature representation of the image modality query sample are respectively:

and

the modality specific feature representation and the modality sharing feature representation of the text modality query sample are respectively as follows:

and

and

wherein the content of the first and second substances,

and

the modality specific feature representation and the modality sharing feature representation of the samples in the text modality retrieval sample set are respectively as follows:

and

and

in an exemplary embodiment, the step S06, shown in fig. 10, includes:

step S061, query sample for image modality

Computing query samples for image modalities using distance computation formulas

Retrieving each sample in a sample set to a text modality

Is a distance of

Wherein the content of the first and second substances,

representing query samples

step S062, query sample for text mode

Computing query samples for text modalities using distance computation formulas

Retrieving each sample in a set of samples to an image modality

Of (2) is

Wherein the content of the first and second substances,

representing query samples

By a modality-specific feature representation and a modality-shared feature representation of the user

In this embodiment, query samples for image modalities

Retrieving each sample in a sample set to a text modality

Is a distance of

Wherein the content of the first and second substances,

representing query samples

The modality-specific feature representation and the modality-sharing feature representation of (2) represent the euclidean distance of the combined features. Query sample for text modality

Computing query samples for text modalities using distance computation formulas

Retrieving each sample in a set of samples to an image modality

Of (2) is

Wherein the content of the first and second substances,

representing query samples

Modality-specific feature representation and modality sharing features ofCharacterizing features and samples after union

In an exemplary embodiment, the step S07, shown in fig. 11, includes:

step S071 of calculating the obtained distance

Sequencing according to the sequence from small to large, and taking samples corresponding to the first K minimum distances in a text retrieval sample set as retrieval results, wherein K is a set query parameter;

step S072, the calculated distance

In this embodiment, the calculated distance is compared

Sorting according to the sequence from small to large, and then taking samples corresponding to the front K minimum distances in a text retrieval sample set as retrieval results; for the calculated distance

And sorting according to the sequence from small to large, and then taking samples corresponding to the first K minimum distances in the image retrieval sample set as retrieval results.

A cross-modal retrieval system based on semantic identification according to an embodiment of the present invention is shown in fig. 12, and includes:

a processor;

a memory;

and

Of course, those skilled in the art should realize that the above embodiments are only used for illustrating the present invention, and not as a limitation of the present invention, and that changes and modifications to the above embodiments are within the scope of the present invention.

Claims

1. A cross-modal retrieval method based on semantic identification is characterized by comprising the following steps:

constructing an identification model of an image mode and a text mode;

training a network according to a countermeasure mechanism between the generation model and the identification model to obtain the characteristics of the query sample and the characteristics of the sample in the retrieval sample set;

a query sample is retrieved using a cross-modality retriever.

2. The semantic identification-based cross-modal retrieval method of claim 1, wherein the obtaining of the image feature space, the text feature space and the semantic class labels comprises the steps of:

the N image mode sample characteristics for obtaining the training data set are I = [ I = ₁ ,...,i _n ,...,i _N ]N text mode samples are characterized by T = [ T ] ₁ ,...,t _n ,...,t _N ]Wherein i _n Is the nth sample feature, t, in the image modality sample feature dataset I _n Is the first in the text modal sample feature data set Tn sample features;

Wherein each instance o _n ＝(i _n ,t _n ) Includes an image mode feature vector

And a text modal feature vector

Determining an image modality feature dataset

And text modal feature data set

Defining semantic class labels l _n ＝[l _n1 ,...,l _nc ,...,l _nC ] ^T Represents the class of the nth sample, where C represents the total number of sample classes, and if the nth sample belongs to class C, then l _nc =1, otherwise l _nc ＝0。

3. The semantic identification-based cross-modal retrieval method of claim 2, wherein the establishing of the modal-specific feature representation and the shared feature representation of the image modality and the text modality comprises the steps of:

And

wherein f is _I (I；θ _I ) And f _T (T；θ _T ) Mapping functions, theta, for image modality and text modality, respectively _I And theta _T Parameters of three-layer sub-networks of an image mode and a text mode are respectively set;

using a common subnetwork to learn a mode-sharing feature representation for each mode, i.e. using two layers of shared subnetworks _f And T _f Feature representation mapping to a shared space for learning modality sharing

And

wherein

And

in order to share the mapping function(s),

for parameters of a shared two-layer subnetwork, d _f ＝d _s 。

4. The semantic identification-based cross-modal search method of claim 3, wherein the modeling of the generation based on the semantic similarity between and within modalities comprises the steps of:

using a forward single-layer subnetwork as a classifier to perform label prediction;

the modality-specific feature representations and the corresponding modality-sharing feature representations are distinguished and a total loss function of the generative model is calculated.

5. The semantic discrimination based cross-modal retrieval method of claim 4 wherein the label prediction using a forward single-layer subnetwork as a classifier comprises the steps of:

Or

Time, output the corresponding probability distribution of the semantic class

Or

6. The cross-modal search method based on semantic identification according to claim 5, wherein the depth metric learning based semantic similarity modeling between and within modalities comprises the steps of:

computing the Euclidean distance between features, i.e. o for each pair of instances _u And o _v The distance between the features is defined as

Establishing a contrast loss function

Where h (x) = max (0, x) is a change loss function, E = { (u, v) } is a pair-wise index set of feature pairs having the same semantic label in each mini-batch, D = { (u, v) } is a pair-wise index set of feature pairs having different semantic labels in each mini-batch, | E | and | D | represent the sizes of the set E and the set D, respectively, and τ is a positive threshold.

7. The semantic discrimination based cross-modality retrieval method according to claim 6, wherein the distinguishing of modality specific feature representations and corresponding modality shared feature representations and calculating of the overall loss function of the generative model comprises the steps of:

using large interval loss functions

To distinguish between modality-specific feature representations and corresponding modality-shared feature representations in order to efficiently learn complementary information in different modality data, where h (x) = max (0, x), ζ is a positive threshold;

large space loss function

Using threshold value ζ vs. distance

And

applying constraints, requiring distances

And

greater than ζ, representing and modeling the mode-specific featuresState sharing feature representation differentiation;

obtaining a total loss function of the generated model by combining the semantic identification loss function, the contrast loss function and the large interval loss function

Wherein alpha and beta are balance factors for balancing

L in (1) _sd 、L _c And L _lm 。

8. The semantic identification-based cross-modal retrieval method of claim 7, wherein the building of the identification model of the image modality and the text modality comprises the following steps:

constructing a modality classifier using a sub-network having two layers and using it as a competitor;

assigning an independent modality label vector to each item in each instance;

the penalty-fighting function was constructed as:

wherein the content of the first and second substances,

is a generated feature representation

The probability of the mode shape of (a),

is a generated feature representation

Modal probability of θ _A Parameters representing a modality classifier, g _u Is example o _u Each item in _u Or t _u The true modality tag of (1).

9. The cross-modal search method based on semantic identification according to claim 8, wherein the training of the network to obtain the features of the query sample and the features of the samples in the search sample set according to the countermeasure between the generative model and the identification model comprises the steps of:

obtaining optimal feature representation by minimizing loss functions of generative model and discriminant model jointly, i.e. using min-max game for following two concurrent sub-processes

And

optimizing;

the feature vector of a query sample of an image modality is noted

The feature vector of a query sample of the text modality is noted

The text modal search sample set is characterized by

Wherein the content of the first and second substances,

representing the number of samples in the search sample set;

and

and

the modality specific feature representation and the modality sharing feature representation of the samples in the image modality retrieval sample set are respectively as follows:

and

and

and

and

10. a cross-modal retrieval system based on semantic discrimination, comprising:

a processor;

a memory;

and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the method of any of claims 1-9.