CN114048295A

CN114048295A - Cross-modal retrieval method and system for data processing

Info

Publication number: CN114048295A
Application number: CN202111128176.1A
Authority: CN
Inventors: 冯爱民; 王鸿飞; 刘学军
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2022-02-15

Abstract

The embodiment of the invention discloses a cross-modal retrieval method for data processing, relates to the field of image-text data processing, and can reduce information loss while having strong cross-modal similarity. The invention comprises the following steps: and inputting sample data to be processed into a pre-processing module with a specific mode, extracting features, and obtaining feature vector information. The resulting feature vector information is mapped to a common subspace through modality-specific sub-networks. And mapping the public representation in the public subspace to a semantic subspace through a first network. And constructing a mode discriminator for the public representation in the public subspace through a second network, and distinguishing the original mode of each public representation by using the constructed mode discriminator. The invention is suitable for mutual retrieval between the image-text fields.

Description

Cross-modal retrieval method and system for data processing

Technical Field

The invention relates to the field of image-text data processing, in particular to a cross-modal retrieval method and a cross-modal retrieval system for data processing.

Background

Cross-modal retrieval refers to retrieving one type of data as a query and another type of related data. Its flexibility to retrieve across different modalities (such as images and text) has been a major concern in both academia and industry. Computing correlations between multimodal data is a core goal for cross-modality retrieval. However, the potential heterogeneity between modalities leads to incompatibilities, so the key to implementing cross-modality retrieval is how to span the heterogeneous gap between different modalities.

A common method to eliminate cross-modal differences is characterization learning, where data from different modalities is transformed into a common subspace where similarity measures can be performed directly by learning a modality-specific transfer function. However, in the existing scheme, only a part of information in the data set is usually focused in the process of learning conversion, and information loss exists in different degrees on the design of an objective function, so that the performance of the model is limited.

Therefore, how to reduce information loss while having strong cross-modal similarity is a subject to be studied.

Disclosure of Invention

Embodiments of the present invention provide a cross-modal retrieval method and system for data processing, which can reduce information loss while having strong cross-modal similarity.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

in one aspect, a cross-modal retrieval method for data processing is provided, including:

and inputting the unprocessed images and texts serving as input samples into respective modality-specific preprocessing modules for feature extraction to respectively obtain original high-dimensional image feature vectors and text feature vectors.

The obtained original feature vectors of the two modes are nonlinearly mapped to a common subspace through respective mode-specific sub-networks, a cross-mode information aggregation constraint is provided based on the common subspace, and global information and fine-grained information in a data set are aggregated under the premise that absolute distance and relative distance are considered at the same time.

The common representation in the common subspace is further mapped to a semantic subspace through a first network, and a semantic constraint is proposed based on potential associations between vector representations in the semantic subspace and sample labels.

And further constructing a mode discriminator by the public representation in the public subspace through a second network, and proposing a mode constraint based on the mode discriminator to distinguish the original modes of each public representation, wherein the mode constraint is opposite to the optimization target of the cross-mode information aggregation constraint, and the mode constraint and the cross-mode information aggregation constraint are mutually confronted by means of mode characteristic information to introduce confrontation learning for the model.

In another aspect, a cross-modal retrieval system for data processing is provided, comprising:

the preprocessing module is used for inputting sample data to be processed into the mode-specific preprocessing module, performing feature extraction and obtaining feature vector information, wherein the sample data to be processed comprises a sample pair consisting of image data and text data, and the obtained feature vector information comprises: image feature vectors and text feature vectors;

the processing module is used for mapping the obtained feature vector information to a public subspace through a mode-specific sub-network, wherein cross-mode information aggregation constraint is obtained on the basis of the public subspace; mapping the public representation in the public subspace through a first network to obtain a semantic subspace, wherein potential association is established between the vector representation in the semantic subspace and the sample label, and the potential association corresponds to a semantic constraint; constructing a mode discriminator for the public representation in the public subspace through a second network, and distinguishing the original mode of each public representation by using the constructed mode discriminator; storing the result of distinguishing the original mode of each public representation into a database module;

the database module is used for storing the output modal result of the processing module;

and the query feedback module is used for receiving a query item sent by the terminal equipment, converting the query item into a public representation, then querying a modal result stored in the database to obtain a public representation which is most similar to the converted public representation and comes from another modality, and feeding back the query result to the terminal equipment. The invention adopts double constraint of two subspaces and antagonistic learning to reduce the information loss in the cross-modal process to the maximum extent, and generates the public representation with stronger cross-modal similarity and semantic distinguishability, thereby realizing that the information loss is reduced while the stronger cross-modal similarity is achieved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram of a possible implementation manner of a cross-modal search model according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a possible implementation manner of the first model according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a possible implementation manner of the second model according to an embodiment of the present invention.

Fig. 4 is a flowchart illustrating a method according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The embodiment of the invention provides a cross-modal retrieval method for data processing, in particular to a scheme improvement of a cross-modal retrieval technology, and the main design idea is as follows: as shown in fig. 1, the information loss in the cross-modal process is minimized by adopting the dual constraint and antagonistic learning of two subspaces, and a common representation with stronger cross-modal similarity and semantic distinctiveness is generated. The main method flow is shown in fig. 1 and 4, and comprises the following steps:

and S1, inputting the sample data to be processed into a mode-specific preprocessing module, extracting features, and obtaining feature vector information.

The mode-specific preprocessing module is used for adopting specific different preprocessing models aiming at specific modes. For data of an image modality, preprocessing is carried out by adopting a VGG network, and for data of a text modality, preprocessing is carried out by adopting a BoW network. See figure 2 in particular. The VGG network is a neural network which is specially used for extracting feature information in an image in deep learning, and the effect is very good. BoW is also a common model that deals specifically with text.

The preprocessing module mentioned in this embodiment can be implemented as a code program in practical application, as shown in fig. 2, and we adopt different models to perform feature extraction on original pictures and characters according to different modalities. The VGG network is a neural network specially used for processing pictures in deep learning, has a very good effect, is widely applied to the field of artificial intelligence, and is accepted by people, and the VGG network is taken as a preprocessing module of an image mode. BoW is also a common, widely used model that deals specifically with text, and we consider it as a pre-processing module for the text modality.

The sample data to be processed comprises a sample pair consisting of image data and text data, and the obtained feature vector information comprises: image feature vectors and text feature vectors. For example: unprocessed images and texts can be used as input sample pairs and input into respective mode-specific preprocessing modules for feature extraction, so as to respectively obtain original high-dimensional image feature vectors and text feature vectors. It should be noted that there is no specific standard for the height, and the height referred to in this embodiment is generally thousands of dimensions, the low latitude is generally hundreds of dimensions, and the height is relative. The output of both BOW and VGG are three-four-thousand dimensions, while the common subspace dimension is hundreds.

S2, the obtained feature vector information is mapped to a common subspace through a modality-specific sub-network.

And acquiring cross-modal information aggregation constraint based on the common subspace correspondence. For example: the obtained original feature vectors of the two modes can be nonlinearly mapped to a common subspace through respective mode-specific sub-networks, a cross-mode information aggregation constraint is provided based on the common subspace, and global information and fine-grained information in a data set are aggregated under the premise that absolute distance and relative distance are considered at the same time.

Specifically, the cross-modal information aggregation constraint refers to: a common representation based on a common subspace proposes a loss function (also called search loss by some scholars), whose formula is L_r. The part is called cross-modal information aggregation because the retrieval loss of the part aggregates global information and fine-grained information in a data set, the part is more visual and more visual than the retrieval loss, and the retrieval loss is a specific implementation function which is short for name or cross-modal information aggregation constraint. Semantic constraints, which refer to: similar to the above, a semantic loss function is proposed based on vector representation in the semantic subspace, the function formulation being L_s. Modal constraints, meaning: the loss function of the modal discriminator is called modal loss, and the formula is L_d. In actual design, each constraint specifically corresponds to a loss function.

And S3, mapping the common representation in the common subspace to a semantic subspace through a first network.

Wherein a potential association is established between the vector representation in the semantic subspace and the sample label, and the potential association corresponds to the semantic constraint. For example: the common representation in the common subspace may be further mapped to a semantic subspace through the first network, a semantic constraint being proposed based on potential associations between vector representations in the semantic subspace and sample labels. Specifically, the common subspace is obtained from S2, for example: at s2, the image feature vector (4096 dimensions, for example) is input to the image sub-network, and a 200-dimensional vector is output via the network mapping. The 200-dimensional vector can be regarded as a point in a 200-dimensional space, and the vector is the coordinates of the point. So, via step S2, all feature vectors map to a point in the 200-dimensional space, which is called the common subspace because both the image and the text are in this space, and each point in the space (i.e., each 200-dimensional vector) is called the common representation. First and second networks as shown in fig. 1 and 3, the first network is composed of a fully connected layer, and the output vector can be regarded as a point in a space (the space is called a semantic subspace) according to the reason of the common subspace. The second network consists of three fully connected layers. The modality discriminator is an alternative name that I have worked according to the function of the second network, and its output is not a vector but a value (scalar) that can be used to discriminate the modality.

And S4, constructing a mode discriminator through a second network for the common representation in the common subspace, and distinguishing the original modes of each common representation by using the constructed mode discriminator.

This embodiment works similar to a conventional search engine (e.g., a photo search engine), and when a query term is input, the model converts the query term into a common representation, and then the model searches, for example, a database for a common representation from another modality that is most similar to the common representation, and then returns the corresponding query result to the user. For example, the user enters the word cat and the model returns her pictures of some cats. The user enters a picture and the model returns her some reports or descriptions about this picture. The common representation in the common subspace can further pass through a second network to construct a mode discriminator, a mode constraint is proposed based on the mode discriminator to distinguish the original modes of each common representation, the optimization goal of the mode-crossing information aggregation constraint is opposite, and the mode constraint and the original modes are mutually confronted by means of mode characteristic information to introduce confrontation learning for the model. The invention adopts double constraint of two subspaces and antagonistic learning to reduce information loss in the cross-modal process to the maximum extent, and generates public representation with stronger cross-modal similarity and semantic distinguishability.

In this embodiment, step S1 includes: feature extraction is performed on the image data by a VGG-19 network, where 4096-dimensional vectors output in fc7 layer are obtained as input to the image sub-network in the second model. Processing the text data by a bag of words model (BoW) and generating high-dimensional text feature vectors as input for the text sub-network in the second model. For example: as shown in fig. 2, the unprocessed image and text are used as input samples to be input into respective modality-specific preprocessing modules for feature extraction, and the original high-dimensional image feature vector and text feature vector are obtained as input of a subsequent sub-network. Although it is the first half of the whole cross-modal search model, it does not participate in the training of the whole model, and it is essentially a data preprocessing process. Specifically, for the unprocessed image sample, the present embodiment performs feature extraction using the pre-trained VGG-19 network, obtaining 4096-dimensional vectors output in its fc7 layer as input to the image sub-network in the second model. The unprocessed text sample is passed through the well-known bag of words model (BoW) to generate the original high-dimensional text feature vectors as input to the text sub-network in the second model.

Specifically, in the construction process of the common subspace, the obtained original feature vectors of the two modes can be nonlinearly mapped to the common subspace through respective mode-specific sub-networks. For example: the image sub-network and the text sub-network are respectively composed of three layers of fully-connected neural networks, and the fully-connected neural networks have abundant parameters and have enough capacity in the sub-networks to realize the complex conversion. To better transition from different modes to the same subspace, the present embodiment employs weight sharing at the last level of the two subnetworks. The embodiment provides a cross-modal information aggregation constraint based on a public subspace, for simplicity, the embodiment refers to the constraint as retrieval loss, and aggregates global information and fine-grained information in a data set on the premise of simultaneously considering an absolute distance and a relative distance, so that information loss in a cross-modal process is greatly reduced, and cross-modal similarity of public representation is achieved.

In this embodiment, step S2 includes:

and mapping the obtained image feature vector and the obtained text feature vector to the public subspace in a nonlinear way through respective modality-specific sub-networks, wherein the image sub-network and the text sub-network are respectively composed of three layers of fully-connected neural networks, and a retrieval loss model utilized in the mapping process is composed of three sub-items.

Constructing a first sub-item by using the center loss of the triple, wherein the triple is

Wherein t is_qIs a term of a text query that,

is a positive class center and is associated with a text query term t_qThe label is of the same category as the label,

is a negative class center and is associated with the text query term t_qThe label categories are not the same.

The triad center loss is:

wherein N is₁Denotes the total number of triplets, m₁Represents an adjustable threshold value, i₁、i₂And i₃The representations represent different class centers, respectively, with 1, 2 and 3 in the subscript merely to distinguish that this is a different class center, i indicates that the item belongs to an image modality, and t indicates that the item belongs to a text modality. i is image and t is text.

For example: the specific design mode for searching loss comprises the following steps: the search loss is composed of three sub-items, which are respectively in the form of triples, quadruples and pairs, and aggregates global information and fine-grained information in the data set. Wherein: step 1, constructing a first sub-item, namely a triple center loss, and then explaining by taking a text query image as an example: a triple is defined as

Wherein t is_qIs a term of a text query that,

and

are two different image class centers. This embodiment is called

Class-one center, which is associated with the text query term t_qThe labels are of the same type, this embodiment is called

Is a negative class center, which is associated with the text query term t_qThe label categories are not the same. The triplet center loss is then defined as follows:

wherein N is₁Represents the total number of triplets, m₁Is an adjustable threshold value, [ A ]]₊Max (0, a) is the hinge function, d (a, b) | | | a-b | | non-calculation₂Representing the euclidean distance. The above formula states that the distance of a query term to the corresponding positive class center is less than the distance of the query term to any one of the negative class centers. This sub-term takes into account the absolute distance under the same anchor point and introduces global information for the model using the sample to class center relationship.

The second sub-item is constructed by loss of the center of the quadruple, wherein the quadruple is

Is and

different negative class centers, the quadruple center loss is:

wherein N is₂Denotes the total number of quadruplets, m₂Is another adjustable threshold. For example: and 2, constructing a second sub item, namely the loss of the quadruple center. A quadrupleFormally defined as

And

are any two different negative class centers. Then, the quad center loss is defined as follows:

wherein N is₂Denotes the total number of quadruplets, m₂Is another adjustable threshold. The above formula states that the distance between a query term to a positive class center is less than the distance between any two different negative class centers. The sub-entry takes into account the relative distances under different anchor points and introduces another part of global information in the data set for the model, as a supplement to the triple center loss.

And (3) establishing constraint on the global level by utilizing the triple center loss and the quadruple center loss:

wherein σ_IAnd σ_TThe weight parameters of the image sub-network and the text sub-network, respectively. For example: step 3, by combining the two formulas, this embodiment obtains a constraint on a global level, and the formulas are defined as follows:

wherein σ_IAnd σ_TThe weight parameters of the image sub-network and the text sub-network, respectively. The above formula enables the model to construct a common subspace by using global information, and greatly reduces the difficulty of model update caused by huge difference between sample pairs. All in oneIn this way, the embodiment can obtain the image query text

Constructing a third sub-item:

wherein the content of the first and second substances,

e is a matrix of indicators, i_jRepresenting the j-th image sample and t_kThe kth text sample is represented, j and k respectively represent positive integers and respectively represent, i represents that the item belongs to an image modality, and t represents that the item belongs to a text modality. i is image and t is text. J and k represent sample numbers, respectively. So i_jRepresenting the j-th image sample, t_kKth text sample, E_jkFor representing i_jAnd t_kWhether the categories of (1) are the same or not, and if so, E_jk1, otherwise E_jk0. For example: step 4, constructing a third sub-item, wherein the sub-item is a constraint on a local level based on a sample pair, and a formula is defined as:

wherein the content of the first and second substances,

e is an indicator matrix, E_jkRepresents i_jAnd t_kIs the same. If two samples belong to the same class, E_jk1, otherwise E_jk0. Equation 4 makes the cosine values between samples of the same class as large as possible, while the cosine values between samples of different classes are as small as possible.

And constructing a complete retrieval loss according to the constrained third sub-item on the global level:

where γ is a hyperparameter. For example: step 5, by combining the formulas of step 3 and step 4, the present embodiment obtains a complete search loss, and the formula is defined as follows:

where γ is a hyperparameter. The formula effectively reduces information loss by aggregating constraints on different levels of hierarchy, and cross-modal similarity in common representation.

In this embodiment, in step S3, the first network is formed by a layer of fully-connected neural network, where the semantic constraint is:

wherein σ_sA network parameter indicative of the first network,

representing a vector representation in a semantic subspace, do representing the number of classes of samples in a training dataset, R^doA vector space representing the do dimension,

a label vector representing the corresponding sample. For example: the common representation in the common subspace may be further mapped to a semantic subspace through a first network, based on which a semantic constraint is proposed that exploits the potential semantic association between the sample labels and the vector representation of the semantic subspace to optimize the distribution of the common subspace. As in fig. 3, the first network consists of a layer of fully-connected neural networks. The semantic constraint formula is expressed as follows:

wherein sigma_sIs the network parameter of the portion that is,

is a vector representation in a semantic subspace,

is the label vector of the corresponding sample. The formula not only has semantic discriminability of public representation, but also has a certain regularization effect on the generation process of the public representation.

In this embodiment, in step S4, the mode loss of the mode discriminator is established:

f_bce(x,c)＝c(x)log(p(f_T(t_x；σ_T)))+(1-c(x))log(p(f_I(i_x；σ_I)))

wherein f is_bceRepresenting a binary cross-entropy loss function for modal classification, c (-) represents a modal indicator, c (x) 1 when input x represents text, otherwise c (x) 0, p (-) represents the probability of each mode generated by the input, σ_dRepresenting the parameters of the arbiter. For example: the common representations in the common subspace may further be passed through a second network to construct a modality discriminator, on the basis of which the present embodiment proposes a modality constraint to distinguish the original modalities of each common representation, with the goal of distinguishing as many as possible of the original modalities of the common representation. However, the cross-modal aggregation constraint aims to generate a common representation with cross-modal similarity, which is contrary to the purpose of the modal discriminator. Therefore, the two games play a maximum and minimum game as competitors. For simplicity, the penalty of the discriminator is referred to as the modal penalty, and the modal penalty formula is defined as follows:

f_bce(x,c)＝c(x)log(p(f_T(t_x；σ_T)))+(1-c(x))log(p(f_I(i_x；σ_I)))

wherein f is_bceIs a two-class cross-entropy loss function for modal classification. c (-) is a modal indicator, c (x) 1 when input x is text, otherwise c (x) 0. p (-) is the probability, σ, of each mode of input generation_dIs a parameter of the discriminator. By maximising

The performance and robustness of the model are further improved.

Further, in this embodiment, the modal discriminator may be optimized by an Adam algorithm, wherein in the optimization process, the maximum and minimum game is performed by two parallel sub-processes, including:

specifically, an Adam algorithm may be employed that approximates the true gradient of the mini-batch to update the model. The process of learning the best tokens is to jointly minimize search loss, semantic loss, and modal loss. Since the optimization goals of the search penalty and the modal discriminator are opposite, the process operates as the maximum and minimum game of two parallel sub-processes with the following formula:

the training of the model is actually a process of continuously optimizing k generation processes and one discrimination process until the result of the whole model converges. As with all antagonistic learning methods, the parameters of the "generator" are fixed during the training phase of the "discriminator" and vice versa. As can be seen from FIG. 3, the method of the present invention can greatly reduce the information loss in the cross-modal process, generate a public representation with stronger cross-modal similarity and semantic discrimination, and effectively improve the precision of cross-modal retrieval.

The advantages of this embodiment are: information loss is greatly reduced through cross-modal information aggregation constraints, and stronger cross-modal similarity is achieved in generating common representations. Semantic constraints utilize semantic discriminative properties of semantic information in the sample labels in a common representation. The modal constraint further reduces information loss by using modal fixed information and enhances the robustness of the model.

It should be noted that this embodiment is not a simple calculation method, but can be applied to a retrieval system and assist in improving a search engine. For example, in practical applications, the method of this embodiment may be applied to a system as shown in fig. 5, including:

the preprocessing module is used for inputting sample data to be processed into the mode-specific preprocessing module, performing feature extraction and obtaining feature vector information, wherein the sample data to be processed comprises a sample pair consisting of image data and text data, and the obtained feature vector information comprises: image feature vectors and text feature vectors.

The processing module is used for mapping the obtained feature vector information to a public subspace through a mode-specific sub-network, wherein cross-mode information aggregation constraint is obtained on the basis of the public subspace; mapping the public representation in the public subspace through a first network to obtain a semantic subspace, wherein potential association is established between the vector representation in the semantic subspace and the sample label, and the potential association corresponds to a semantic constraint; constructing a mode discriminator for the public representation in the public subspace through a second network, and distinguishing the original mode of each public representation by using the constructed mode discriminator; and storing results of distinguishing the original modality of each common representation to the database module.

And the database module is used for storing the output modal result of the processing module.

And the query feedback module is used for receiving a query item sent by the terminal equipment, converting the query item into a public representation, then querying a modal result stored in the database to obtain a public representation which is most similar to the converted public representation and comes from another modality, and feeding back the query result to the terminal equipment.

In particular, the embodiment is suitable for mutual retrieval between the image-text fields, that is, the query term is converted into a common representation through the trained model, and the model returns the query result from another modality by further measuring the similarity between the query term and other common representations. For example: working in a manner similar to search engines commonly used today, when a query term is entered, the model will convert the query term into a common representation first, and then the model will look up the common representation from another modality that is most similar to the common representation in, for example, a database, and then return the corresponding query results.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A cross-modal retrieval method for data processing, comprising:

s1, inputting sample data to be processed into a mode-specific preprocessing module, performing feature extraction, and obtaining feature vector information, wherein the sample data to be processed comprises a sample pair consisting of image data and text data, and the obtained feature vector information comprises: image feature vectors and text feature vectors;

s2, mapping the obtained feature vector information to a public subspace through a mode-specific sub-network, wherein cross-mode information aggregation constraint is obtained on the basis of the public subspace;

s3, mapping the public representation in the public subspace to a semantic subspace through a first network, wherein potential association is established between the vector representation in the semantic subspace and the sample label, and the potential association corresponds to semantic constraints;

2. The method according to claim 1, further comprising, after step S4:

after receiving a query item sent by a terminal device, converting the query item into a public representation;

querying the modality results stored in the database and obtaining a common representation that is most similar to the converted common representation and is from another modality;

and feeding back the query result to the terminal equipment.

3. The method according to claim 1, wherein step S1 includes:

performing feature extraction on the image data through a VGG-19 network, wherein 4096-dimensional vectors output in an fc7 layer are obtained as input of an image sub-network in the second model;

processing the text data by a bag of words model (BoW) and generating high-dimensional text feature vectors as input for the text sub-network in the second model.

4. The method according to claim 1, wherein step S2 includes:

5. The method of claim 4, further comprising:

Wherein t is_qIs a term of a text query that,

is a negative class center and is associated with the text query term t_qThe label categories are different;

the triad center loss is:

wherein N is₁Denotes the total number of triplets, m₁Represents an adjustable threshold value, i₁、i₂And i₃Respectively represent different class centers;

Is and

different negative class centers, the quadruple center loss is:

wherein N is₂Denotes the total number of quadruplets, m₂Is another adjustable threshold;

wherein σ_IAnd σ_TThe weight parameters of the image sub-network and the text sub-network, respectively.

6. The method of claim 5, further comprising:

constructing a third sub-item:

wherein the content of the first and second substances,

e is a matrix of indicators, i_jRepresenting the j-th image sample and t_kRespectively represent the kth text sample, j and k respectively represent positive integers, E_jkFor representing i_jAnd t_kWhether the categories of (1) are the same or not, and if so, E_jk1, otherwise E_jk＝0；

where γ is a hyperparameter.

7. The method according to claim 1, wherein in step S3, the first network is composed of a layer of fully-connected neural networks, wherein the semantic constraints are:

wherein σ_sA network parameter indicative of the first network,

representing a vector representation in a semantic subspace, do representing the number of classes of samples in a training dataset, R^doVector space, representing do dimensions

A label vector representing the corresponding sample.

8. The method according to claim 1, wherein in step S4, a modal loss of the modal discriminator is established:

f_bce(x,c)＝c(x)log(p(f_T(t_x；σ_T)))+(1-c(x))log(p(f_I(i_x；σ_I)))

wherein f is_bceRepresenting a binary cross-entropy loss function for modality classification, c (-) represents a modality indicator, c (x) ═ 1 when input x represents text, otherwise c (x) ═ 0, p (-) representsInputting the probability, σ, of each modal state generated_dRepresenting the parameters of the arbiter.

9. The method of claim 8, further comprising:

optimizing the modal discriminator by an Adam algorithm, wherein in the optimization process, the maximum and minimum game is carried out by two parallel sub-processes, comprising:

10. a cross-modal retrieval system for data processing, comprising: