CN115309930A - Cross-modal retrieval method and system based on semantic identification - Google Patents
- Publication number: CN115309930A (application CN202210875146.5A)
- Authority: CN (China)
- Prior art keywords: modality, sample, semantic, text, modal
- Legal status: Pending (an assumption based on the listed status, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Classifications
- G06F16/583 — Retrieval of still image data characterised by using metadata automatically derived from the content
- G06F16/332 — Query formulation (unstructured textual data)
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3347 — Query execution using a vector-based model
- G06F16/353 — Clustering; classification into predefined classes (textual data)
- G06F16/532 — Query formulation, e.g. graphical querying (still image data)
- G06F16/55 — Clustering; classification (still image data)
- G06N3/08 — Neural network learning methods
Abstract
The invention discloses a cross-modal retrieval method and system based on semantic identification. The method comprises the following steps: acquiring an image feature space, a text feature space and semantic category labels; establishing modality-specific and modality-shared feature representations for the image and text modalities; establishing a generative model according to inter-modal and intra-modal semantic similarity; constructing an identification model for the image and text modalities; training the network under the adversarial mechanism between the generative model and the identification model and solving for the network parameters; obtaining the features of the query sample and of the samples in the retrieval sample set from the network parameters; calculating the Euclidean distance from the query sample to each sample in the retrieval sample set; and retrieving the query sample with a cross-modal retriever. The invention solves the problem in the related art that cross-modal retrieval efficiency is reduced because complementary information of multi-modal data cannot be effectively mined.
Description
Technical Field
The invention belongs to the technical field of multimedia, and particularly relates to a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification.
Background
On the internet and in daily life, people face massive data in multimodal forms such as images, text, video and audio. The existence of huge multimodal databases has greatly stimulated the need for cross-modal retrieval in search engines and digital libraries, such as searching for relevant images with a text query, or for relevant videos with an audio query. Unlike conventional single-modality retrieval tasks (e.g., image retrieval), which require the query sample and the retrieval results to belong to the same modality, cross-modal retrieval is a more flexible application: a query in any modality can be used to find relevant information in a different modality.
Since data of different modalities usually have inconsistent distributions and representations, their similarity cannot be measured directly. To address this problem, many cross-modal retrieval methods have emerged. Traditional methods mainly mine the correlation of different modality data by learning linear projections, for example methods based on canonical correlation analysis. With the rapid development of deep learning, methods based on Deep Neural Networks (DNNs) have become the mainstream approach to bridging the modality gap. However, most existing methods focus on mining modality-shared information: data of different modalities are mapped to a common space to obtain a common representation, while the mining and use of modality-specific information is not considered. As a result, the complementary information of multi-modal data cannot be effectively mined, and cross-modal retrieval efficiency is reduced.
In order to effectively mine the complementary information of multi-modal data and to close the modality gap between paired features from different modalities, a cross-modal retrieval method and system based on semantic identification are provided.
Disclosure of Invention
The embodiment of the invention provides a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification, which at least solve the problem of cross-modal retrieval efficiency reduction caused by the fact that complementary information of multi-modal data cannot be effectively mined in the related technology.
According to an embodiment of the present invention, there is provided a cross-modal retrieval method based on semantic identification, including:
acquiring an image feature space, a text feature space and semantic category labels;
establishing modality-specific feature representations and shared feature representations for the image modality and the text modality;
establishing a generative model according to inter-modal and intra-modal semantic similarity;
constructing an identification model for the image modality and the text modality;
training the network according to the adversarial mechanism between the generative model and the identification model and solving for the network parameters;
acquiring the features of the query sample and the features of the samples in the retrieval sample set according to the network parameters;
calculating the Euclidean distance from the query sample to each sample in the retrieval sample set;
retrieving the query sample using a cross-modal retriever.
In an exemplary embodiment, the obtaining of the image feature space, the text feature space and the semantic class labels includes the steps of:
the N image-modality sample features of the training data set are obtained as I = [i_1, …, i_n, …, i_N] and the N text-modality sample features as T = [t_1, …, t_n, …, t_N], where i_n is the n-th sample feature in the image-modality sample feature set I and t_n is the n-th sample feature in the text-modality sample feature set T;

determining the instance set of paired image-modality and text-modality features O = {o_n}_{n=1}^N, where each instance o_n = (i_n, t_n) comprises an image-modality feature vector i_n ∈ R^{d_i} and a text-modality feature vector t_n ∈ R^{d_t}; d_i and d_t denote the feature dimensions of the image and text modalities, respectively, and d_i ≠ d_t;

defining the semantic class label l_n = [l_{n1}, …, l_{nc}, …, l_{nC}]^T to represent the class of the n-th sample, where C is the total number of sample classes; l_{nc} = 1 if the n-th sample belongs to class c, and l_{nc} = 0 otherwise.
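The label definition above can be illustrated with a small NumPy sketch (the helper name `one_hot_labels` is ours, not from the patent):

```python
import numpy as np

def one_hot_labels(classes, C):
    """Row n is the one-hot semantic label l_n with l_nc = 1 iff sample n
    belongs to class c, for C classes in total."""
    L = np.zeros((len(classes), C), dtype=int)
    L[np.arange(len(classes)), classes] = 1
    return L

labels = one_hot_labels([0, 2, 1], C=3)   # three samples, classes 0, 2, 1
```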
In an exemplary embodiment, the establishing modality-specific feature representations and the shared feature representations of the image modality and the text modality includes the steps of:
learning the modality-specific feature representations of the image and text modalities with two feed-forward subnetworks, i.e., using a three-layer subnetwork in each modality to learn S_I = f_I(I; θ_I) and S_T = f_T(T; θ_T), where f_I(·; θ_I) and f_T(·; θ_T) are the mapping functions of the image modality and the text modality, and θ_I and θ_T are the parameters of the corresponding three-layer subnetworks;

using a common subnetwork to learn a modality-shared feature representation for each modality, i.e., mapping the intermediate representations I_f and T_f into the shared space through a shared two-layer subnetwork, learning C_I = f_s(I_f; θ_s) and C_T = f_s(T_f; θ_s), where f_s(·; θ_s) is the shared mapping function, θ_s are the parameters of the shared two-layer subnetwork, and the shared dimension satisfies d_f = d_s.
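The subnetwork structure described above can be sketched as untrained NumPy forward passes. All names (`make_net`, `forward`) and the random placeholder weights are illustrative assumptions; the layer widths follow the embodiment described later ([1024, 512, 128] per modality and a shared [128, 128] subnetwork):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(d_in, widths, rng):
    """Random (untrained) weight matrices for a small fully-connected net."""
    Ws, d = [], d_in
    for d_out in widths:
        Ws.append(rng.standard_normal((d, d_out)) * 0.01)
        d = d_out
    return Ws

def forward(x, Ws):
    """ReLU MLP forward pass."""
    for W in Ws:
        x = np.maximum(x @ W, 0.0)
    return x

d_i, d_t, N = 4096, 300, 8                     # illustrative feature dims
I = rng.standard_normal((N, d_i))              # image-modality features
T = rng.standard_normal((N, d_t))              # text-modality features

f_I = make_net(d_i, [1024, 512, 128], rng)     # image-specific subnetwork
f_T = make_net(d_t, [1024, 512, 128], rng)     # text-specific subnetwork
f_s = make_net(128, [128, 128], rng)           # shared subnetwork (one weight set)

S_I, S_T = forward(I, f_I), forward(T, f_T)    # modality-specific representations
C_I, C_T = forward(S_I, f_s), forward(S_T, f_s)  # modality-shared representations
```

Because the same `f_s` weight list is applied to both modalities, C_I and C_T land in a common 128-dimensional space, which is the property the adversarial training then exploits.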
In an exemplary embodiment, the creating a generative model based on semantic similarity between modalities and within modalities includes the steps of:
using a forward single-layer subnetwork as a classifier to predict labels;
performing inter-modal and intra-modal semantic similarity modeling based on depth metric learning;
the modality-specific feature representations and the corresponding modality-sharing feature representations are distinguished and an overall loss function of the generative model is calculated.
In an exemplary embodiment, the label prediction using a forward single-layer subnetwork as a classifier comprises the steps of:
using a forward single-layer subnetwork activated by Softmax as the classifier, so that when the input is a modality-specific feature S_u or a modality-shared feature C_u, the corresponding semantic-class probability distribution p(S_u) or p(C_u) is output;

based on the probability distribution, the semantic discrimination loss function is constructed as L_sd = −(1/U) Σ_{u=1}^{U} l_u^T [ log p(S_u^I) + log p(C_u^I) + log p(S_u^T) + log p(C_u^T) ], where υ denotes the parameters of the classifier and U the number of instances in each mini-batch.
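A minimal sketch of this classifier and loss, assuming a plain linear Softmax layer and the standard cross-entropy form over one stream of generated features (the function names are ours; the patent's exact formula is not legible in this copy):

```python
import numpy as np

def softmax(z):
    """Row-wise Softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def semantic_discrimination_loss(feats, labels, W):
    """Mean cross-entropy -1/U * sum_u l_u^T log p_u of a single linear
    Softmax classifier applied to one stream of generated features."""
    p = softmax(feats @ W)
    return -np.sum(labels * np.log(p + 1e-12)) / feats.shape[0]

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 16))        # e.g. modality-specific features
labels = np.eye(3)[[0, 1, 2, 0]]            # one-hot semantic labels
W = rng.standard_normal((16, 3)) * 0.01     # classifier parameters (upsilon)
loss = semantic_discrimination_loss(feats, labels, W)
```

In the full model the same loss would be summed over the four representation streams (specific and shared, image and text).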
In an exemplary embodiment, the depth metric learning-based inter-modal and intra-modal semantic similarity modeling includes the steps of:
calculating the Euclidean distance between features to measure their similarity, with the requirement that the similarity of features sharing the same semantic class is enhanced while the similarity of features of different semantic classes is reduced;

establishing the contrastive loss function L_c = (1/|E|) Σ_{(u,v)∈E} d_c(u,v)^2 + (1/|D|) Σ_{(u,v)∈D} h(τ − d_c(u,v))^2, where h(x) = max(0, x) is the hinge loss function, E = {(u,v)} is the set of paired indices of feature pairs in each mini-batch having the same semantic label, D = {(u,v)} is the set of paired indices of feature pairs having different semantic labels, |E| and |D| denote the sizes of E and D, respectively, and τ is a positive threshold.
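The contrastive objective can be sketched as follows. For brevity the sketch operates on a single stream of feature vectors rather than the patent's combined distance d_c over all representation streams, and `tau` plays the role of the positive threshold τ:

```python
import numpy as np

def hinge(x):
    """h(x) = max(0, x), elementwise."""
    return np.maximum(0.0, x)

def contrastive_loss(feats, labels, tau):
    """Pull same-label pairs (set E) together and push different-label
    pairs (set D) at least tau apart, as in the loss above."""
    pos, neg = [], []
    for u in range(len(feats)):
        for v in range(u + 1, len(feats)):
            d = np.linalg.norm(feats[u] - feats[v])
            (pos if labels[u] == labels[v] else neg).append(d)
    l_pos = float(np.mean(np.square(pos))) if pos else 0.0
    l_neg = float(np.mean(np.square(hinge(tau - np.array(neg))))) if neg else 0.0
    return l_pos + l_neg

feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
labels = [0, 0, 1]
loss = contrastive_loss(feats, labels, tau=1.0)
```

Here the two same-class samples contribute their squared distance, while the far-apart cross-class pairs already exceed the margin and contribute nothing.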
In an exemplary embodiment, the distinguishing modality-specific feature representations and corresponding modality-sharing feature representations and calculating a total loss function of the generative model comprises the steps of:
using the large-margin loss function L_lm = (1/U) Σ_{u=1}^{U} [ h(ζ − ‖S_u^I − C_u^I‖_2)^2 + h(ζ − ‖S_u^T − C_u^T‖_2)^2 ] to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, so that the complementary information in different modality data is learned effectively, where h(x) = max(0, x) and ζ is a positive threshold;

the large-margin loss applies the threshold ζ to the distances ‖S_u^I − C_u^I‖_2 and ‖S_u^T − C_u^T‖_2, requiring both distances to be greater than ζ, so that the modality-specific feature representation is distinguished from the modality-shared one.

Combining the semantic discrimination loss function, the contrastive loss function and the large-margin loss function yields the total loss function of the generative model, L_gen = L_sd + α·L_c + β·L_lm, where α and β are balance factors that weight L_sd, L_c and L_lm within L_gen.
In an exemplary embodiment, the constructing of the identification model for the image modality and the text modality includes the steps of:
constructing a modality classifier from a two-layer subnetwork and using it as the adversary;

assigning an independent modality label vector to each item of each instance;

constructing the adversarial loss function as L_adv = −(1/(2U)) Σ_{u=1}^{U} [ g_u^T log D(C_u^I; θ_A) + g_u^T log D(C_u^T; θ_A) ], where D(C_u^I; θ_A) is the modality probability of the generated feature representation C_u^I, D(C_u^T; θ_A) is the modality probability of the generated feature representation C_u^T, θ_A denotes the parameters of the modality classifier, and g_u is the true modality label of each item i_u or t_u of instance o_u.
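A minimal sketch of this adversarial objective, assuming a linear modality classifier and one-hot modality labels g (all names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adversarial_loss(shared_feats, g, W_A):
    """Cross-entropy of a two-class modality classifier on modality-shared
    features; the identification model minimizes it, while the generative
    model tries to make shared features indistinguishable across modalities."""
    p = softmax(shared_feats @ W_A)
    return -np.sum(g * np.log(p + 1e-12)) / shared_feats.shape[0]

rng = np.random.default_rng(2)
C_I = rng.standard_normal((4, 8))                      # shared image features
C_T = rng.standard_normal((4, 8))                      # shared text features
feats = np.vstack([C_I, C_T])
g = np.vstack([np.tile([1, 0], (4, 1)),                # true modality labels
               np.tile([0, 1], (4, 1))])
W_A = rng.standard_normal((8, 2)) * 0.01               # classifier parameters
loss = adversarial_loss(feats, g, W_A)
```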
In an exemplary embodiment, training the network according to the adversarial mechanism between the generative model and the identification model to obtain the features of the query sample and the features of the samples in the retrieval sample set comprises the steps of:
obtaining the optimal feature representations by jointly minimizing the loss functions of the generative model and the identification model, i.e., optimizing with a min-max game through two concurrent sub-processes: (θ̂_I, θ̂_T, θ̂_s, υ̂) = arg min (L_gen − L_adv) over the generative parameters, and θ̂_A = arg min L_adv over the identification parameters;

assume the feature vector of an image-modality query sample is i_q and that of a text-modality query sample is t_q; the features of the samples in the image-modality retrieval sample set are {i_m}_{m=1}^M and those of the text-modality retrieval sample set are {t_m}_{m=1}^M, where M denotes the number of samples in the retrieval sample set;

the modality-specific and modality-shared feature representations of the image-modality query sample are S_q^I = f_I(i_q; θ̂_I) and C_q^I = f_s(i_{q,f}; θ̂_s), respectively;

the modality-specific and modality-shared feature representations of the text-modality query sample are S_q^T = f_T(t_q; θ̂_T) and C_q^T = f_s(t_{q,f}; θ̂_s), respectively;

the modality-specific and modality-shared feature representations of the samples in the image-modality retrieval sample set are S_m^I = f_I(i_m; θ̂_I) and C_m^I = f_s(i_{m,f}; θ̂_s), for m = 1, …, M;

the modality-specific and modality-shared feature representations of the samples in the text-modality retrieval sample set are S_m^T = f_T(t_m; θ̂_T) and C_m^T = f_s(t_{m,f}; θ̂_s), for m = 1, …, M.
in one exemplary embodiment, the calculating the euclidean distance between the query sample and each sample in the retrieved sample set comprises the steps of:
for an image-modality query sample i_q, the distance from the query to each sample t_m in the text-modality retrieval sample set is computed with the distance formula d_{IT}(q, m) = ‖[S_q^I; C_q^I] − [S_m^T; C_m^T]‖_2, i.e., the Euclidean distance between the combined (concatenated) modality-specific and modality-shared representation of the query sample and that of the retrieval sample;

for a text-modality query sample t_q, the distance to each sample i_m in the image-modality retrieval sample set is computed analogously as d_{TI}(q, m) = ‖[S_q^T; C_q^T] − [S_m^I; C_m^I]‖_2, the Euclidean distance between the combined representations of the query and retrieval samples.
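Our reading of this distance step, sketched with NumPy (concatenation of the modality-specific and modality-shared vectors, then a Euclidean norm; the helper name is an assumption):

```python
import numpy as np

def combined_distance(S_q, C_q, S_r, C_r):
    """Euclidean distance between the concatenated modality-specific and
    modality-shared representations of a query and a retrieval sample."""
    return float(np.linalg.norm(np.concatenate([S_q, C_q])
                                - np.concatenate([S_r, C_r])))

S_q, C_q = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # query representations
S_r, C_r = np.array([1.0, 0.0]), np.array([0.0, 0.0])  # retrieval-sample reps
d = combined_distance(S_q, C_q, S_r, C_r)
```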
In one exemplary embodiment, the retrieving a query sample using a cross-modality retriever includes the steps of:
sorting the computed distances d_{IT}(q, m) in ascending order and taking the samples corresponding to the K smallest distances in the text-modality retrieval sample set as the retrieval result, where K is a preset query parameter;

sorting the computed distances d_{TI}(q, m) in ascending order and taking the samples corresponding to the K smallest distances in the image-modality retrieval sample set as the retrieval result.
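The final ranking step can be sketched as a stable ascending sort over the computed distances:

```python
import numpy as np

def top_k_retrieval(distances, K):
    """Sort the query-to-sample distances ascending and return the indices
    of the K nearest samples in the retrieval set."""
    return np.argsort(distances, kind="stable")[:K].tolist()

dists = np.array([0.9, 0.1, 0.5, 0.3])
result = top_k_retrieval(dists, K=2)
```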
According to another embodiment of the present invention, there is provided a cross-modal retrieval system based on semantic identification, including:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.
The cross-modal retrieval method and the cross-modal retrieval system based on semantic identification have the advantages that:
(1) By first learning the modality-specific feature representation of each modality with two feed-forward subnetworks, then learning the modality-shared feature representation of each modality with a common subnetwork, and combining the learned modality-specific representations with the shared representations, the modality gap between paired shared features from different modalities can be reduced, which facilitates efficient cross-modal retrieval.
(2) An adversarial mechanism is adopted for network training. The generative model in the network learns to predict the semantic labels of the modality-specific and modality-shared feature representations, models inter-modal and intra-modal similarity based on the label information, and ensures the difference between the modality-specific and modality-shared representations, so that the learned features are semantically discriminative both within and across modalities and the complementary information of multi-modal data can be effectively mined.
(3) Under the same adversarial mechanism, the identification model in the network learns to identify the modality information of the modality-shared features, thereby improving their modality invariance.
(4) A modality classifier built from a two-layer subnetwork serves as the adversary: it identifies the modality information hidden in the unknown modality-shared feature representations, which effectively reduces the differences between modalities.
Drawings
FIG. 1 is a flow chart of a cross-modal retrieval method based on semantic identification according to an embodiment of the present invention;
FIG. 2 is a flow chart of substep S01 of an embodiment of the present invention;
FIG. 3 is a flow chart of substep S02 of an embodiment of the present invention;
FIG. 4 is a flow chart of substep S03 of an embodiment of the present invention;
FIG. 5 is a flow chart of sub-step S031 of an embodiment of the present invention;
FIG. 6 is a flowchart of sub-step S032 according to an embodiment of the present invention;
fig. 7 is a flow chart of sub-step S033 of an embodiment of the present invention;
FIG. 8 is a flow chart of substep S04 of an embodiment of the present invention;
FIG. 9 is a flow chart of substep S05 of an embodiment of the present invention;
FIG. 10 is a flowchart of substep S06 of an embodiment of the present invention;
FIG. 11 is a flow chart of substep S07 of an embodiment of the present invention;
FIG. 12 is a structural diagram of a cross-modal search system based on semantic identification according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all such changes fall within the scope of the present invention.
The invention provides a cross-modal retrieval method based on semantic identification, which is shown in a flow chart of figure 1 and comprises the following steps:
s01, acquiring an image feature space, a text feature space and a semantic category label;
s02, establishing modality specific feature representation and shared feature representation of an image modality and a text modality;
s03, establishing a generation model according to the semantic similarity between the modalities and in the modalities;
s04, constructing an identification model of an image mode and a text mode;
s05, training a network to obtain the characteristics of the query sample and the characteristics of the sample in the retrieval sample set according to a countermeasure mechanism between the generation model and the identification model;
s06, calculating Euclidean distances from the query sample to each sample in the retrieval sample set;
and S07, retrieving the query sample by using a cross-modal retriever.
In an exemplary embodiment, the step S01, shown in fig. 2, includes:
step S011, acquiring the N image-modality sample features of the training data set as I = [i_1, …, i_n, …, i_N] and the N text-modality sample features as T = [t_1, …, t_n, …, t_N], where i_n is the n-th sample feature in the image-modality sample feature set I and t_n is the n-th sample feature in the text-modality sample feature set T;

step S012, determining the instance set of paired image-modality and text-modality features O = {o_n}_{n=1}^N, where each instance o_n = (i_n, t_n) comprises an image-modality feature vector i_n ∈ R^{d_i} and a text-modality feature vector t_n ∈ R^{d_t}; d_i and d_t denote the feature dimensions of the image and text modalities, respectively, and d_i ≠ d_t;

step S014, defining the semantic class label l_n = [l_{n1}, …, l_{nc}, …, l_{nC}]^T to represent the class of the n-th sample, where C is the total number of sample classes; l_{nc} = 1 if the n-th sample belongs to class c, and l_{nc} = 0 otherwise.
In an exemplary embodiment, the step S02, as shown in fig. 3, includes:
step S021, learning the modality-specific feature representations of the image and text modalities with two feed-forward subnetworks, i.e., using a three-layer subnetwork in each modality to learn S_I = f_I(I; θ_I) and S_T = f_T(T; θ_T), where f_I(·; θ_I) and f_T(·; θ_T) are the mapping functions of the image and text modalities, and θ_I and θ_T are the parameters of the corresponding three-layer subnetworks;

step S022, using a common subnetwork to learn a modality-shared feature representation for each modality, i.e., mapping the intermediate representations I_f and T_f into the shared space through a shared two-layer subnetwork, learning C_I = f_s(I_f; θ_s) and C_T = f_s(T_f; θ_s), where f_s(·; θ_s) is the shared mapping function, θ_s are the parameters of the shared two-layer subnetwork, and d_f = d_s.

In this embodiment, three-layer subnetworks consisting of three fully-connected layers with dimensions [1024, 512, 128] are used in the image modality and the text modality, respectively, to learn the modality-specific feature representations S_I = f_I(I; θ_I) and S_T = f_T(T; θ_T), where f_I and f_T are the mapping functions of the image and text modalities and θ_I and θ_T the parameters of the respective three-layer subnetworks.

A shared two-layer subnetwork consisting of two fully-connected layers with dimensions [128, 128] maps the representations I_f and T_f into the shared space, learning the modality-shared representations C_I = f_s(I_f; θ_s) and C_T = f_s(T_f; θ_s), where f_s is the shared mapping function, θ_s are the parameters of the shared two-layer subnetwork, and d_f = d_s.
In an exemplary embodiment, the step S03, shown in fig. 4, includes:
step S031, label prediction is carried out by using a forward single-layer sub-network as a classifier;
step S032, performing inter-modal and intra-modal semantic similarity modeling based on depth measurement learning;
step S033, distinguishing the modality specific feature representation and the corresponding modality shared feature representation, and calculating a total loss function of the generative model.
In an exemplary embodiment, the sub-step S031, whose flow chart is shown in fig. 5, includes:
step S0311, using a forward single-layer subnetwork activated by Softmax as the classifier, so that when the input is a modality-specific feature S_u or a modality-shared feature C_u, the corresponding semantic-class probability distribution p(S_u) or p(C_u) is output;

step S0312, constructing the semantic discrimination loss function from the probability distribution as L_sd = −(1/U) Σ_{u=1}^{U} l_u^T [ log p(S_u^I) + log p(C_u^I) + log p(S_u^T) + log p(C_u^T) ], where υ denotes the parameters of the classifier and U the number of instances in each mini-batch.

In this embodiment, in order to make the generated features semantically discriminative, a forward single-layer subnetwork activated by Softmax is used as the classifier, so that for an input S_u or C_u it can output the corresponding semantic-class probability distribution p(S_u) or p(C_u).

From this probability distribution, the present embodiment defines the semantic discrimination loss as shown in equation (1): L_sd = −(1/U) Σ_{u=1}^{U} l_u^T [ log p(S_u^I) + log p(C_u^I) + log p(S_u^T) + log p(C_u^T) ], where υ represents the parameters of the classifier and U the number of instances in each mini-batch. In the embodiment of the invention, the dimension of the single-layer network is equal to the number of semantic categories.
In an exemplary embodiment, judging whether samples are semantically related across modalities proceeds by computing a semantic relevance value from the number of semantically associated words and/or the similarity of semantic keywords and/or the similarity of semantic objects, and judging the samples to be semantically related when the relevance value exceeds a preset semantic relevance threshold.
The number of semantic associated words refers to any item of the number or the proportion of semantically associated texts, and is represented by a variable p.
The similarity degree of the semantic keywords is expressed by a variable q according to the ratio of the number of the same semantic keywords to the total number of the semantic keywords or any item of the influence coefficient of the same semantic keywords on the semantics.
The semantic object similarity is expressed by a variable w according to a ratio of the number of the same semantic objects to the total number of the semantic objects or any item of influence coefficients of the same semantic objects on semantics.
The semantic relevance value, denoted by the variable z, is calculated from the number of semantically associated words and/or the similarity of semantic keywords and/or the similarity of semantic objects in any one of the following ways, each positively correlated with z:
- from the number of semantically associated words p alone;
- from the semantic keyword similarity q alone;
- from the semantic object similarity w alone;
- from p together with q;
- from p together with w;
- from q together with w;
- from p, q and w together.
A1 to A7 in Table A represent the different implementations for calculating the semantic relevance value; the number of semantically associated words p, the semantic keyword similarity q and the semantic object similarity w referred to in Table A are obtained with the formulas of the respective implementation.

Table A: different implementations for calculating the semantic relevance value

In this embodiment, with a preset semantic relevance threshold Z = 0.7, the semantic relevance value z of a pair of modality samples is computed according to any row of Table A (e.g., A7); if z > Z, the samples are judged to be semantically related across modalities.
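Since the Table A formulas are not reproduced legibly in this copy, the following sketch substitutes a plain mean of whichever of p, q and w are supplied; this preserves the required positive correlation with z but is otherwise an assumption:

```python
def semantic_relevance(p=None, q=None, w=None):
    """Combine whichever of p (associated-word count/ratio), q (keyword
    similarity) and w (object similarity) are given into a relevance
    value z; a plain mean stands in for the Table A formulas, keeping
    z positively correlated with each input."""
    vals = [v for v in (p, q, w) if v is not None]
    return sum(vals) / len(vals)

Z = 0.7                                  # preset semantic relevance threshold
z = semantic_relevance(p=0.8, q=0.9, w=0.7)
related = z > Z                          # judged semantically related if z > Z
```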
In an exemplary embodiment, the sub-step S032, shown in fig. 6, includes the following steps:
step S0321, calculate the Euclidean distance between features, i.e. for each pair of instances o u And o v The distance between its features is defined as
Step S03222, establishing a contrast loss functionWhere h (x) = max (0, x) is a change loss function, E = { (u, v) } is a pair-wise index set of feature pairs in each minibatch having the same semantic label, D = { (u, v) } is a pair-wise index set of feature pairs in each minibatch having different semantic labels, | E | and D denote the sizes of the set E and the set D, respectively, and τ is a positive threshold.
In this embodiment, based on the idea of deep metric learning, the similarity of features is measured by the Euclidean distance between them: features with the same semantic category should be drawn closer together, and features with different semantic categories pushed apart. For each pair of instances o_u and o_v, the distance between their features is defined as d_c(u, v) = ||F_u − F_v||_2, where F_u and F_v denote the feature representations of o_u and o_v.
d_c(u, v) not only describes the intra-modal distance between modality-specific feature representations, but also characterizes the inter-modal distance between modality-shared feature representations.
The contrast loss function is established as shown in equation (3):
L_c = (1/|E|) Σ_{(u,v)∈E} h(d_c(u, v) − τ) + (1/|D|) Σ_{(u,v)∈D} h(τ − d_c(u, v)),
where h(x) = max(0, x) is the hinge loss function, E = {(u, v)} is the pairwise index set of feature pairs in each mini-batch having the same semantic label, D = {(u, v)} is the pairwise index set of feature pairs in each mini-batch having different semantic labels, |E| and |D| denote the sizes of sets E and D, respectively, and τ is a positive threshold. In this embodiment, a grid search strategy is used to tune the hyper-parameter τ over the range [1, 10] with step size 1; it is set to τ = 6.
The contrast loss function L_c of equation (3) imposes constraints on the distance d_c(u, v) so that the feature distance of intra-class pairs is smaller than the threshold τ and the feature distance of inter-class pairs is larger than τ, thereby facilitating the learning of discriminative features.
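A minimal sketch of the contrast loss over one mini-batch, assuming plain Euclidean distances and a double-hinge form in which intra-class distances above τ and inter-class distances below τ are penalized; the function name and the NumPy formulation are illustrative:

```python
import numpy as np

def contrast_loss(feats, labels, tau=6.0):
    """Contrast loss sketch: E collects same-label pairs, D different-label
    pairs; h(x) = max(0, x) is the hinge.  Intra-class pairs are pushed
    below the threshold tau, inter-class pairs above it."""
    n = len(feats)
    intra, inter = [], []
    for u in range(n):
        for v in range(u + 1, n):
            d = np.linalg.norm(feats[u] - feats[v])   # d_c(u, v)
            if labels[u] == labels[v]:
                intra.append(max(0.0, d - tau))       # pair in E
            else:
                inter.append(max(0.0, tau - d))       # pair in D
    loss = 0.0
    if intra:
        loss += sum(intra) / len(intra)               # average over |E|
    if inter:
        loss += sum(inter) / len(inter)               # average over |D|
    return loss
```

With τ = 6, a same-label pair at distance 0 and a different-label pair at distance 10 both contribute zero loss, matching the intended constraints.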
In an exemplary embodiment, step S033, whose flowchart is shown in fig. 7, includes:
step S0331, using a large-margin loss function L_lm = h(ζ − d_I) + h(ζ − d_T) to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, in order to effectively learn the complementary information in the data of different modalities, where h(x) = max(0, x) and ζ is a positive threshold;
step S0332, applying the threshold ζ of the large-margin loss function L_lm as a constraint on the distances d_I and d_T between the modality-specific and modality-shared representations, requiring both distances to be greater than ζ, so that the modality-specific feature representations are distinguished from the modality-shared feature representations;
step S0333, combining the semantic discrimination loss function, the contrast loss function, and the large-margin loss function to obtain the total loss function of the generative model, L = L_sd + α·L_c + β·L_lm, where α and β are balance factors that weight the three terms L_sd, L_c, and L_lm.
In this embodiment, for a particular image or text there should be a difference between its modality-specific feature representation and the corresponding modality-shared feature representation. This embodiment therefore uses the large-margin loss of equation (4) to distinguish the two, so as to effectively learn the complementary information in the data of different modalities:
L_lm = h(ζ − d_I) + h(ζ − d_T), (4)
where h(x) = max(0, x) and ζ is a positive threshold. In this embodiment, a grid search strategy is used to tune the hyper-parameter ζ over the range [1, 10] with step size 1; it is set to ζ = 5. The distances d_I and d_T are defined as the Euclidean distances between the modality-specific feature representations and the corresponding modality-shared feature representations of the image modality and the text modality, respectively.
The large-margin loss of equation (4) applies the threshold ζ as a constraint on the distances d_I and d_T, requiring both to be greater than ζ; it thus acts as a feature-component discriminator, i.e., it distinguishes modality-specific feature representations from modality-shared feature representations.
Combining the semantic discrimination loss (equation 1), the contrast loss (equation 3), and the large-margin loss (equation 4) yields the total loss function of the generative model:
L = L_sd + α·L_c + β·L_lm,
where α and β are balance factors that weight the three terms L_sd, L_c, and L_lm. In this embodiment, a grid search strategy is used to tune the balance factors α and β over the range [0.01, 100] in multiplicative steps of 10; they are set to α = 0.1 and β = 0.1.
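The large-margin term and the combined objective can be sketched as follows, under the assumption that corresponding rows of the modality-specific and modality-shared arrays belong to the same instance and that the per-instance hinges are averaged; both conventions are illustrative:

```python
import numpy as np

def large_margin_loss(specific, shared, zeta=5.0):
    # Hinge h(zeta - d): penalize instances whose modality-specific and
    # modality-shared representations are closer than the margin zeta.
    d = np.linalg.norm(specific - shared, axis=1)
    return float(np.mean(np.maximum(0.0, zeta - d)))

def total_generator_loss(l_sd, l_c, l_lm, alpha=0.1, beta=0.1):
    # Total loss of the generative model: L = L_sd + alpha*L_c + beta*L_lm,
    # with alpha and beta the balance factors tuned by grid search.
    return l_sd + alpha * l_c + beta * l_lm
```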
In an exemplary embodiment, the step S04, shown in fig. 8, includes:
step S041, constructing a modality classifier from a two-layer sub-network and using it as the adversary;
step S042, assigning an independent modality label vector to each item in each instance;
step S043, constructing the adversarial loss function L_adv(θ_A) = −(1/2N) Σ_{u=1}^{N} [g_u · log P(î_u) + g_u · log P(t̂_u)], where P(î_u) is the modality probability of the generated feature representation î_u, P(t̂_u) is the modality probability of the generated feature representation t̂_u, θ_A denotes the parameters of the modality classifier, and g_u is the true modality label of each item i_u or t_u of instance o_u.
In this embodiment, to reduce the difference between modalities, a modality classifier is built from a two-layer sub-network and used as the adversary; its goal is to identify the modality of a modality-shared feature representation whose origin is unknown. In this embodiment, the two-layer network has dimensions [64, 2], is activated by the ReLU function, and is followed by a Softmax activation after the last layer.
An independent modality label vector is assigned to each item in each instance to indicate whether it belongs to the image modality or the text modality.
The adversarial loss function is constructed as shown in equation (6):
L_adv(θ_A) = −(1/2N) Σ_{u=1}^{N} [g_u · log P(î_u) + g_u · log P(t̂_u)],
where P(î_u) is the modality probability of the generated feature representation î_u, P(t̂_u) is the modality probability of the generated feature representation t̂_u, θ_A denotes the parameters of the modality classifier, and g_u is the true modality label of each item (i_u or t_u) of instance o_u.
Using the adversarial loss function L_adv(θ_A) of equation (6) effectively reduces the heterogeneous gap between modalities.
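The adversarial loss can be sketched as a negative log-likelihood over the modality classifier's Softmax outputs; the stacking of image and text items into one array and the 0/1 modality encoding are assumptions for illustration:

```python
import numpy as np

def adversarial_loss(probs, modality_labels):
    """probs: (2N, 2) Softmax outputs of the modality classifier for the
    generated image and text shared representations; modality_labels:
    true modality of each item (0 = image, 1 = text).  Returns the mean
    negative log-probability assigned to the true modality."""
    idx = np.arange(len(probs))
    picked = probs[idx, modality_labels]      # probability of the true modality
    return float(-np.mean(np.log(picked)))    # cross-entropy over all items
```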
In an exemplary embodiment, the step S05, as shown in fig. 9, includes:
step S051, obtaining the optimal feature representation by jointly minimizing the loss functions of the generative model and the discriminative model, i.e., using a min-max game to optimize two concurrent sub-processes, minimization of the generative loss and minimization of the adversarial loss;
step S052, denoting the feature vector of an image-modality query sample as i_q and that of a text-modality query sample as t_q; the samples in the image-modality retrieval sample set have features {i_m} and those in the text-modality retrieval sample set have features {t_m}, m = 1, ..., M, where M denotes the number of samples in the retrieval sample set;
step S053, the modality-specific and modality-shared feature representations of the image-modality query sample are s^I_q and c^I_q, respectively, and those of the text-modality query sample are s^T_q and c^T_q;
step S054, the modality-specific and modality-shared feature representations of the samples in the image-modality retrieval sample set are {s^I_m} and {c^I_m}, and those of the samples in the text-modality retrieval sample set are {s^T_m} and {c^T_m}, m = 1, ..., M.
In this embodiment, the optimal feature representation is obtained by jointly minimizing the loss functions of the generative model and the discriminative model. Since the optimization objectives of the two models are opposed, the method uses a min-max game to optimize the two concurrent sub-processes given by equations (7) and (8): minimization of the generative loss over the generator parameters, and minimization of the adversarial loss over the classifier parameters θ_A.
The min-max game is implemented with a stochastic gradient descent algorithm. For better min-max optimization, this embodiment adds a Gradient Reversal Layer (GRL) before the first layer of the modality classifier, and sets the batch size of the dataset to 128 during training.
Assume the feature vector of an image-modality query sample is i_q, and the samples in the image-modality retrieval sample set have features {i_m}, m = 1, ..., M, where M denotes the number of samples in the retrieval sample set.
Assume the feature vector of a text-modality query sample is t_q, and the samples in the text-modality retrieval sample set have features {t_m}, m = 1, ..., M.
The modality-specific and modality-shared feature representations of the image-modality query sample are s^I_q and c^I_q, respectively.
The modality-specific and modality-shared feature representations of the text-modality query sample are s^T_q and c^T_q, respectively.
The modality-specific and modality-shared feature representations of the samples in the image-modality retrieval sample set are {s^I_m} and {c^I_m}, m = 1, ..., M.
The modality-specific and modality-shared feature representations of the samples in the text-modality retrieval sample set are {s^T_m} and {c^T_m}, m = 1, ..., M.
in an exemplary embodiment, the step S06, shown in fig. 10, includes:
step S061, for an image-modality query sample i_q, using the distance formula to compute the distance d_{I→T}(q, m) from the query to each sample in the text-modality retrieval sample set, where d_{I→T}(q, m) is the Euclidean distance between the feature formed by concatenating the modality-specific and modality-shared representations of the query sample and the correspondingly concatenated feature of the retrieval sample;
step S062, for a text-modality query sample t_q, using the distance formula to compute the distance d_{T→I}(q, m) from the query to each sample in the image-modality retrieval sample set, where d_{T→I}(q, m) is the Euclidean distance between the concatenated modality-specific and modality-shared feature of the query sample and the correspondingly concatenated feature of the retrieval sample.
In this embodiment, for an image-modality query sample i_q, the distance d_{I→T}(q, m) to each sample in the text-modality retrieval sample set is the Euclidean distance between the concatenated modality-specific and modality-shared features of the query and of the retrieval sample; for a text-modality query sample t_q, the distance d_{T→I}(q, m) to each sample in the image-modality retrieval sample set is defined analogously.
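The distance computation of steps S061 and S062 can be sketched by concatenating each sample's modality-specific and modality-shared representations; the concatenation order is an assumption:

```python
import numpy as np

def cross_modal_distances(q_specific, q_shared, db_specific, db_shared):
    """Euclidean distance from one query to every retrieval sample of the
    other modality, computed on the concatenation of modality-specific
    and modality-shared feature representations."""
    q = np.concatenate([q_specific, q_shared])             # combined query feature
    db = np.concatenate([db_specific, db_shared], axis=1)  # combined db features
    return np.linalg.norm(db - q, axis=1)                  # one distance per sample
```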
In an exemplary embodiment, the step S07, shown in fig. 11, includes:
step S071, sorting the computed distances d_{I→T}(q, m) in ascending order and taking the samples corresponding to the K smallest distances in the text retrieval sample set as the retrieval result, where K is a preset query parameter;
step S072, sorting the computed distances d_{T→I}(q, m) in ascending order and taking the samples corresponding to the K smallest distances in the image retrieval sample set as the retrieval result.
In this embodiment, the computed distances d_{I→T}(q, m) are sorted in ascending order and the samples with the K smallest distances in the text retrieval sample set are returned as the retrieval result; likewise, the computed distances d_{T→I}(q, m) are sorted in ascending order and the samples with the K smallest distances in the image retrieval sample set are returned as the retrieval result.
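The ranking of steps S071 and S072 reduces to a top-K nearest-neighbour selection over the computed distances; a minimal sketch:

```python
import numpy as np

def top_k(distances, k):
    # Sort distances in ascending order and return the indices of the K
    # nearest retrieval samples; these index the retrieval result list.
    return np.argsort(distances)[:k]
```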
A cross-modal retrieval system based on semantic identification according to an embodiment of the present invention is shown in fig. 12, and includes:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.
Of course, those skilled in the art should realize that the above embodiments are only used for illustrating the present invention, and not as a limitation of the present invention, and that changes and modifications to the above embodiments are within the scope of the present invention.
Claims (10)
1. A cross-modal retrieval method based on semantic identification is characterized by comprising the following steps:
acquiring an image feature space, a text feature space and a semantic category label;
establishing modality specific feature representation and shared feature representation of an image modality and a text modality;
establishing a generative model according to inter-modal and intra-modal semantic similarity;
constructing a discriminative model for the image modality and the text modality;
training the network according to the adversarial mechanism between the generative model and the discriminative model to obtain the features of the query sample and of the samples in the retrieval sample set;
calculating the Euclidean distance from the query sample to each sample in the retrieval sample set;
a query sample is retrieved using a cross-modality retriever.
2. The semantic identification-based cross-modal retrieval method of claim 1, wherein the obtaining of the image feature space, the text feature space and the semantic class labels comprises the steps of:
the N image mode sample characteristics for obtaining the training data set are I = [ I = 1 ,...,i n ,...,i N ]N text mode samples are characterized by T = [ T ] 1 ,...,t n ,...,t N ]Wherein i n Is the nth sample feature, t, in the image modality sample feature dataset I n Is the first in the text modal sample feature data set Tn sample features;
determining a set of image modality features and text modality feature instancesWherein each instance o n =(i n ,t n ) Includes an image mode feature vectorAnd a text modal feature vectord i And d t Respectively representing characteristic dimensions of an image modality and a text modality, and d i ≠d t ;
Defining semantic class labels l n =[l n1 ,...,l nc ,...,l nC ] T Represents the class of the nth sample, where C represents the total number of sample classes, and if the nth sample belongs to class C, then l nc =1, otherwise l nc =0。
3. The semantic identification-based cross-modal retrieval method of claim 2, wherein the establishing of the modal-specific feature representation and the shared feature representation of the image modality and the text modality comprises the steps of:
learning the modality-specific feature representations of the image and text modalities using two feed-forward sub-networks, i.e., using a three-layer sub-network in each of the image and text modalities to learn the modality-specific representations f_I(I; θ_I) and f_T(T; θ_T), where f_I(I; θ_I) and f_T(T; θ_T) are the mapping functions of the image modality and the text modality, respectively, and θ_I and θ_T are the parameters of the three-layer sub-networks of the image and text modalities;
using a common sub-network to learn the modality-shared feature representation of each modality, i.e., using shared two-layer sub-networks to map the feature representations into a shared space and learn the modality-shared representations of the image and text modalities, with shared mapping functions, the parameters of the shared two-layer sub-network, and d_f = d_s.
4. The cross-modal retrieval method based on semantic identification according to claim 3, wherein the establishing of the generative model according to inter-modal and intra-modal semantic similarity comprises the steps of:
performing label prediction using a forward single-layer sub-network as a classifier;
performing inter-modal and intra-modal semantic similarity modeling based on deep metric learning;
distinguishing the modality-specific feature representations from the corresponding modality-shared feature representations and calculating the total loss function of the generative model.
5. The cross-modal retrieval method based on semantic identification according to claim 4, wherein the label prediction using a forward single-layer sub-network as a classifier comprises the steps of:
using a forward single-layer sub-network activated by Softmax as the classifier, so that when a modality-specific or modality-shared feature representation is input, the corresponding probability distribution over the semantic classes is output.
6. The cross-modal retrieval method based on semantic identification according to claim 5, wherein the inter-modal and intra-modal semantic similarity modeling based on deep metric learning comprises the steps of:
computing the Euclidean distance between features, i.e., for each pair of instances o_u and o_v the distance between their features is defined as d_c(u, v) = ||F_u − F_v||_2, where F_u and F_v denote the feature representations of o_u and o_v;
establishing the contrast loss function L_c = (1/|E|) Σ_{(u,v)∈E} h(d_c(u, v) − τ) + (1/|D|) Σ_{(u,v)∈D} h(τ − d_c(u, v)), where h(x) = max(0, x) is the hinge loss function, E = {(u, v)} is the pairwise index set of feature pairs in each mini-batch having the same semantic label, D = {(u, v)} is the pairwise index set of feature pairs in each mini-batch having different semantic labels, |E| and |D| denote the sizes of sets E and D, respectively, and τ is a positive threshold.
7. The cross-modal retrieval method based on semantic identification according to claim 6, wherein the distinguishing of modality-specific feature representations and corresponding modality-shared feature representations and the calculating of the total loss function of the generative model comprise the steps of:
using a large-margin loss function L_lm = h(ζ − d_I) + h(ζ − d_T) to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, in order to effectively learn the complementary information in the data of different modalities, where h(x) = max(0, x) and ζ is a positive threshold;
applying the threshold ζ of the large-margin loss function as a constraint on the distances d_I and d_T between the modality-specific and modality-shared representations, requiring both to be greater than ζ, thereby distinguishing the modality-specific feature representations from the modality-shared feature representations;
8. The cross-modal retrieval method based on semantic identification according to claim 7, wherein the constructing of the discriminative model for the image modality and the text modality comprises the steps of:
constructing a modality classifier from a two-layer sub-network and using it as the adversary;
assigning an independent modality label vector to each item in each instance;
constructing the adversarial loss function L_adv(θ_A) = −(1/2N) Σ_{u=1}^{N} [g_u · log P(î_u) + g_u · log P(t̂_u)], where P(î_u) is the modality probability of the generated feature representation î_u, P(t̂_u) is the modality probability of the generated feature representation t̂_u, θ_A denotes the parameters of the modality classifier, and g_u is the true modality label of each item i_u or t_u of instance o_u.
9. The cross-modal retrieval method based on semantic identification according to claim 8, wherein the training of the network according to the adversarial mechanism between the generative model and the discriminative model to obtain the features of the query sample and of the samples in the retrieval sample set comprises the steps of:
obtaining the optimal feature representation by jointly minimizing the loss functions of the generative model and the discriminative model, i.e., using a min-max game to optimize two concurrent sub-processes, minimization of the generative loss and minimization of the adversarial loss;
denoting the feature vector of an image-modality query sample as i_q and that of a text-modality query sample as t_q; the samples in the image-modality retrieval sample set have features {i_m} and those in the text-modality retrieval sample set have features {t_m}, m = 1, ..., M, where M denotes the number of samples in the retrieval sample set;
the modality-specific and modality-shared feature representations of the image-modality query sample are s^I_q and c^I_q, respectively, and those of the text-modality query sample are s^T_q and c^T_q;
the modality-specific and modality-shared feature representations of the samples in the image-modality retrieval sample set are {s^I_m} and {c^I_m}, and those of the samples in the text-modality retrieval sample set are {s^T_m} and {c^T_m}, m = 1, ..., M.
10. a cross-modal retrieval system based on semantic discrimination, comprising:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210875146.5A CN115309930A (en) | 2022-07-25 | 2022-07-25 | Cross-modal retrieval method and system based on semantic identification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115309930A true CN115309930A (en) | 2022-11-08 |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116468037A (en) * | 2023-03-17 | 2023-07-21 | 北京深维智讯科技有限公司 | NLP-based data processing method and system |
CN116821408A (en) * | 2023-08-29 | 2023-09-29 | 南京航空航天大学 | Multi-task consistency countermeasure retrieval method and system |
CN116821408B (en) * | 2023-08-29 | 2023-12-01 | 南京航空航天大学 | Multi-task consistency countermeasure retrieval method and system |
CN117112829A (en) * | 2023-10-24 | 2023-11-24 | 吉林大学 | Medical data cross-modal retrieval method and device and related equipment |
CN117112829B (en) * | 2023-10-24 | 2024-02-02 | 吉林大学 | Medical data cross-modal retrieval method and device and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||