CN115309930A - Cross-modal retrieval method and system based on semantic identification - Google Patents

Cross-modal retrieval method and system based on semantic identification

Info

Publication number
CN115309930A
Authority
CN
China
Prior art keywords
modality
sample
semantic
text
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210875146.5A
Other languages
Chinese (zh)
Inventor
董西伟
潘方
严军荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunwave Communications Co Ltd
Original Assignee
Sunwave Communications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunwave Communications Co Ltd
Priority to CN202210875146.5A
Publication of CN115309930A
Legal status: Pending

Classifications

    • G06F16/583 — Information retrieval of still image data; retrieval characterised by metadata automatically derived from the content
    • G06F16/332 — Information retrieval of unstructured textual data; query formulation
    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/3347 — Query execution using a vector-based model
    • G06F16/353 — Clustering; classification of unstructured textual data into predefined classes
    • G06F16/532 — Query formulation for still image data, e.g. graphical querying
    • G06F16/55 — Clustering; classification of still image data
    • G06N3/08 — Learning methods for neural networks (computing arrangements based on biological models)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method and system based on semantic identification. The method comprises the following steps: acquiring an image feature space, a text feature space and semantic category labels; establishing modality-specific feature representations and shared feature representations for the image modality and the text modality; establishing a generative model according to inter-modality and intra-modality semantic similarity; constructing an identification model for the image modality and the text modality; training the network according to the adversarial mechanism between the generative model and the identification model and solving the network parameters; acquiring the features of the query sample and of the samples in the retrieval sample set according to the network parameters; calculating the Euclidean distance from the query sample to each sample in the retrieval sample set; and retrieving the query sample with a cross-modal retriever. The invention solves the problem in the related art that cross-modal retrieval efficiency is reduced because the complementary information of multi-modal data cannot be effectively mined.

Description

Cross-modal retrieval method and system based on semantic identification
Technical Field
The invention belongs to the technical field of multimedia, and particularly relates to a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification.
Background
In the internet and in daily life, people are confronted with massive data in multimodal forms such as images, text, video and audio. The existence of huge multimodal databases has greatly stimulated the need for cross-modal retrieval in search engines and digital libraries, such as searching for relevant images with a text query, or for relevant videos with an audio query. Unlike conventional single-modality retrieval tasks (e.g., image retrieval), which require the query sample and the retrieval results to belong to the same modality, cross-modal retrieval is a more flexible application in which a query from any modality can be used to find relevant information in other modalities.
Since data of different modalities usually have inconsistent distributions and representations, the similarity of data from different modalities cannot be measured directly. To address this problem, many cross-modal retrieval methods have emerged. Traditional methods mainly mine the correlation of data from different modalities by learning linear projections, for example methods based on canonical correlation analysis. With the rapid development of deep learning, methods based on Deep Neural Networks (DNNs) have become the mainstream approach to bridging the modality gap. Most existing methods focus on mining modality-shared information: data of different modalities are mapped into a common space to obtain a common representation, while the mining and use of modality-specific information is not considered. As a result, the complementary information of multi-modal data cannot be effectively mined, and cross-modal retrieval efficiency is reduced.
In order to effectively mine the complementary information of multi-modal data and to reduce the modality gap between paired features from different modalities, a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification are provided.
Disclosure of Invention
The embodiment of the invention provides a cross-modal retrieval method and a cross-modal retrieval system based on semantic identification, which at least solve the problem in the related art that cross-modal retrieval efficiency is reduced because the complementary information of multi-modal data cannot be effectively mined.
According to an embodiment of the present invention, there is provided a cross-modal retrieval method based on semantic identification, including:
acquiring an image feature space, a text feature space and a semantic category label;
establishing modality specific feature representation and shared feature representation of an image modality and a text modality;
establishing a generation model according to the semantic similarity between the modalities and in the modalities;
constructing an identification model of an image mode and a text mode;
training a network according to a countermeasure mechanism between the generation model and the identification model and solving network parameters;
acquiring the characteristics of the query sample and the characteristics of the sample in the retrieval sample set according to the network parameters;
calculating the Euclidean distance from the query sample to each sample in the retrieval sample set;
query samples are retrieved using a cross-modality retriever.
In an exemplary embodiment, the obtaining of the image feature space, the text feature space and the semantic category labels includes the steps of:
acquiring the N image-modality sample features of the training data set as I = [i_1, ..., i_n, ..., i_N] and the N text-modality sample features as T = [t_1, ..., t_n, ..., t_N], where i_n is the n-th sample feature in the image-modality sample feature data set I and t_n is the n-th sample feature in the text-modality sample feature data set T;
determining the instance set of image-modality and text-modality features O = {o_n}_{n=1}^N, where each instance o_n = (i_n, t_n) comprises an image-modality feature vector i_n ∈ R^{d_i} and a text-modality feature vector t_n ∈ R^{d_t}; d_i and d_t respectively represent the feature dimensions of the image modality and the text modality, and d_i ≠ d_t;
determining the image-modality feature data set I = {i_n}_{n=1}^N and the text-modality feature data set T = {t_n}_{n=1}^N;
defining the semantic category label l_n = [l_{n1}, ..., l_{nc}, ..., l_{nC}]^T to represent the category of the n-th sample, where C represents the total number of sample categories; if the n-th sample belongs to category c then l_{nc} = 1, otherwise l_{nc} = 0.
In an exemplary embodiment, the establishing of the modality-specific feature representations and the shared feature representations of the image modality and the text modality includes the steps of:
learning the modality-specific feature representations of the image modality and the text modality with two feed-forward sub-networks, i.e., using a three-layer sub-network in each of the image modality and the text modality to learn the modality-specific feature representations I^s = f_I(I; θ_I) and T^s = f_T(T; θ_T), where f_I(I; θ_I) and f_T(T; θ_T) are the mapping functions of the image modality and the text modality, and θ_I and θ_T are respectively the parameters of the three-layer sub-networks of the image modality and the text modality;
using a common sub-network to learn the modality-shared feature representation of each modality, i.e., using a shared two-layer sub-network to map I^s and T^s into a shared space so as to learn the modality-shared feature representations I^f and T^f, where the shared mapping function is denoted f_C(·; θ_C), θ_C are the parameters of the shared two-layer sub-network, and d_f = d_s.
In an exemplary embodiment, the creating a generative model based on semantic similarity between modalities and within modalities includes the steps of:
using a forward single-layer subnetwork as a classifier to predict labels;
performing inter-modal and intra-modal semantic similarity modeling based on depth metric learning;
the modality-specific feature representations and the corresponding modality-sharing feature representations are distinguished and an overall loss function of the generative model is calculated.
In an exemplary embodiment, the label prediction using a forward single-layer sub-network as the classifier includes the steps of:
using a forward single-layer sub-network activated by Softmax as the classifier, so that when a generated feature representation of the image modality or of the text modality is input, the corresponding probability distribution over the semantic categories is output;
constructing, from these probability distributions, the semantic discrimination loss function L_sd(υ), where υ represents the parameters of the classifier and U represents the number of instances in each mini-batch.
In an exemplary embodiment, the depth-metric-learning-based inter-modality and intra-modality semantic similarity modeling includes the steps of:
computing the Euclidean distance between features to measure their similarity, with the requirement that the similarity of features with the same semantic category is enhanced and the similarity of features with different semantic categories is reduced;
for each pair of instances o_u and o_v, denoting the distance between their features d_c(u, v);
establishing the contrastive loss function L_c, where h(x) = max(0, x) is the hinge loss function, E = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have the same semantic label, D = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have different semantic labels, |E| and |D| represent the sizes of the sets E and D respectively, and τ is a positive threshold.
In an exemplary embodiment, the distinguishing of the modality-specific feature representations from the corresponding modality-shared feature representations and the calculating of the total loss function of the generative model includes the steps of:
using a large-margin loss function L_lm to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, so as to effectively learn the complementary information in the data of different modalities, where h(x) = max(0, x) and ζ is a positive threshold;
in the large-margin loss function L_lm, applying the threshold ζ as a constraint on the distance between each modality-specific feature representation and the corresponding modality-shared feature representation, requiring these distances to be greater than ζ and thereby distinguishing the modality-specific feature representations from the modality-shared feature representations;
combining the semantic discrimination loss function, the contrastive loss function and the large-margin loss function to obtain the total loss function of the generative model, L_gen = L_sd + α·L_c + β·L_lm, where α and β are balance factors that balance the terms L_sd, L_c and L_lm.
In an exemplary embodiment, the constructing of the identification model of the image modality and the text modality includes the steps of:
constructing a modality classifier by using a sub-network with two layers and using it as the adversary;
assigning an independent modality label vector to each item in each instance;
constructing the adversarial loss function L_adv(θ_A) from the modality probabilities that the modality classifier predicts for the generated modality-shared feature representations of the image modality and the text modality, where θ_A represents the parameters of the modality classifier and g_u is the true modality label of each item i_u or t_u in instance o_u.
In an exemplary embodiment, the training of the network according to the adversarial mechanism between the generative model and the identification model to obtain the features of the query sample and the features of the samples in the retrieval sample set comprises the steps of:
obtaining the optimal feature representations by jointly minimizing the loss functions of the generative model and the identification model, i.e., optimizing two concurrent sub-processes with a min-max game, one sub-process updating the parameters of the generative model and the other updating the parameters of the modality classifier;
assuming that the feature vector of an image-modality query sample is i_q, that the feature vector of a text-modality query sample is t_q, and that the image-modality retrieval sample set and the text-modality retrieval sample set each consist of a given number of samples;
computing, with the trained sub-networks, the modality-specific feature representation and the modality-shared feature representation of the image-modality query sample, and the modality-specific feature representation and the modality-shared feature representation of the text-modality query sample;
computing, in the same way, the modality-specific feature representations and the modality-shared feature representations of the samples in the image-modality retrieval sample set and of the samples in the text-modality retrieval sample set.
in one exemplary embodiment, the calculating the euclidean distance between the query sample and each sample in the retrieved sample set comprises the steps of:
query sample for image modalities
Figure BDA00037621211400000611
Computing query samples for image modalities using distance calculation formulas
Figure BDA00037621211400000612
Retrieving each sample in a sample set to a text modality
Figure BDA00037621211400000613
Of (2) is
Figure BDA00037621211400000614
Wherein the content of the first and second substances,
Figure BDA00037621211400000615
representing query samples
Figure BDA00037621211400000616
The combined modality-specific feature representation and modality-shared feature representation of
Figure BDA00037621211400000617
The euclidean distance of the combined modality-specific feature representation and modality-sharing feature representation of (a);
query sample for text modality
Figure BDA00037621211400000618
Computing query samples for text modalities using distance computation formulas
Figure BDA00037621211400000619
Retrieving each sample in a set of samples to an image modality
Figure BDA00037621211400000620
Is a distance of
Figure BDA00037621211400000621
Wherein the content of the first and second substances,
Figure BDA00037621211400000622
representing query samples
Figure BDA00037621211400000623
The combined modality-specific feature representation and modality-shared feature representation of
Figure BDA00037621211400000624
The modality-specific feature representation and the modality-sharing feature representation of (2) are combined to form a euclidean distance of the feature.
In one exemplary embodiment, the retrieving of the query sample using a cross-modal retriever includes the steps of:
sorting the distances computed from the image-modality query sample in ascending order, and taking the samples corresponding to the first K smallest distances in the text-modality retrieval sample set as the retrieval result, where K is a preset query parameter;
sorting the distances computed from the text-modality query sample in ascending order, and taking the samples corresponding to the first K smallest distances in the image-modality retrieval sample set as the retrieval result.
According to another embodiment of the present invention, there is provided a cross-modal retrieval system based on semantic identification, including:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.
The cross-modal retrieval method and the cross-modal retrieval system based on semantic identification have the advantages that:
(1) The modality gap between paired modality-shared features from different modalities can be reduced by first learning the modality-specific feature representation of each modality using two feed-forward sub-networks, then learning the modality-shared feature representation of each modality using a common sub-network, and combining the learned modality-specific feature representations with the shared feature representations; this facilitates efficient cross-modal retrieval.
(2) An adversarial mechanism is adopted for network training. The generative model in the network learns to predict the semantic labels of the modality-specific feature representations and the modality-shared feature representations, models inter-modality and intra-modality similarity based on the label information, and keeps the modality-specific feature representations distinct from the modality-shared feature representations, so that the learned features are semantically discriminative both across and within modalities and the complementary information of multi-modal data can be effectively mined.
(3) An adversarial mechanism is adopted for network training, and the identification model in the network learns to discriminate the modality information of the modality-shared features so as to improve modality invariance.
(4) A modality classifier built from a two-layer sub-network is used as the adversary to identify the modality information in the unknown modality-shared feature representations, which effectively reduces the difference between modalities.
Drawings
FIG. 1 is a flow chart of a cross-modal retrieval method based on semantic identification according to an embodiment of the present invention;
FIG. 2 is a flow chart of substep S01 of an embodiment of the present invention;
FIG. 3 is a flow chart of substep S02 of an embodiment of the present invention;
FIG. 4 is a flow chart of substep S03 of an embodiment of the present invention;
FIG. 5 is a flow chart of sub-step S031 of an embodiment of the present invention;
FIG. 6 is a flowchart of sub-step S032 according to an embodiment of the present invention;
FIG. 7 is a flow chart of sub-step S033 of an embodiment of the present invention;
FIG. 8 is a flow chart of substep S04 of an embodiment of the present invention;
FIG. 9 is a flow chart of substep S05 of an embodiment of the present invention;
FIG. 10 is a flowchart of substep S06 of an embodiment of the present invention;
FIG. 11 is a flow chart of substep S07 of an embodiment of the present invention;
FIG. 12 is a structural diagram of a cross-modal search system based on semantic identification according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art to further understand the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by one skilled in the art without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
The invention provides a cross-modal retrieval method based on semantic identification, which is shown in a flow chart of figure 1 and comprises the following steps:
s01, acquiring an image feature space, a text feature space and a semantic category label;
s02, establishing modality specific feature representation and shared feature representation of an image modality and a text modality;
s03, establishing a generation model according to the semantic similarity between the modalities and in the modalities;
s04, constructing an identification model of an image mode and a text mode;
s05, training a network to obtain the characteristics of the query sample and the characteristics of the sample in the retrieval sample set according to a countermeasure mechanism between the generation model and the identification model;
s06, calculating Euclidean distances from the query sample to each sample in the retrieval sample set;
and S07, retrieving the query sample by using a cross-modal retriever.
In an exemplary embodiment, the step S01, shown in fig. 2, includes:
step S011, acquiring the N image-modality sample features of the training data set as I = [i_1, ..., i_n, ..., i_N] and the N text-modality sample features as T = [t_1, ..., t_n, ..., t_N], where i_n is the n-th sample feature in the image-modality sample feature data set I and t_n is the n-th sample feature in the text-modality sample feature data set T;
step S012, determining the instance set of image-modality and text-modality features O = {o_n}_{n=1}^N, where each instance o_n = (i_n, t_n) comprises an image-modality feature vector i_n ∈ R^{d_i} and a text-modality feature vector t_n ∈ R^{d_t}; d_i and d_t respectively represent the feature dimensions of the image modality and the text modality, and d_i ≠ d_t;
step S013, determining the image-modality feature data set I = {i_n}_{n=1}^N and the text-modality feature data set T = {t_n}_{n=1}^N;
step S014, defining the semantic category label l_n = [l_{n1}, ..., l_{nc}, ..., l_{nC}]^T to represent the category of the n-th sample, where C represents the total number of sample categories; if the n-th sample belongs to category c then l_{nc} = 1, otherwise l_{nc} = 0.
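The data layout defined in steps S011-S014 can be made concrete with the short sketch below. It is illustrative only: the sizes N, C, d_i and d_t and the random features are assumed placeholder values, not values fixed by the invention.

```python
import numpy as np

N, C = 1000, 10          # number of paired instances and semantic categories (assumed)
d_i, d_t = 4096, 300     # image / text feature dimensions, d_i != d_t (assumed values)

I = np.random.randn(N, d_i).astype(np.float32)   # image-modality features i_1..i_N
T = np.random.randn(N, d_t).astype(np.float32)   # text-modality features t_1..t_N
classes = np.random.randint(0, C, size=N)        # category index of each instance

# semantic category labels l_n: one-hot vectors, l_nc = 1 iff sample n belongs to class c
L = np.zeros((N, C), dtype=np.float32)
L[np.arange(N), classes] = 1.0

# each instance o_n = (i_n, t_n) pairs an image feature with a text feature
instances = list(zip(I, T))
```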
In an exemplary embodiment, the step S02, as shown in fig. 3, includes:
step S021, adopting two feed-forward sub-networks to learn the modality-specific feature representations of the image modality and the text modality, i.e., using a three-layer sub-network in each of the image modality and the text modality to learn the modality-specific feature representations I^s = f_I(I; θ_I) and T^s = f_T(T; θ_T), where f_I(I; θ_I) and f_T(T; θ_T) are the mapping functions of the image modality and the text modality, and θ_I and θ_T are respectively the parameters of the three-layer sub-networks of the image modality and the text modality;
step S022, using a common sub-network to learn the modality-shared feature representation of each modality, i.e., using a shared two-layer sub-network to map I^s and T^s into a shared space so as to learn the modality-shared feature representations I^f and T^f, where the shared mapping function is denoted f_C(·; θ_C), θ_C are the parameters of the shared two-layer sub-network, and d_f = d_s.
In this embodiment, three-layer sub-networks consisting of three fully-connected layers with dimensions [1024, 512, 128] are used in the image modality and the text modality respectively to learn the modality-specific feature representations I^s = f_I(I; θ_I) and T^s = f_T(T; θ_T), where f_I(I; θ_I) and f_T(T; θ_T) are the mapping functions of the image modality and the text modality, and θ_I and θ_T are respectively the parameters of the three-layer sub-networks of the image modality and the text modality.
A shared two-layer sub-network consisting of two fully-connected layers with dimensions [128, 128] then maps I^s and T^s into a shared space to learn the modality-shared feature representations I^f and T^f, where the shared mapping function is denoted f_C(·; θ_C), θ_C are the parameters of the shared two-layer sub-network, and d_f = d_s.
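For concreteness, the following PyTorch sketch mirrors the sub-network dimensions given in this embodiment ([1024, 512, 128] modality-specific branches and a shared [128, 128] head). It is a minimal sketch under assumptions: the input dimensions d_i and d_t, the ReLU activations and all module names are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn as nn

def mlp(dims):
    """Stack fully-connected layers with ReLU activations between them."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class FeatureGenerator(nn.Module):
    def __init__(self, d_i=4096, d_t=300):
        super().__init__()
        # three-layer modality-specific sub-networks, dimensions [1024, 512, 128]
        self.f_image = mlp([d_i, 1024, 512, 128])
        self.f_text = mlp([d_t, 1024, 512, 128])
        # shared two-layer sub-network, dimensions [128, 128], applied to both modalities
        self.f_shared = mlp([128, 128, 128])

    def forward(self, img, txt):
        i_spec = self.f_image(img)        # I^s: modality-specific image features
        t_spec = self.f_text(txt)         # T^s: modality-specific text features
        i_shared = self.f_shared(i_spec)  # I^f: modality-shared image features
        t_shared = self.f_shared(t_spec)  # T^f: modality-shared text features
        return i_spec, t_spec, i_shared, t_shared
```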
In an exemplary embodiment, the step S03, shown in fig. 4, includes:
step S031, label prediction is carried out by using a forward single-layer sub-network as a classifier;
step S032, performing inter-modal and intra-modal semantic similarity modeling based on depth measurement learning;
step S033, distinguishing the modality specific feature representation and the corresponding modality shared feature representation, and calculating a total loss function of the generative model.
In an exemplary embodiment, the sub-step S031, whose flow chart is shown in fig. 5, includes:
step S0311, using a forward single-layer sub-network activated by Softmax as the classifier, so that when a generated feature representation of the image modality or of the text modality is input, the corresponding probability distribution over the semantic categories is output;
step S0312, constructing, from these probability distributions, the semantic discrimination loss function L_sd(υ), where υ represents the parameters of the classifier and U represents the number of instances in each mini-batch.
In this embodiment, in order to make the generated features semantically discriminative, a forward single-layer sub-network activated by Softmax is used as the classifier, so that when a generated feature representation of the image modality or of the text modality is input, the corresponding probability distribution over the semantic categories is output. From these probability distributions, this embodiment defines the semantic discrimination loss L_sd(υ) shown as equation (1), where υ represents the parameters of the classifier and U represents the number of instances in each mini-batch. In this embodiment, the dimension of the single-layer network is equal to the number of semantic categories.
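One plausible realization of this classifier and its loss is sketched below. Because equation (1) is only available as an image, the cross-entropy form used in semantic_discrimination_loss is an assumption consistent with a Softmax classifier over the semantic categories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifier(nn.Module):
    """Single fully-connected layer whose output dimension equals the number of categories."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        return F.softmax(self.fc(feats), dim=1)  # probability distribution over categories

def semantic_discrimination_loss(classifier, features, labels):
    """Assumed cross-entropy form of L_sd: `features` is a list of generated representations
    (e.g. [I^s, T^s, I^f, T^f] for one mini-batch); `labels` are one-hot category vectors."""
    loss = 0.0
    for f in features:
        probs = classifier(f).clamp_min(1e-12)
        loss = loss - (labels * probs.log()).sum(dim=1).mean()
    return loss / len(features)
```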
In an exemplary embodiment, whether samples are semantically related across modalities is judged as follows: a semantic relevance value is calculated from the number of semantically associated words and/or the similarity degree of the semantic keywords and/or the similarity degree of the semantic objects, and the samples are judged to be semantically related when the semantic relevance value is greater than a preset semantic relevance threshold.
The number of semantically associated words refers to either the number or the proportion of semantically associated texts and is denoted by the variable p.
The similarity degree of the semantic keywords is denoted by the variable q and is taken as either the ratio of the number of identical semantic keywords to the total number of semantic keywords or the influence coefficient of the identical semantic keywords on the semantics.
The similarity degree of the semantic objects is denoted by the variable w and is taken as either the ratio of the number of identical semantic objects to the total number of semantic objects or the influence coefficient of the identical semantic objects on the semantics.
The semantic relevance value, denoted by the variable z, is calculated from the number of semantically associated words and/or the similarity degree of the semantic keywords and/or the similarity degree of the semantic objects in any of the following ways: from p alone, from q alone, from w alone, from p together with q, from p together with w, from q together with w, or from p, q and w together; in every case z is positively correlated with the quantities used.
A1-A7 in Table A represent the different implementations for calculating the semantic relevance value, where the number of semantically associated words p, the similarity degree of the semantic keywords q and the similarity degree of the semantic objects w used in Table A are obtained with the formulas of the corresponding implementations.
Table A. Different implementations for calculating the semantic relevance value
(The formulas of implementations A1-A7 appear in the original document as images and are not reproduced here.)
In this embodiment, the preset semantic relevance threshold is Z = 0.7; the semantic relevance value z of a pair of modality samples is calculated according to any one of the implementations in Table A (e.g., A7), and if z > Z the samples are judged to be semantically related across modalities.
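Because the concrete formulas A1-A7 of Table A are only available as images, the sketch below illustrates the general scheme with an assumed positively correlated combination of p, q and w and the threshold Z = 0.7 from this embodiment; the weights and the combination rule are placeholders, not the patent's formulas.

```python
def semantic_relevance(p, q, w, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed illustrative combination: a weighted average of the three indicators,
    each normalized to [0, 1], so that z grows with p, q and w (positive correlation)."""
    a, b, c = weights
    return a * p + b * q + c * w

def semantically_related(p, q, w, threshold=0.7):
    """Judge cross-modal semantic relatedness: related iff z > Z, with Z = 0.7 as in the embodiment."""
    return semantic_relevance(p, q, w) > threshold

# usage: two samples sharing most keywords and objects are judged related
print(semantically_related(p=0.8, q=0.9, w=0.7))  # True
```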
In an exemplary embodiment, the sub-step S032, shown in fig. 6, includes the following steps:
step S0321, computing the Euclidean distance between features, i.e., for each pair of instances o_u and o_v, denoting the distance between their features d_c(u, v);
step S0322, establishing the contrastive loss function L_c, where h(x) = max(0, x) is the hinge loss function, E = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have the same semantic label, D = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have different semantic labels, |E| and |D| represent the sizes of the sets E and D respectively, and τ is a positive threshold.
In this embodiment, based on the idea of deep metric learning, the similarity of features is measured by computing the Euclidean distance between them, with the requirement that the similarity of features with the same semantic category is enhanced and the similarity of features with different semantic categories is reduced. For each pair of instances o_u and o_v, the distance between their features, d_c(u, v), is defined so that it not only describes the intra-modality distances between the modality-specific feature representations but also characterizes the inter-modality distances between the modality-shared feature representations.
The contrastive loss function is established as shown in equation (3), where h(x) = max(0, x) is the hinge loss function, E = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have the same semantic label, D = {(u, v)} is the set of paired indices of feature pairs in each mini-batch that have different semantic labels, |E| and |D| represent the sizes of the sets E and D respectively, and τ is a positive threshold. In this embodiment, the hyper-parameter τ is tuned with a grid-search strategy over the range [1, 10] with a step of 1, and is specifically set to τ = 6.
The contrastive loss function of equation (3) imposes constraints on the distance d_c(u, v) such that the feature distance of feature pairs within a class is smaller than the threshold τ and the feature distance of feature pairs between classes is larger than τ, thereby facilitating the learning of discriminative features.
In an exemplary embodiment, the step S033, whose flowchart is shown in fig. 7, includes:
step S0331, using a large-margin loss function L_lm to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, so as to effectively learn the complementary information in the data of different modalities, where h(x) = max(0, x) and ζ is a positive threshold;
step S0332, in the large-margin loss function L_lm, applying the threshold ζ as a constraint on the distance between each modality-specific feature representation and the corresponding modality-shared feature representation, requiring these distances to be greater than ζ and thereby distinguishing the modality-specific feature representations from the modality-shared feature representations;
step S0333, combining the semantic discrimination loss function, the contrastive loss function and the large-margin loss function to obtain the total loss function of the generative model, L_gen = L_sd + α·L_c + β·L_lm, where α and β are balance factors that balance the terms L_sd, L_c and L_lm.
In this embodiment, for a particular image or text there should be a difference between its modality-specific feature representation and the corresponding modality-shared feature representation. The large-margin loss shown as equation (4) is therefore used to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations, so as to effectively learn the complementary information in the data of different modalities, where h(x) = max(0, x) and ζ is a positive threshold. In this embodiment, the hyper-parameter ζ is tuned with a grid-search strategy over the range [1, 10] with a step of 1, and is specifically set to ζ = 5. The two constrained distances, one for the image modality and one for the text modality, are each defined between the modality-specific feature representation and the corresponding modality-shared feature representation.
The large-margin loss of equation (4) applies the threshold ζ to these distances, requiring them to be greater than ζ; it thereby acts as a feature-component discriminator, i.e., it distinguishes the modality-specific feature representations from the modality-shared feature representations.
Combining the semantic discrimination loss (equation 1), the contrastive loss (equation 3) and the large-margin loss (equation 4) yields the total loss function of the generative model, L_gen = L_sd + α·L_c + β·L_lm, where α and β are balance factors that balance the three terms L_sd, L_c and L_lm. In this embodiment, the balance factors α and β are tuned with a grid-search strategy over the range [0.01, 100] with a multiplicative step of 10, and are specifically set to α = 0.1 and β = 0.1.
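The sketch below gives one plausible form of the large-margin term and of the combined generator objective, using ζ = 5 and α = β = 0.1 from this embodiment. The exact equation (4) and the constrained distances are only available as images, so the hinge form and the Euclidean distances used here are assumptions.

```python
import torch

def large_margin_loss(i_spec, i_shared, t_spec, t_shared, zeta=5.0):
    """Assumed form of L_lm: hinge penalties that push the distance between each
    modality-specific representation and its modality-shared counterpart above zeta."""
    d_img = torch.norm(i_spec - i_shared, p=2, dim=1)  # per-sample image-modality distance
    d_txt = torch.norm(t_spec - t_shared, p=2, dim=1)  # per-sample text-modality distance
    return (torch.relu(zeta - d_img) + torch.relu(zeta - d_txt)).mean()

def generator_loss(l_sd, l_c, l_lm, alpha=0.1, beta=0.1):
    """Total generative-model loss: L_gen = L_sd + alpha * L_c + beta * L_lm."""
    return l_sd + alpha * l_c + beta * l_lm
```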
In an exemplary embodiment, the step S04, shown in fig. 8, includes:
step S041, constructing a modality classifier using a sub-network with two layers and using it as the adversary;
step S042, assigning an independent modality label vector to each item in each instance;
step S043, constructing the adversarial loss function L_adv(θ_A) from the modality probabilities that the modality classifier predicts for the generated modality-shared feature representations of the image modality and the text modality, where θ_A represents the parameters of the modality classifier and g_u is the true modality label of each item i_u or t_u in instance o_u.
In this embodiment, to reduce the difference between modalities, a modality classifier is constructed with a two-layer sub-network and used as the adversary; the goal of the constructed modality classifier is to identify the modality information in the unknown modality-shared feature representations. In this embodiment, the modality classifier is built from a two-layer network with dimensions [64, 2] activated by the ReLU function, with a Softmax activation function after the last layer.
An independent modality label vector is assigned to each item in each instance to indicate whether it belongs to the image modality or the text modality.
The adversarial loss function is constructed as shown in equation (6), where the modality probabilities are those predicted by the classifier for the generated modality-shared feature representations of the image modality and the text modality, θ_A represents the parameters of the modality classifier, and g_u is the true modality label of each item (i_u or t_u) in instance o_u.
Using the adversarial loss function L_adv(θ_A) of equation (6), the heterogeneous difference between the modalities can be effectively reduced.
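A modality classifier consistent with the [64, 2] two-layer description, together with an assumed cross-entropy adversarial loss over modality labels, is sketched below; equation (6) is only available as an image, so this exact form (image label 0, text label 1) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityClassifier(nn.Module):
    """Two-layer adversary with dimensions [64, 2], ReLU in between, Softmax on the output."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, shared_feats):
        return F.softmax(self.net(shared_feats), dim=1)  # modality probability of each feature

def adversarial_loss(classifier, img_shared, txt_shared):
    """Assumed cross-entropy form of L_adv: image-modality items carry label 0, text items label 1."""
    probs_img = classifier(img_shared).clamp_min(1e-12)
    probs_txt = classifier(txt_shared).clamp_min(1e-12)
    loss_img = -probs_img[:, 0].log().mean()   # true modality label of i_u is "image"
    loss_txt = -probs_txt[:, 1].log().mean()   # true modality label of t_u is "text"
    return loss_img + loss_txt
```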
In an exemplary embodiment, the step S05, as shown in fig. 9, includes:
step S051, obtaining the optimal feature representations by jointly minimizing the loss functions of the generative model and the identification model, i.e., optimizing two concurrent sub-processes with a min-max game, one sub-process updating the parameters of the generative model and the other updating the parameters of the modality classifier;
step S052, denoting the feature vector of an image-modality query sample as i_q and the feature vector of a text-modality query sample as t_q, where the image-modality retrieval sample set and the text-modality retrieval sample set each consist of a given number of samples;
step S053, computing the modality-specific feature representation and the modality-shared feature representation of the image-modality query sample, and the modality-specific feature representation and the modality-shared feature representation of the text-modality query sample;
step S054, computing the modality-specific feature representations and the modality-shared feature representations of the samples in the image-modality retrieval sample set and of the samples in the text-modality retrieval sample set.
In this embodiment, the optimal feature representations are obtained by jointly minimizing the loss functions of the generative model and the identification model. Since the optimization objectives of the generative model and the identification model are opposite, the method adopts a min-max game to optimize the two concurrent sub-processes expressed by equations (7) and (8): one sub-process minimizes the total loss of the generative model with respect to the parameters of the generative model, and the other minimizes the adversarial loss with respect to the parameters of the modality classifier.
The min-max game is implemented with a stochastic gradient descent algorithm. For better min-max optimization, this embodiment adds a Gradient Reversal Layer (GRL) before the first layer of the modality classifier, and sets the batch size of the data set to 128 during training.
The query samples i_q and t_q and all samples in the image-modality and text-modality retrieval sample sets are then fed through the trained sub-networks to obtain their modality-specific feature representations and modality-shared feature representations, which are used in the subsequent distance computation and retrieval steps.
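As an illustration of the min-max training with a gradient reversal layer, the sketch below (which reuses the helper functions from the earlier sketches) shows a GRL and one training step. Equations (7) and (8) are only available as images, so folding both sub-processes into a single backward pass through the GRL, with SGD and batch size 128, is an assumed realization of the description.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient Reversal Layer: identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grl(x):
    return GradReverse.apply(x)

def training_step(generator, classifier, modality_clf, optimizer, img, txt, labels):
    """One assumed min-max step: the GRL placed before the modality classifier lets a single
    backward pass minimize L_adv w.r.t. the classifier while maximizing it w.r.t. the generator."""
    i_spec, t_spec, i_shared, t_shared = generator(img, txt)
    l_sd = semantic_discrimination_loss(classifier, [i_spec, t_spec, i_shared, t_shared], labels)
    l_c = contrastive_loss(torch.cat([i_spec, t_spec]), torch.cat([i_shared, t_shared]),
                           torch.cat([labels, labels]))
    l_lm = large_margin_loss(i_spec, i_shared, t_spec, t_shared)
    l_adv = adversarial_loss(modality_clf, grl(i_shared), grl(t_shared))
    loss = generator_loss(l_sd, l_c, l_lm) + l_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```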
in an exemplary embodiment, the step S06, shown in fig. 10, includes:
step S061, query sample for image modality
Figure BDA0003762121140000235
Computing query samples for image modalities using distance computation formulas
Figure BDA0003762121140000236
Retrieving each sample in a sample set to a text modality
Figure BDA0003762121140000237
Is a distance of
Figure BDA0003762121140000238
Wherein the content of the first and second substances,
Figure BDA0003762121140000239
representing query samples
Figure BDA00037621211400002310
The combined modality-specific feature representation and modality-shared feature representation of
Figure BDA00037621211400002311
The euclidean distance of the combined modality-specific feature representation and modality-sharing feature representation of (a);
step S062, query sample for text mode
Figure BDA00037621211400002312
Computing query samples for text modalities using distance computation formulas
Figure BDA00037621211400002313
Retrieving each sample in a set of samples to an image modality
Figure BDA00037621211400002314
Of (2) is
Figure BDA00037621211400002315
Wherein the content of the first and second substances,
Figure BDA00037621211400002316
representing query samples
Figure BDA00037621211400002317
By a modality-specific feature representation and a modality-shared feature representation of the user
Figure BDA00037621211400002318
The modality-specific feature representation and the modality-sharing feature representation of (2) are combined to form a euclidean distance of the feature.
In this embodiment, query samples for image modalities
Figure BDA00037621211400002319
Computing query samples for image modalities using distance calculation formulas
Figure BDA00037621211400002320
Retrieving each sample in a sample set to a text modality
Figure BDA00037621211400002321
Is a distance of
Figure BDA00037621211400002322
Wherein the content of the first and second substances,
Figure BDA00037621211400002323
representing query samples
Figure BDA00037621211400002324
The combined modality-specific feature representation and modality-shared feature representation of
Figure BDA00037621211400002325
The modality-specific feature representation and the modality-sharing feature representation of (2) represent the euclidean distance of the combined features. Query sample for text modality
Figure BDA00037621211400002326
Computing query samples for text modalities using distance computation formulas
Figure BDA00037621211400002327
Retrieving each sample in a set of samples to an image modality
Figure BDA00037621211400002328
Of (2) is
Figure BDA00037621211400002329
Wherein the content of the first and second substances,
Figure BDA00037621211400002330
representing query samples
Figure BDA00037621211400002331
Modality-specific feature representation and modality sharing features ofCharacterizing features and samples after union
Figure BDA00037621211400002332
The modality-specific feature representation and the modality-sharing feature representation of (2) are combined to form a euclidean distance of the feature.
In an exemplary embodiment, the step S07, shown in fig. 11, includes:
step S071, sorting the distances computed from the image-modality query sample in ascending order, and taking the samples corresponding to the first K smallest distances in the text-modality retrieval sample set as the retrieval result, where K is a preset query parameter;
step S072, sorting the distances computed from the text-modality query sample in ascending order, and taking the samples corresponding to the first K smallest distances in the image-modality retrieval sample set as the retrieval result.
In this embodiment, the distances computed from the image-modality query sample are sorted in ascending order and the samples corresponding to the first K smallest distances in the text-modality retrieval sample set are returned as the retrieval result; the distances computed from the text-modality query sample are sorted in ascending order and the samples corresponding to the first K smallest distances in the image-modality retrieval sample set are returned as the retrieval result.
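The retrieval step described above is sketched below: Euclidean distances over the combined specific-plus-shared representations are sorted in ascending order and the K closest samples are returned. Concatenation is an assumed realization of "combining" the two representations.

```python
import torch

def combined(spec, shared):
    """Assumed 'combined' feature: concatenation of the modality-specific and modality-shared parts."""
    return torch.cat([spec, shared], dim=-1)

def retrieve_top_k(query_spec, query_shared, db_spec, db_shared, k=10):
    """Rank the retrieval set by Euclidean distance to the query and return the K closest indices."""
    q = combined(query_spec, query_shared)   # (d,) query of one modality
    db = combined(db_spec, db_shared)        # (M, d) retrieval set of the other modality
    dists = torch.norm(db - q, p=2, dim=1)   # distance from the query to every sample
    return torch.argsort(dists)[:k]          # indices of the K smallest distances
```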
A cross-modal retrieval system based on semantic identification according to an embodiment of the present invention is shown in fig. 12, and includes:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.
Of course, those skilled in the art should realize that the above embodiments are only used to illustrate the present invention and not to limit it, and that changes and modifications to the above embodiments fall within the scope of the present invention.

Claims (10)

1. A cross-modal retrieval method based on semantic identification is characterized by comprising the following steps:
acquiring an image feature space, a text feature space and a semantic category label;
establishing modality specific feature representation and shared feature representation of an image modality and a text modality;
establishing a generation model according to the semantic similarity between the modalities and in the modalities;
constructing an identification model of an image mode and a text mode;
training a network according to a countermeasure mechanism between the generation model and the identification model to obtain the characteristics of the query sample and the characteristics of the sample in the retrieval sample set;
calculating the Euclidean distance from the query sample to each sample in the retrieval sample set;
a query sample is retrieved using a cross-modality retriever.
2. The semantic identification-based cross-modal retrieval method of claim 1, wherein the obtaining of the image feature space, the text feature space and the semantic category labels comprises the steps of:
acquiring the N image-modality sample features of the training data set as I = [i_1, ..., i_n, ..., i_N] and the N text-modality sample features as T = [t_1, ..., t_n, ..., t_N], wherein i_n is the n-th sample feature in the image-modality sample feature data set I and t_n is the n-th sample feature in the text-modality sample feature data set T;
determining the instance set of image-modality and text-modality features O = {o_n}_{n=1}^N, wherein each instance o_n = (i_n, t_n) comprises an image-modality feature vector i_n ∈ R^{d_i} and a text-modality feature vector t_n ∈ R^{d_t}, d_i and d_t respectively represent the feature dimensions of the image modality and the text modality, and d_i ≠ d_t;
determining the image-modality feature data set I = {i_n}_{n=1}^N and the text-modality feature data set T = {t_n}_{n=1}^N;
defining the semantic category label l_n = [l_{n1}, ..., l_{nc}, ..., l_{nC}]^T to represent the category of the n-th sample, wherein C represents the total number of sample categories, l_{nc} = 1 if the n-th sample belongs to category c, and l_{nc} = 0 otherwise.
3. The semantic identification-based cross-modal retrieval method of claim 2, wherein the establishing of the modality-specific feature representations and the shared feature representations of the image modality and the text modality comprises the steps of:
learning the modality-specific feature representations of the image modality and the text modality with two feed-forward sub-networks, i.e., using a three-layer sub-network in each of the image modality and the text modality to learn the modality-specific feature representations I^s = f_I(I; θ_I) and T^s = f_T(T; θ_T), wherein f_I(I; θ_I) and f_T(T; θ_T) are the mapping functions of the image modality and the text modality, and θ_I and θ_T are respectively the parameters of the three-layer sub-networks of the image modality and the text modality;
using a common sub-network to learn the modality-shared feature representation of each modality, i.e., using a shared two-layer sub-network to map I^s and T^s into a shared space so as to learn the modality-shared feature representations I^f and T^f, wherein the shared mapping function is denoted f_C(·; θ_C), θ_C are the parameters of the shared two-layer sub-network, and d_f = d_s.
4. The semantic identification-based cross-modal retrieval method of claim 3, wherein the establishing of the generative model according to inter-modality and intra-modality semantic similarity comprises the steps of:
using a forward single-layer subnetwork as a classifier to perform label prediction;
modeling inter-modality and intra-modality semantic similarity based on deep metric learning;
distinguishing the modality-specific feature representations from the corresponding modality-shared feature representations, and calculating the total loss function of the generative model.
5. The semantic discrimination based cross-modal retrieval method of claim 4 wherein the label prediction using a forward single-layer subnetwork as a classifier comprises the steps of:
using a forward single-layer sub-network activated by Softmax as the classifier, so that when the input is the modality-shared feature representation of an image instance or of a text instance, the corresponding semantic class probability distribution is output;
constructing, based on these probability distributions, the semantic discrimination loss function $L_{sd}(\upsilon)$, where $\upsilon$ represents the parameter of the classifier and $U$ represents the number of instances in each mini-batch.
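The exact expression for the semantic discrimination loss $L_{sd}$ is given only as an image in the source, but losses of this kind are commonly the cross-entropy between the classifier's Softmax output and the one-hot semantic label, averaged over the U instances of a mini-batch and summed over both modalities. The sketch below implements that assumed form; it is not necessarily the patent's exact formula.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def semantic_discrimination_loss(img_logits, txt_logits, labels):
    """Assumed cross-entropy form of L_sd over a mini-batch of U instances.

    img_logits, txt_logits: (U, C) outputs of the single-layer classifier for the
                            modality-shared image and text features.
    labels:                 (U, C) one-hot semantic class labels l_n.
    """
    U = labels.shape[0]
    p_img = softmax(img_logits)
    p_txt = softmax(txt_logits)
    eps = 1e-12  # avoid log(0)
    return -np.sum(labels * (np.log(p_img + eps) + np.log(p_txt + eps))) / U

# Toy usage with random logits and labels.
rng = np.random.default_rng(1)
labels = np.eye(3)[rng.integers(0, 3, size=8)]
print(semantic_discrimination_loss(rng.normal(size=(8, 3)), rng.normal(size=(8, 3)), labels))
```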
6. The cross-modal search method based on semantic identification according to claim 5, wherein the depth metric learning based semantic similarity modeling between and within modalities comprises the steps of:
computing the Euclidean distance between features, i.e., for each pair of instances $o_u$ and $o_v$, defining the distance between their features as the Euclidean distance;
establishing the contrastive loss function $L_c$, where $h(x) = \max(0, x)$ is the hinge loss function, $E = \{(u, v)\}$ is the pairwise index set of feature pairs having the same semantic label in each mini-batch, $D = \{(u, v)\}$ is the pairwise index set of feature pairs having different semantic labels in each mini-batch, $|E|$ and $|D|$ represent the sizes of the set $E$ and the set $D$ respectively, and $\tau$ is a positive threshold.
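Claim 6 names the ingredients of the contrastive loss $L_c$ (Euclidean feature distance, the hinge $h(x) = \max(0, x)$, the similar-pair set E, the dissimilar-pair set D and the positive threshold τ) but the formula itself appears only as an image. A common contrastive loss built from exactly these ingredients pulls same-label pairs together and pushes different-label pairs apart beyond the margin τ; the sketch below shows that assumed form, not necessarily the patent's exact expression.

```python
import numpy as np

def contrastive_loss(features, labels, tau=1.0):
    """Assumed form of L_c: mean distance over similar pairs E plus mean hinge(tau - d)
    over dissimilar pairs D.

    features: (U, d) feature vectors of one mini-batch.
    labels:   (U,) integer semantic class labels.
    tau:      positive margin threshold.
    """
    U = len(labels)
    sim_terms, dissim_terms = [], []
    for u in range(U):
        for v in range(u + 1, U):
            d_uv = np.linalg.norm(features[u] - features[v])   # Euclidean distance
            if labels[u] == labels[v]:
                sim_terms.append(d_uv)                          # (u, v) in E: pull together
            else:
                dissim_terms.append(max(0.0, tau - d_uv))       # (u, v) in D: push beyond tau
    loss = 0.0
    if sim_terms:
        loss += np.mean(sim_terms)       # 1/|E| * sum over E
    if dissim_terms:
        loss += np.mean(dissim_terms)    # 1/|D| * sum over D
    return loss

# Toy usage
rng = np.random.default_rng(2)
print(contrastive_loss(rng.normal(size=(6, 16)), np.array([0, 0, 1, 1, 2, 2]), tau=1.0))
```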
7. The semantic discrimination based cross-modality retrieval method according to claim 6, wherein the distinguishing of modality specific feature representations and corresponding modality shared feature representations and calculating of the overall loss function of the generative model comprises the steps of:
using a large-margin loss function $L_{lm}$ to distinguish the modality-specific feature representations from the corresponding modality-shared feature representations so as to effectively learn the complementary information in the data of different modalities, where $h(x) = \max(0, x)$ and $\zeta$ is a positive threshold;
the large-margin loss function $L_{lm}$ applies constraints with the threshold $\zeta$ to the distances between the modality-specific feature representations and the corresponding modality-shared feature representations, requiring these distances to be greater than $\zeta$, thereby distinguishing the modality-specific feature representations from the modality-shared feature representations;
obtaining the total loss function of the generative model by combining the semantic discrimination loss function, the contrastive loss function and the large-margin loss function, where $\alpha$ and $\beta$ are balance factors for balancing $L_{sd}$, $L_c$ and $L_{lm}$ in the total loss.
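The large-margin loss $L_{lm}$ and the total generator loss of claim 7 are likewise given only as images. Reading the claim, $L_{lm}$ penalizes a modality-specific feature and its corresponding modality-shared feature whenever their distance is below the threshold ζ, and the total loss combines $L_{sd}$, $L_c$ and $L_{lm}$ with balance factors α and β. The sketch below encodes that reading; the exact weighting in the patent may differ.

```python
import numpy as np

def large_margin_loss(specific_feats, shared_feats, zeta=1.0):
    """Assumed form of L_lm: hinge(zeta - d), where d is the distance between a
    modality-specific feature and its corresponding modality-shared feature
    (well-defined because claim 3 requires d_f = d_s), so the two representations
    are pushed to differ by at least zeta."""
    dists = np.linalg.norm(specific_feats - shared_feats, axis=1)
    return np.mean(np.maximum(0.0, zeta - dists))

def total_generator_loss(L_sd, L_c, L_lm, alpha=0.1, beta=0.1):
    """Assumed combination of the three generator losses with balance factors alpha and beta."""
    return L_sd + alpha * L_c + beta * L_lm

# Toy usage
rng = np.random.default_rng(3)
spec = rng.normal(size=(8, 32))
shar = rng.normal(size=(8, 32))
L_lm = large_margin_loss(spec, shar, zeta=1.0)
print(total_generator_loss(L_sd=1.2, L_c=0.8, L_lm=L_lm))
```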
8. The semantic identification-based cross-modal retrieval method of claim 7, wherein the building of the identification model of the image modality and the text modality comprises the following steps:
constructing a modality classifier using a sub-network having two layers, and using the modality classifier as the adversary;
assigning an independent modality label vector to each item in each instance;
constructing the adversarial loss function, wherein the modality classifier outputs the modality probability of each generated feature representation of the image modality and of the text modality, $\theta_A$ represents the parameters of the modality classifier, and $g_u$ is the true modality label of each item $i_u$ or $t_u$ in instance $o_u$.
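The adversarial loss of claim 8 also appears only as an image. The described setup, a two-layer modality classifier acting as the adversary with a true modality label $g_u$ for every feature, is typically trained with a cross-entropy over the predicted modality probabilities; the sketch below uses that assumed form, and the class name ModalityClassifier is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityClassifier(nn.Module):
    """Two-layer sub-network predicting which modality a feature representation came from."""

    def __init__(self, d_s=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_s, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),   # two modalities: image vs. text
        )

    def forward(self, feats):
        return self.net(feats)  # modality logits

def adversarial_loss(classifier, img_feats, txt_feats):
    """Assumed cross-entropy form of the adversarial loss: the classifier should
    predict modality label 0 for image features and 1 for text features."""
    img_logits = classifier(img_feats)
    txt_logits = classifier(txt_feats)
    img_labels = torch.zeros(img_feats.size(0), dtype=torch.long)
    txt_labels = torch.ones(txt_feats.size(0), dtype=torch.long)
    return F.cross_entropy(img_logits, img_labels) + F.cross_entropy(txt_logits, txt_labels)
```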
9. The cross-modal retrieval method based on semantic identification according to claim 8, wherein the training of the network according to the adversarial mechanism between the generative model and the identification model to obtain the features of the query sample and the features of the samples in the retrieval sample set comprises the steps of:
obtaining the optimal feature representations by jointly minimizing the loss functions of the generative model and the identification model, i.e., optimizing the two concurrent sub-processes of updating the generative model and updating the identification model through a min-max game;
denoting the feature vector of a query sample of the image modality, the feature vector of a query sample of the text modality, the features of the samples in the image modality retrieval sample set, and the features of the samples in the text modality retrieval sample set, together with the number of samples in the retrieval sample set;
obtaining, through the trained sub-networks, the modality-specific feature representation and the modality-shared feature representation of the image modality query sample, respectively;
obtaining the modality-specific feature representation and the modality-shared feature representation of the text modality query sample, respectively;
obtaining the modality-specific feature representations and the modality-shared feature representations of the samples in the image modality retrieval sample set, respectively;
obtaining the modality-specific feature representations and the modality-shared feature representations of the samples in the text modality retrieval sample set, respectively.
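As a rough illustration of the min-max training in claim 9, the loop below alternates an update of the modality classifier (the identification model) with an update of the feature-generating sub-networks (the generative model), the latter minimizing its own loss while trying to fool the classifier via a negated adversarial term. It assumes torch-compatible versions of the loss sketches above and reuses the illustrative GeneratorNetworks and ModalityClassifier classes; it is one common way to realize the adversarial game, not the patent's exact optimization procedure.

```python
import torch

def train_step(gen, disc, gen_opt, disc_opt, images, texts, labels,
               generator_loss_fn, adversarial_loss_fn):
    """One alternating min-max update: first the identification model, then the generative model."""
    # 1) Update the modality classifier (identification model) on detached features,
    #    so gradients do not flow back into the generator.
    _, _, img_shared, txt_shared = gen(images, texts)
    disc_loss = adversarial_loss_fn(disc, img_shared.detach(), txt_shared.detach())
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # 2) Update the generative model: minimize its own total loss while maximizing
    #    the classifier's loss (negated adversarial term).
    img_spec, txt_spec, img_shared, txt_shared = gen(images, texts)
    g_loss = generator_loss_fn(img_spec, txt_spec, img_shared, txt_shared, labels) \
             - adversarial_loss_fn(disc, img_shared, txt_shared)
    gen_opt.zero_grad()
    g_loss.backward()
    gen_opt.step()
    return disc_loss.item(), g_loss.item()
```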
10. a cross-modal retrieval system based on semantic discrimination, comprising:
a processor;
a memory;
and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the method of any of claims 1-9.
CN202210875146.5A 2022-07-25 2022-07-25 Cross-modal retrieval method and system based on semantic identification Pending CN115309930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210875146.5A CN115309930A (en) 2022-07-25 2022-07-25 Cross-modal retrieval method and system based on semantic identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210875146.5A CN115309930A (en) 2022-07-25 2022-07-25 Cross-modal retrieval method and system based on semantic identification

Publications (1)

Publication Number Publication Date
CN115309930A true CN115309930A (en) 2022-11-08

Family

ID=83858050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210875146.5A Pending CN115309930A (en) 2022-07-25 2022-07-25 Cross-modal retrieval method and system based on semantic identification

Country Status (1)

Country Link
CN (1) CN115309930A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468037A (en) * 2023-03-17 2023-07-21 北京深维智讯科技有限公司 NLP-based data processing method and system
CN116821408A (en) * 2023-08-29 2023-09-29 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system
CN116821408B (en) * 2023-08-29 2023-12-01 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system
CN117112829A (en) * 2023-10-24 2023-11-24 吉林大学 Medical data cross-modal retrieval method and device and related equipment
CN117112829B (en) * 2023-10-24 2024-02-02 吉林大学 Medical data cross-modal retrieval method and device and related equipment

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
JP5749279B2 (en) Join embedding for item association
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN108038492A (en) A kind of perceptual term vector and sensibility classification method based on deep learning
CN109858015B (en) Semantic similarity calculation method and device based on CTW (computational cost) and KM (K-value) algorithm
CN108446334B (en) Image retrieval method based on content for unsupervised countermeasure training
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN115221325A (en) Text classification method based on label semantic learning and attention adjustment mechanism
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
Huang et al. Large-scale semantic web image retrieval using bimodal deep learning techniques
Dong et al. Cross-media similarity evaluation for web image retrieval in the wild
CN112115253A (en) Depth text ordering method based on multi-view attention mechanism
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
TWI452477B (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
CN116955650A (en) Information retrieval optimization method and system based on small sample knowledge graph completion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination