CN114048295A - Cross-modal retrieval method and system for data processing - Google Patents

Cross-modal retrieval method and system for data processing

Info

Publication number
CN114048295A
Authority
CN
China
Prior art keywords
network
subspace
public
representation
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111128176.1A
Other languages
Chinese (zh)
Inventor
冯爱民
王鸿飞
刘学军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202111128176.1A priority Critical patent/CN114048295A/en
Publication of CN114048295A publication Critical patent/CN114048295A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a cross-modal retrieval method for data processing. It relates to the field of image-text data processing and can reduce information loss while achieving strong cross-modal similarity. The invention comprises the following steps: sample data to be processed is input into a modality-specific preprocessing module for feature extraction, yielding feature vector information. The resulting feature vector information is mapped to a common subspace through modality-specific sub-networks. The common representations in the common subspace are mapped to a semantic subspace through a first network. A modality discriminator is constructed over the common representations in the common subspace through a second network, and the constructed discriminator is used to distinguish the original modality of each common representation. The invention is suitable for mutual retrieval between the image and text domains.

Description

Cross-modal retrieval method and system for data processing
Technical Field
The invention relates to the field of image-text data processing, in particular to a cross-modal retrieval method and a cross-modal retrieval system for data processing.
Background
Cross-modal retrieval refers to using one type of data as a query to retrieve related data of another type. Its flexibility in retrieving across different modalities (such as images and text) has attracted wide attention in both academia and industry. Computing correlations between multimodal data is the core goal of cross-modal retrieval. However, the inherent heterogeneity between modalities makes their representations incompatible, so the key to implementing cross-modal retrieval is how to bridge the heterogeneity gap between different modalities.
A common method to eliminate cross-modal differences is representation learning: by learning modality-specific transfer functions, data from different modalities are transformed into a common subspace in which similarity can be measured directly. However, existing schemes usually focus on only part of the information in the data set while learning these transformations, and their objective functions incur information loss to different degrees, which limits model performance.
Therefore, how to reduce information loss while maintaining strong cross-modal similarity remains a problem to be studied.
Disclosure of Invention
Embodiments of the present invention provide a cross-modal retrieval method and system for data processing, which can reduce information loss while maintaining strong cross-modal similarity.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in one aspect, a cross-modal retrieval method for data processing is provided, including:
and inputting the unprocessed images and texts serving as input samples into respective modality-specific preprocessing modules for feature extraction to respectively obtain original high-dimensional image feature vectors and text feature vectors.
The obtained original feature vectors of the two modes are nonlinearly mapped to a common subspace through respective mode-specific sub-networks, a cross-mode information aggregation constraint is provided based on the common subspace, and global information and fine-grained information in a data set are aggregated under the premise that absolute distance and relative distance are considered at the same time.
The common representation in the common subspace is further mapped to a semantic subspace through a first network, and a semantic constraint is proposed based on potential associations between vector representations in the semantic subspace and sample labels.
And further constructing a mode discriminator by the public representation in the public subspace through a second network, and proposing a mode constraint based on the mode discriminator to distinguish the original modes of each public representation, wherein the mode constraint is opposite to the optimization target of the cross-mode information aggregation constraint, and the mode constraint and the cross-mode information aggregation constraint are mutually confronted by means of mode characteristic information to introduce confrontation learning for the model.
In another aspect, a cross-modal retrieval system for data processing is provided, comprising:
a preprocessing module, used for inputting sample data to be processed into a modality-specific preprocessing module and performing feature extraction to obtain feature vector information, wherein the sample data to be processed comprises sample pairs consisting of image data and text data, and the obtained feature vector information comprises image feature vectors and text feature vectors;
a processing module, used for mapping the obtained feature vector information to a common subspace through modality-specific sub-networks, wherein a cross-modal information aggregation constraint is obtained on the basis of the common subspace; mapping the common representations in the common subspace through a first network to obtain a semantic subspace, wherein potential associations are established between the vector representations in the semantic subspace and the sample labels and correspond to a semantic constraint; constructing a modality discriminator over the common representations in the common subspace through a second network and using it to distinguish the original modality of each common representation; and storing the results of distinguishing the original modality of each common representation in a database module;
the database module, used for storing the modality results output by the processing module;
and a query feedback module, used for receiving a query item sent by a terminal device, converting the query item into a common representation, querying the modality results stored in the database to obtain the common representation that is from the other modality and most similar to the converted representation, and feeding the query result back to the terminal device. The invention adopts dual constraints over two subspaces together with adversarial learning to minimize information loss in the cross-modal process and to generate common representations with stronger cross-modal similarity and semantic discriminability, thereby reducing information loss while achieving strong cross-modal similarity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a possible implementation manner of a cross-modal search model according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a possible implementation manner of the first model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a possible implementation manner of the second model according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a method according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only, serve to explain the present invention, and are not to be construed as limiting the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The embodiment of the invention provides a cross-modal retrieval method for data processing, in particular an improvement of cross-modal retrieval technology. The main design idea is as follows: as shown in Fig. 1, dual constraints over two subspaces and adversarial learning are adopted to minimize the information loss in the cross-modal process and to generate common representations with stronger cross-modal similarity and semantic discriminability. The main method flow, shown in Figs. 1 and 4, comprises the following steps:
S1: the sample data to be processed is input into a modality-specific preprocessing module for feature extraction, yielding feature vector information.
The modality-specific preprocessing module adopts a different preprocessing model for each modality. Data of the image modality is preprocessed with a VGG network, and data of the text modality is preprocessed with a bag-of-words (BoW) model; see Fig. 2. The VGG network is a deep neural network designed specifically for extracting feature information from images and performs very well; BoW is likewise a common model dedicated to text.
In practical applications, the preprocessing module can be implemented as a program. As shown in Fig. 2, different models are used to extract features from the original pictures and texts according to their modalities: the VGG network, which is widely used and well accepted in the artificial intelligence field, serves as the preprocessing module for the image modality, and the BoW model serves as the preprocessing module for the text modality.
The sample data to be processed comprises sample pairs consisting of image data and text data, and the obtained feature vector information comprises image feature vectors and text feature vectors. For example, unprocessed images and texts can be input as sample pairs into their respective modality-specific preprocessing modules for feature extraction, yielding the original high-dimensional image feature vectors and text feature vectors. It should be noted that there is no strict standard for "high-dimensional": in this embodiment, high-dimensional generally means thousands of dimensions and low-dimensional generally means hundreds of dimensions, so the terms are relative. The outputs of both BoW and VGG have three to four thousand dimensions, while the common subspace has a few hundred dimensions.
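As an illustration only, a minimal preprocessing sketch along these lines might use a pretrained VGG-19 truncated at the fc7 layer and a bag-of-words vectorizer. The library choices (torchvision >= 0.13, scikit-learn), the 4000-word vocabulary and the file paths are assumptions, not part of the patented method.

```python
# Hedged sketch of the modality-specific preprocessing described above:
# 4096-dim VGG-19 fc7 features for images, bag-of-words counts for text.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.feature_extraction.text import CountVectorizer
from PIL import Image

# Image branch: pretrained VGG-19 truncated after the fc7 layer (4096-dim output).
vgg = models.vgg19(weights="IMAGENET1K_V1").eval()
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:5])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg(x).squeeze(0)                      # 4096-dim image feature vector

# Text branch: bag-of-words over the training corpus (a few thousand dimensions).
corpus = ["a cat sitting on a mat", "an airplane on the runway"]   # illustrative corpus
bow = CountVectorizer(max_features=4000).fit(corpus)

def text_features(sentence):
    return torch.tensor(bow.transform([sentence]).toarray()[0], dtype=torch.float)
```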
S2: the obtained feature vector information is mapped to a common subspace through modality-specific sub-networks.
A cross-modal information aggregation constraint is obtained on the basis of the common subspace. For example, the obtained original feature vectors of the two modalities can be nonlinearly mapped to the common subspace through their respective modality-specific sub-networks, and a cross-modal information aggregation constraint is proposed over this common subspace, aggregating global information and fine-grained information in the data set while considering absolute distance and relative distance simultaneously.
Specifically, the cross-modal information aggregation constraint refers to a loss function (also called the retrieval loss by some scholars) defined on the common representations in the common subspace, denoted L_r. It is called cross-modal information aggregation because this retrieval loss aggregates global information and fine-grained information in the data set, which makes the name more intuitive than "retrieval loss"; the retrieval loss is simply the concrete function that implements the cross-modal information aggregation constraint. The semantic constraint, similarly, is a semantic loss function defined on the vector representations in the semantic subspace, denoted L_s. The modality constraint is the loss function of the modality discriminator, called the modal loss and denoted L_d. In the actual design, each constraint corresponds to a specific loss function.
S3: the common representations in the common subspace are mapped to a semantic subspace through a first network.
A potential association is established between the vector representations in the semantic subspace and the sample labels, and this association corresponds to the semantic constraint. For example, the common representations in the common subspace may be further mapped to a semantic subspace through the first network, and a semantic constraint is proposed based on the potential associations between the vector representations in the semantic subspace and the sample labels. Specifically, the common subspace is obtained in S2. For example, in S2 an image feature vector (say 4096-dimensional) is input into the image sub-network and mapped to a 200-dimensional output vector. This 200-dimensional vector can be regarded as a point in a 200-dimensional space whose coordinates are given by the vector. Thus, via step S2, every feature vector is mapped to a point in the 200-dimensional space; this space is called the common subspace because both images and texts lie in it, and each point in the space (i.e., each 200-dimensional vector) is called a common representation. The first and second networks are shown in Figs. 1 and 3: the first network consists of a single fully-connected layer, and by the same reasoning as for the common subspace its output vectors can be regarded as points in a space called the semantic subspace. The second network consists of three fully-connected layers. "Modality discriminator" is simply a name given according to the function of the second network: its output is not a vector but a single value (a scalar) that can be used to discriminate the modality.
S4: a modality discriminator is constructed over the common representations in the common subspace through a second network, and the constructed discriminator is used to distinguish the original modality of each common representation.
This embodiment works similarly to a conventional search engine (for example, an image search engine): when a query term is input, the model converts it into a common representation, then searches, for example, a database for the common representation from the other modality that is most similar to it, and returns the corresponding query result to the user. For example, if the user enters the word "cat", the model returns some pictures of cats; if the user enters a picture, the model returns some reports or descriptions about that picture. The common representations in the common subspace can further pass through a second network that constructs a modality discriminator, and a modality constraint based on this discriminator is proposed to distinguish the original modality of each common representation. Its optimization goal is opposite to that of the cross-modal information aggregation constraint, and the two contend with each other over modality-specific information, thereby introducing adversarial learning into the model. The invention adopts dual constraints over two subspaces together with adversarial learning to minimize information loss in the cross-modal process and to generate common representations with stronger cross-modal similarity and semantic discriminability.
In this embodiment, step S1 includes: performing feature extraction on the image data through a VGG-19 network, where the 4096-dimensional vectors output by its fc7 layer are obtained as input to the image sub-network in the second model; and processing the text data with a bag-of-words model (BoW) to generate high-dimensional text feature vectors as input to the text sub-network in the second model. For example, as shown in Fig. 2, the unprocessed images and texts are input as samples into their respective modality-specific preprocessing modules for feature extraction, and the original high-dimensional image feature vectors and text feature vectors are obtained as input to the subsequent sub-networks. Although this is the first half of the whole cross-modal retrieval model, it does not participate in the training of the model; it is essentially a data preprocessing step. Specifically, for an unprocessed image sample, this embodiment performs feature extraction with a pretrained VGG-19 network, taking the 4096-dimensional vector output by its fc7 layer as input to the image sub-network in the second model. An unprocessed text sample is passed through the well-known bag-of-words model (BoW) to generate the original high-dimensional text feature vector as input to the text sub-network in the second model.
Specifically, in the construction of the common subspace, the obtained original feature vectors of the two modalities can be nonlinearly mapped to the common subspace through their respective modality-specific sub-networks. For example, the image sub-network and the text sub-network each consist of three fully-connected layers; fully-connected networks have abundant parameters, giving the sub-networks enough capacity to realize this complex transformation. To better map the different modalities into the same subspace, this embodiment shares weights in the last layer of the two sub-networks. The embodiment proposes a cross-modal information aggregation constraint based on the common subspace (referred to as the retrieval loss for simplicity), which aggregates global information and fine-grained information in the data set while simultaneously considering absolute distance and relative distance, thereby greatly reducing the information loss in the cross-modal process and giving the common representations cross-modal similarity.
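A minimal sketch of such sub-networks is given below, purely for illustration; the hidden width (1024), the input dimensions and the 200-dimensional common subspace are assumptions used only to make the example concrete.

```python
import torch
import torch.nn as nn

class ModalitySubNets(nn.Module):
    """Sketch of the image/text sub-networks: three fully-connected layers each,
    with the weights of the last layer shared between the two modalities."""
    def __init__(self, img_dim=4096, txt_dim=4000, hidden=1024, common_dim=200):
        super().__init__()
        self.img_net = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.shared = nn.Linear(hidden, common_dim)    # weight-shared last layer

    def forward(self, img_feat, txt_feat):
        u = self.shared(self.img_net(img_feat))        # image common representations
        v = self.shared(self.txt_net(txt_feat))        # text common representations
        return u, v
```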
In this embodiment, step S2 includes:
and mapping the obtained image feature vector and the obtained text feature vector to the public subspace in a nonlinear way through respective modality-specific sub-networks, wherein the image sub-network and the text sub-network are respectively composed of three layers of fully-connected neural networks, and a retrieval loss model utilized in the mapping process is composed of three sub-items.
A first sub-item is constructed using the triplet center loss, where a triplet is (t_q, c^{i_1}, c^{i_2}), t_q is a text query term, c^{i_1} is the positive class center, whose label category is the same as that of the text query term t_q, and c^{i_2} is a negative class center, whose label category differs from that of t_q.
The triplet center loss is:

$$L_{tri}^{t \to i} = \frac{1}{N_1} \sum \big[ m_1 + d(t_q, c^{i_1}) - d(t_q, c^{i_2}) \big]_+$$

where N_1 denotes the total number of triplets, m_1 represents an adjustable threshold, and c^{i_1}, c^{i_2} and c^{i_3} represent different class centers; the subscripts 1, 2 and 3 merely distinguish different class centers, while the superscript i indicates that the item belongs to the image modality and t indicates the text modality (i for image, t for text).
For example: the specific design mode for searching loss comprises the following steps: the search loss is composed of three sub-items, which are respectively in the form of triples, quadruples and pairs, and aggregates global information and fine-grained information in the data set. Wherein: step 1, constructing a first sub-item, namely a triple center loss, and then explaining by taking a text query image as an example: a triple is defined as
Figure BDA0003279449680000085
Wherein t isqIs a term of a text query that,
Figure BDA0003279449680000086
and
Figure BDA0003279449680000091
are two different image class centers. This embodiment is called
Figure BDA0003279449680000092
Class-one center, which is associated with the text query term tqThe labels are of the same type, this embodiment is called
Figure BDA0003279449680000093
Is a negative class center, which is associated with the text query term tqThe label categories are not the same. The triplet center loss is then defined as follows:
Figure BDA0003279449680000094
wherein N is1Represents the total number of triplets, m1Is an adjustable threshold value, [ A ]]+Max (0, a) is the hinge function, d (a, b) | | | a-b | | non-calculation2Representing the euclidean distance. The above formula states that the distance of a query term to the corresponding positive class center is less than the distance of the query term to any one of the negative class centers. This sub-term takes into account the absolute distance under the same anchor point and introduces global information for the model using the sample to class center relationship.
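A minimal sketch of this triplet center loss is shown below for illustration only; it assumes the image class centers are supplied as a tensor with one row per class (for instance, per-class means of the image common representations), and the margin value is illustrative.

```python
import torch

def triplet_center_loss(txt_common, txt_labels, img_centers, m1=0.4):
    """Hinge loss pushing each text query closer to its positive image class
    center than to any negative class center (text-to-image direction).
    img_centers: (num_classes, d) tensor with one center per class."""
    dist = torch.cdist(txt_common, img_centers)             # (batch, num_classes) Euclidean distances
    pos = dist.gather(1, txt_labels.view(-1, 1))             # distance to the positive center
    loss, count = 0.0, 0
    for c in range(img_centers.size(0)):
        neg = txt_labels != c                                # queries for which class c is a negative center
        if neg.any():
            loss = loss + torch.clamp(m1 + pos[neg, 0] - dist[neg, c], min=0).sum()
            count += int(neg.sum())
    return loss / max(count, 1)
```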
The second sub-item is constructed using the quadruplet center loss. A quadruplet is defined as (t_q, c^{i_1}, c^{i_2}, c^{i_3}), where c^{i_2} and c^{i_3} are any two different negative class centers. For example, Step 2 constructs this second sub-item as follows. The quadruplet center loss is defined as:

$$L_{quad}^{t \to i} = \frac{1}{N_2} \sum \big[ m_2 + d(t_q, c^{i_1}) - d(c^{i_2}, c^{i_3}) \big]_+$$

where N_2 denotes the total number of quadruplets and m_2 is another adjustable threshold. The formula states that the distance from a query term to the positive class center should be smaller than the distance between any two different negative class centers. This sub-item considers relative distances under different anchor points and introduces another part of the global information in the data set into the model, complementing the triplet center loss.
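For illustration, a sketch of the quadruplet center loss under the same assumptions as the triplet sketch (class centers supplied as a per-class tensor, illustrative margin) might look like this:

```python
import itertools
import torch

def quadruplet_center_loss(txt_common, txt_labels, img_centers, m2=0.2):
    """Hinge loss requiring the query-to-positive-center distance to be smaller
    than the distance between any two different negative class centers."""
    d_qc = torch.cdist(txt_common, img_centers)                  # query-to-center distances
    pos = d_qc.gather(1, txt_labels.view(-1, 1)).squeeze(1)      # (batch,) positive-center distances
    d_cc = torch.cdist(img_centers, img_centers)                 # center-to-center distances
    loss, count = 0.0, 0
    for j, k in itertools.combinations(range(img_centers.size(0)), 2):
        neg = (txt_labels != j) & (txt_labels != k)              # both centers are negatives for the query
        if neg.any():
            loss = loss + torch.clamp(m2 + pos[neg] - d_cc[j, k], min=0).sum()
            count += int(neg.sum())
    return loss / max(count, 1)
```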
A constraint at the global level is then established using the triplet center loss and the quadruplet center loss. For example, Step 3 combines the two formulas above to obtain the global-level constraint:

$$L_{glo}^{t \to i}(\sigma_I, \sigma_T) = L_{tri}^{t \to i} + L_{quad}^{t \to i}$$

where σ_I and σ_T are the weight parameters of the image sub-network and the text sub-network, respectively. This formula lets the model construct the common subspace using global information and greatly reduces the difficulty of model updates caused by large differences between sample pairs. In the same way, the embodiment obtains the corresponding global constraint L_{glo}^{i \to t} for the image-query-text direction.
A third sub-item is then constructed. For example, Step 4 constructs this sub-item as a constraint at the local level based on sample pairs, defined as:

$$L_{pair} = \frac{1}{N^2} \sum_{j,k} \Big( \log\big(1 + e^{\Gamma_{jk}}\big) - E_{jk}\,\Gamma_{jk} \Big), \qquad \Gamma_{jk} = \cos(i_j, t_k)$$

where E is an indicator matrix, i_j denotes the j-th image sample and t_k denotes the k-th text sample (j and k are positive integers denoting sample indices; i indicates that the item belongs to the image modality and t that it belongs to the text modality, i.e., i for image and t for text). E_{jk} indicates whether the categories of i_j and t_k are the same: if the two samples belong to the same class, E_{jk} = 1, otherwise E_{jk} = 0. This formula makes the cosine values between samples of the same class as large as possible while keeping the cosine values between samples of different classes as small as possible.
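Below is a hedged sketch of a pairwise term with the behaviour described above (same-class image-text pairs pushed toward high cosine similarity, other pairs toward low similarity); the negative-log-likelihood form is an assumption based on this description, not a verbatim reproduction of the patent's formula.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(img_common, txt_common, img_labels, txt_labels):
    """Pairwise term: raise cosine similarity for same-class image-text pairs,
    lower it for different-class pairs."""
    gamma = F.cosine_similarity(img_common.unsqueeze(1), txt_common.unsqueeze(0), dim=2)  # (n_img, n_txt)
    E = (img_labels.unsqueeze(1) == txt_labels.unsqueeze(0)).float()                      # indicator matrix
    return (torch.log1p(torch.exp(gamma)) - E * gamma).mean()
```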
A complete retrieval loss is then constructed from the global-level constraint and the third sub-item. For example, Step 5 combines the formulas of Step 3 and Step 4 to obtain the complete retrieval loss:

$$L_r(\sigma_I, \sigma_T) = L_{glo}^{t \to i} + L_{glo}^{i \to t} + \gamma\, L_{pair}$$

where γ is a hyperparameter. By aggregating constraints at different levels, this formula effectively reduces information loss and gives the common representations cross-modal similarity.
In this embodiment, in step S3 the first network consists of a single fully-connected layer, and the semantic constraint is:

$$L_s(\sigma_s) = \frac{1}{n} \sum_{x} \big\| s_x - y_x \big\|_2^2$$

where σ_s denotes the network parameters of the first network, s_x ∈ R^{d_o} is the vector representation of sample x in the semantic subspace, d_o is the number of sample classes in the training data set, R^{d_o} is the d_o-dimensional vector space, y_x ∈ R^{d_o} is the label vector of the corresponding sample, and n is the number of samples. For example, the common representations in the common subspace may be further mapped to a semantic subspace through the first network; the semantic constraint proposed on this basis exploits the potential semantic associations between the sample labels and the vector representations in the semantic subspace to optimize the distribution of the common subspace. As shown in Fig. 3, the first network consists of one fully-connected layer. The constraint not only gives the common representations semantic discriminability but also has a certain regularization effect on the process that generates them.
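A minimal sketch of this first network and semantic constraint follows; mapping labels to one-hot vectors and using a squared-error term are assumptions made only for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticHead(nn.Module):
    """First network: a single fully-connected layer mapping a common
    representation into a label-sized semantic subspace."""
    def __init__(self, common_dim=200, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(common_dim, num_classes)

    def forward(self, common_repr):
        return self.fc(common_repr)                    # vector in the semantic subspace

def semantic_loss(semantic_repr, labels, num_classes=10):
    y = F.one_hot(labels, num_classes).float()         # label vector of each sample
    return ((semantic_repr - y) ** 2).sum(dim=1).mean()
```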
In this embodiment, in step S4 the modal loss of the modality discriminator is established as:

$$f_{bce}(x, c) = c(x)\,\log\big(p(f_T(t_x; \sigma_T))\big) + (1 - c(x))\,\log\big(p(f_I(i_x; \sigma_I))\big)$$

$$L_d(\sigma_d) = \frac{1}{n} \sum_{x} f_{bce}(x, c(x))$$

where f_bce denotes the binary cross-entropy term used for modality classification, c(·) is the modality indicator with c(x) = 1 when the input x is text and c(x) = 0 otherwise, p(·) is the probability of each modality generated for the input, and σ_d denotes the parameters of the discriminator. For example, the common representations in the common subspace may further pass through a second network that constructs a modality discriminator; on this basis the embodiment proposes a modality constraint whose goal is to distinguish the original modality of each common representation as accurately as possible. The cross-modal aggregation constraint, however, aims to generate common representations with cross-modal similarity, which is contrary to the purpose of the modality discriminator; the two therefore play a max-min game as adversaries. For simplicity, the loss of the discriminator is referred to as the modal loss, as formulated above. By maximizing L_d, the performance and robustness of the model are further improved.
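For illustration, a sketch of a second-network discriminator and its log-likelihood modal loss could look as follows; the hidden width and the choice of a single sigmoid output interpreted as P(input is text) are assumptions.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """Second network: three fully-connected layers mapping a common
    representation to a scalar score interpreted as P(input is text)."""
    def __init__(self, common_dim=200, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(common_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, common_repr):
        return torch.sigmoid(self.net(common_repr)).squeeze(-1)

def modal_loss(discriminator, img_common, txt_common):
    # Log-likelihood of correct modality labels (text -> 1, image -> 0);
    # the discriminator maximizes it, the sub-networks try to reduce it.
    p_txt = discriminator(txt_common)
    p_img = discriminator(img_common)
    return (torch.log(p_txt + 1e-8).mean() + torch.log(1.0 - p_img + 1e-8).mean()) / 2
```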
Further, in this embodiment, the modality discriminator can be optimized with the Adam algorithm; during optimization, the max-min game is carried out as two parallel sub-processes:

$$(\hat{\sigma}_I, \hat{\sigma}_T, \hat{\sigma}_s) = \arg\min_{\sigma_I, \sigma_T, \sigma_s} \big( L_r + L_s - L_d \big)$$

$$\hat{\sigma}_d = \arg\max_{\sigma_d} \big( L_r + L_s - L_d \big)$$

Specifically, the Adam algorithm, which approximates the true gradient with mini-batches, can be used to update the model. Learning the best representations is a process of jointly minimizing the retrieval loss, the semantic loss and the modal loss. Since the optimization goals of the retrieval loss and the modality discriminator are opposite, the procedure operates as the max-min game of the two parallel sub-processes formulated above.
Training the model is in practice a process of repeatedly optimizing k generation steps and one discrimination step until the whole model converges. As in all adversarial learning methods, the parameters of the "generator" are fixed during the training phase of the "discriminator", and vice versa. As can be seen from Fig. 3, the method of the invention can greatly reduce the information loss in the cross-modal process, generate common representations with stronger cross-modal similarity and semantic discriminability, and effectively improve the precision of cross-modal retrieval.
The advantages of this embodiment are as follows. The cross-modal information aggregation constraint greatly reduces information loss and gives the generated common representations stronger cross-modal similarity. The semantic constraint exploits the semantic information in the sample labels to give the common representations semantic discriminability. The modality constraint further reduces information loss by using modality-specific information and enhances the robustness of the model.
It should be noted that this embodiment is not a simple calculation method, but can be applied to a retrieval system and assist in improving a search engine. For example, in practical applications, the method of this embodiment may be applied to a system as shown in fig. 5, including:
the preprocessing module is used for inputting sample data to be processed into the mode-specific preprocessing module, performing feature extraction and obtaining feature vector information, wherein the sample data to be processed comprises a sample pair consisting of image data and text data, and the obtained feature vector information comprises: image feature vectors and text feature vectors.
The processing module is used for mapping the obtained feature vector information to a public subspace through a mode-specific sub-network, wherein cross-mode information aggregation constraint is obtained on the basis of the public subspace; mapping the public representation in the public subspace through a first network to obtain a semantic subspace, wherein potential association is established between the vector representation in the semantic subspace and the sample label, and the potential association corresponds to a semantic constraint; constructing a mode discriminator for the public representation in the public subspace through a second network, and distinguishing the original mode of each public representation by using the constructed mode discriminator; and storing results of distinguishing the original modality of each common representation to the database module.
And the database module is used for storing the output modal result of the processing module.
And the query feedback module is used for receiving a query item sent by the terminal equipment, converting the query item into a public representation, then querying a modal result stored in the database to obtain a public representation which is most similar to the converted public representation and comes from another modality, and feeding back the query result to the terminal equipment.
In particular, this embodiment is suitable for mutual retrieval between the image and text domains: the query term is converted into a common representation by the trained model, and by further measuring the similarity between this representation and the other common representations, the model returns query results from the other modality. For example, it works in a manner similar to the search engines commonly used today: when a query term is entered, the model first converts it into a common representation, then looks up, for example in a database, the common representation from the other modality that is most similar to it, and returns the corresponding query results.
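Purely as an illustration of this query-time behaviour, a retrieval helper might rank stored representations of the other modality by cosine similarity; the function names and the use of cosine similarity here are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve(query_common, db_common, db_items, top_k=5):
    """Return the top-k items of the other modality whose stored common
    representations are most similar to the query's common representation."""
    sims = F.cosine_similarity(query_common.unsqueeze(0), db_common)   # (N,) similarities
    best = torch.topk(sims, k=min(top_k, len(db_items))).indices
    return [db_items[i] for i in best.tolist()]

# Example: a text query "cat" mapped through the text sub-network, then matched
# against stored image representations.
# results = retrieve(text_common, image_db_tensor, image_paths)
```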
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A cross-modal retrieval method for data processing, comprising:
S1, inputting sample data to be processed into a modality-specific preprocessing module and performing feature extraction to obtain feature vector information, wherein the sample data to be processed comprises sample pairs consisting of image data and text data, and the obtained feature vector information comprises image feature vectors and text feature vectors;
S2, mapping the obtained feature vector information to a common subspace through modality-specific sub-networks, wherein a cross-modal information aggregation constraint is obtained on the basis of the common subspace;
S3, mapping the common representations in the common subspace to a semantic subspace through a first network, wherein potential associations are established between the vector representations in the semantic subspace and the sample labels and correspond to a semantic constraint;
and S4, constructing a modality discriminator over the common representations in the common subspace through a second network, and distinguishing the original modality of each common representation by using the constructed modality discriminator.
2. The method according to claim 1, further comprising, after step S4:
after receiving a query item sent by a terminal device, converting the query item into a common representation;
querying the modality results stored in the database and obtaining a common representation that is most similar to the converted common representation and is from another modality;
and feeding back the query result to the terminal equipment.
3. The method according to claim 1, wherein step S1 includes:
performing feature extraction on the image data through a VGG-19 network, wherein 4096-dimensional vectors output in an fc7 layer are obtained as input of an image sub-network in the second model;
processing the text data by a bag of words model (BoW) and generating high-dimensional text feature vectors as input for the text sub-network in the second model.
4. The method according to claim 1, wherein step S2 includes:
and mapping the obtained image feature vector and the obtained text feature vector to the public subspace in a nonlinear way through respective modality-specific sub-networks, wherein the image sub-network and the text sub-network are respectively composed of three layers of fully-connected neural networks, and a retrieval loss model utilized in the mapping process is composed of three sub-items.
5. The method of claim 4, further comprising:
constructing a first sub-item by using the triplet center loss, wherein a triplet is (t_q, c^{i_1}, c^{i_2}), t_q is a text query term, c^{i_1} is a positive class center whose label category is the same as that of the text query term t_q, and c^{i_2} is a negative class center whose label category differs from that of t_q;
the triplet center loss is:

$$L_{tri}^{t \to i} = \frac{1}{N_1} \sum \big[ m_1 + d(t_q, c^{i_1}) - d(t_q, c^{i_2}) \big]_+$$

wherein N_1 denotes the total number of triplets, m_1 represents an adjustable threshold, and c^{i_1}, c^{i_2} and c^{i_3} respectively represent different class centers;
constructing a second sub-item by using the quadruplet center loss, wherein a quadruplet is (t_q, c^{i_1}, c^{i_2}, c^{i_3}) and c^{i_3} is a negative class center different from c^{i_2}; the quadruplet center loss is:

$$L_{quad}^{t \to i} = \frac{1}{N_2} \sum \big[ m_2 + d(t_q, c^{i_1}) - d(c^{i_2}, c^{i_3}) \big]_+$$

wherein N_2 denotes the total number of quadruplets and m_2 is another adjustable threshold;
and establishing a constraint at the global level by using the triplet center loss and the quadruplet center loss:

$$L_{glo}(\sigma_I, \sigma_T) = L_{tri}^{t \to i} + L_{quad}^{t \to i}$$

wherein σ_I and σ_T are the weight parameters of the image sub-network and the text sub-network, respectively.
6. The method of claim 5, further comprising:
constructing a third sub-item:

$$L_{pair} = \frac{1}{N^2} \sum_{j,k} \Big( \log\big(1 + e^{\Gamma_{jk}}\big) - E_{jk}\,\Gamma_{jk} \Big), \qquad \Gamma_{jk} = \cos(i_j, t_k)$$

wherein E is an indicator matrix, i_j represents the j-th image sample and t_k represents the k-th text sample, j and k represent positive integers, and E_{jk} indicates whether the categories of i_j and t_k are the same: if so, E_{jk} = 1, otherwise E_{jk} = 0;
and constructing a complete retrieval loss from the global-level constraint and the third sub-item:

$$L_r(\sigma_I, \sigma_T) = L_{glo} + \gamma\, L_{pair}$$

where γ is a hyperparameter.
7. The method according to claim 1, wherein in step S3 the first network consists of a single fully-connected layer, and the semantic constraint is:

$$L_s(\sigma_s) = \frac{1}{n} \sum_{x} \big\| s_x - y_x \big\|_2^2$$

wherein σ_s denotes the network parameters of the first network, s_x ∈ R^{d_o} denotes the vector representation of a sample in the semantic subspace, d_o denotes the number of sample classes in the training data set, R^{d_o} denotes the d_o-dimensional vector space, and y_x denotes the label vector of the corresponding sample.
8. The method according to claim 1, wherein in step S4, a modal loss of the modal discriminator is established:
$$f_{bce}(x, c) = c(x)\,\log\big(p(f_T(t_x; \sigma_T))\big) + (1 - c(x))\,\log\big(p(f_I(i_x; \sigma_I))\big)$$

$$L_d(\sigma_d) = \frac{1}{n} \sum_{x} f_{bce}(x, c(x))$$

wherein f_bce represents the binary cross-entropy loss function used for modality classification, c(·) represents the modality indicator, c(x) = 1 when the input x represents text and c(x) = 0 otherwise, p(·) represents the probability of each modality generated for the input, and σ_d represents the parameters of the discriminator.
9. The method of claim 8, further comprising:
optimizing the modality discriminator with the Adam algorithm, wherein during optimization the max-min game is carried out by two parallel sub-processes:

$$(\hat{\sigma}_I, \hat{\sigma}_T, \hat{\sigma}_s) = \arg\min_{\sigma_I, \sigma_T, \sigma_s} \big( L_r + L_s - L_d \big)$$

$$\hat{\sigma}_d = \arg\max_{\sigma_d} \big( L_r + L_s - L_d \big)$$
10. a cross-modal retrieval system for data processing, comprising:
a preprocessing module, used for inputting sample data to be processed into a modality-specific preprocessing module and performing feature extraction to obtain feature vector information, wherein the sample data to be processed comprises sample pairs consisting of image data and text data, and the obtained feature vector information comprises image feature vectors and text feature vectors;
a processing module, used for mapping the obtained feature vector information to a common subspace through modality-specific sub-networks, wherein a cross-modal information aggregation constraint is obtained on the basis of the common subspace; mapping the common representations in the common subspace through a first network to obtain a semantic subspace, wherein potential associations are established between the vector representations in the semantic subspace and the sample labels and correspond to a semantic constraint; constructing a modality discriminator over the common representations in the common subspace through a second network and using it to distinguish the original modality of each common representation; and storing the results of distinguishing the original modality of each common representation in a database module;
the database module, used for storing the modality results output by the processing module;
and a query feedback module, used for receiving a query item sent by a terminal device, converting the query item into a common representation, querying the modality results stored in the database to obtain the common representation that is from the other modality and most similar to the converted representation, and feeding the query result back to the terminal device.
CN202111128176.1A 2021-09-26 2021-09-26 Cross-modal retrieval method and system for data processing Pending CN114048295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128176.1A CN114048295A (en) 2021-09-26 2021-09-26 Cross-modal retrieval method and system for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111128176.1A CN114048295A (en) 2021-09-26 2021-09-26 Cross-modal retrieval method and system for data processing

Publications (1)

Publication Number Publication Date
CN114048295A true CN114048295A (en) 2022-02-15

Family

ID=80204797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128176.1A Pending CN114048295A (en) 2021-09-26 2021-09-26 Cross-modal retrieval method and system for data processing

Country Status (1)

Country Link
CN (1) CN114048295A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115186119A (en) * 2022-09-07 2022-10-14 深圳市华曦达科技股份有限公司 Picture processing method and system based on picture and text combination and readable storage medium
CN115422317A (en) * 2022-11-04 2022-12-02 武汉大学 Semantic tag constrained geographic information retrieval intention formalized expression method
CN116821408A (en) * 2023-08-29 2023-09-29 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system
CN116821408B (en) * 2023-08-29 2023-12-01 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system

Similar Documents

Publication Publication Date Title
CN110222140B (en) Cross-modal retrieval method based on counterstudy and asymmetric hash
CN109299342B (en) Cross-modal retrieval method based on cycle generation type countermeasure network
CN108492200B (en) User attribute inference method and device based on convolutional neural network
Xie et al. Representation learning of knowledge graphs with entity descriptions
CN114048295A (en) Cross-modal retrieval method and system for data processing
CN110309331A (en) A kind of cross-module state depth Hash search method based on self-supervisory
Lin et al. Multilabel aerial image classification with a concept attention graph neural network
CN112434628B (en) Small sample image classification method based on active learning and collaborative representation
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
Zhang et al. Representation learning of knowledge graphs with entity attributes
CN109308316A (en) A kind of adaptive dialog generation system based on Subject Clustering
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
Li et al. Inner knowledge-based Img2Doc scheme for visual question answering
Wang et al. Positive unlabeled fake news detection via multi-modal masked transformer network
Wang et al. Sin: Semantic inference network for few-shot streaming label learning
Xue et al. Relation-based multi-type aware knowledge graph embedding
Cheng et al. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval
Cheng et al. Knowledge graph representation learning with multi-scale capsule-based embedding model incorporating entity descriptions
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
Xiao et al. Triple alliance prototype orthotist network for long-tailed multi-label text classification
CN112364160A (en) Patent text classification method combining ALBERT and BiGRU
CN116662478A (en) Multi-hop retrieval method and system based on knowledge graph embedding and path information
Yu et al. Multi-module Fusion Relevance Attention Network for Multi-label Text Classification.
CN116089644A (en) Event detection method integrating multi-mode features
Wang et al. Generalised zero-shot learning for entailment-based text classification with external knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination