CN115563316A - Cross-modal retrieval method and retrieval system - Google Patents

Cross-modal retrieval method and retrieval system

Info

Publication number
CN115563316A
CN115563316A CN202211322568.6A CN202211322568A
Authority
CN
China
Prior art keywords
modal
original
alignment
cross
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211322568.6A
Other languages
Chinese (zh)
Inventor
强保华
孙苹苹
杨先一
席广勇
陈锐东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202211322568.6A priority Critical patent/CN115563316A/en
Publication of CN115563316A publication Critical patent/CN115563316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval method and a retrieval system, wherein the retrieval method comprises the following steps: encoding features with a CLIP pre-training model to obtain original modal features comprising the original image and text; performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities; passing the modal data formed in the above steps through a weight-shared multilayer perceptron to keep modality invariance; and distributing the finally obtained feature data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints. The cross-modal retrieval method of the invention makes the common representations of paired images and texts as close as possible, and enhances intra-class compactness and inter-class difference at the same time.

Description

Cross-modal retrieval method and retrieval system
Technical Field
The invention relates to the field of cross-modal retrieval with maximum semantic correlation and modality alignment, and in particular to a cross-modal retrieval method and a cross-modal retrieval system.
Background
Information resources now exhibit a mixture of multimodal data (text, images, audio, video, etc.) that are cross-linked, progressively and deeply fused, and growing rapidly. How to mine the hidden semantic associations among cross-modal data and realize cross-modal information retrieval is an important premise for making full use of multimodal data resources.
With the continuous growth of data scale and model scale, deep learning has gradually entered the pre-training model era, and how to better apply pre-training models such as CLIP and SimVLM to downstream tasks is receiving more and more attention. The text-image reasoning capability of existing pre-training models transfers relatively well to different downstream tasks such as Image Captioning, Visual Question Answering (VQA) and Cross-Modal Retrieval. Compared with traditional image classification, the CLIP model no longer assigns a noun label to each image but assigns a sentence, so that images previously forced into the same class can carry labels of almost unlimited fine granularity. Although the pre-trained CLIP model, trained with an unsupervised contrastive learning method on 400 million image-text pairs, has acquired rich text-image semantics, CLIP still encodes the two modalities independently in the preceding encoding stages and lacks interaction of information between modalities. CLIP uses a contrastive loss constraint to judge whether two modalities match, so each piece of image (text) modality information has only one piece of text (image) modality information matched with it, ignoring the rich intra-modal and inter-modal semantic and discriminative information contained in one-to-many approximate matches.
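By way of illustration, the CLIP-style image-text matching described above can be sketched with the publicly available OpenAI CLIP package; the model variant, image path and caption strings below are placeholders and not part of the scheme of the invention.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # two-tower image/text encoders

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)       # placeholder image path
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # placeholder captions

with torch.no_grad():
    image_features = model.encode_image(image)  # image tower
    text_features = model.encode_text(texts)    # text tower
    # Contrastive matching: cosine similarity between L2-normalized embeddings
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print(similarity)  # probability that each placeholder caption matches the image
```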
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In view of the above, the invention discloses a novel cross-modal retrieval method. The feature representation of one modality is first re-represented by the other modality through a Decomposable Attention mechanism, acquiring richer semantic information while enhancing the semantic association of the two modalities. In the label space, the learned multi-modal features are then distributed onto a normalized hypersphere with an Arc4cmr loss function, and an angular margin penalty is added between the features and the weights so that clear decision boundaries exist between classes, thereby simultaneously enhancing intra-class compactness and inter-class difference.
Specifically, the invention is realized by the following technical scheme:
in a first aspect, the invention discloses a novel cross-modal retrieval method, which comprises the following steps:
encoding the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
keeping modality invariance by passing the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron;
and distributing the obtained modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In a second aspect, the present invention discloses a cross-modal search system, including:
an initial module: configured to encode the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
an alignment module: configured to perform attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: configured to distribute the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In a third aspect, the invention discloses a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the cross-modal retrieval method according to the first aspect.
In a fourth aspect, the present invention discloses a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the cross-modal retrieval method according to the first aspect.
Cross-modal retrieval in the prior art deals with data whose bottom-layer features are heterogeneous but whose high-level semantics are related, and resolves the heterogeneity gap by measuring the similarity of data from different modalities; the overall methods can be divided into unsupervised retrieval and supervised retrieval.
Unsupervised cross-modal retrieval: Canonical Correlation Analysis (CCA) is essentially a multivariate statistical analysis. It uses the correlation between many image-text matching pairs to obtain an unsupervised common subspace that maximizes pairwise similarity, and maps the image features and text features into this common subspace to obtain a unified characterization of data from different modalities that reflects the overall correlation between the two modalities, thereby realizing cross-modal retrieval. Kernel Canonical Correlation Analysis (KCCA) introduces kernel tricks to improve the situation in which CCA may fail when a non-linear correlation exists between the two variables. The Correspondence Auto-Encoder (Corr-AE) takes both reconstruction error and correlation loss into account in cross-modal retrieval with an auto-encoder.
Supervised cross-modal retrieval: Joint Representation Learning (JRL) integrates sparse and semi-supervised regularization of different media types in a unified framework to jointly explore pairwise correlation and semantic information. Adversarial Cross-Modal Retrieval (ACMR) attempts to distinguish different modalities with the idea of adversarial learning. Cross-modal Correlation Learning (CCL) mines coarse- and fine-grained information of different media types in a multi-task learning manner. Deep Supervised Cross-Modal Retrieval (DSCMR) keeps semantic distinctiveness by linearly classifying samples in a common representation space and keeps modality invariance in that space through a weight-sharing strategy. CLIP for Supervised Cross-Modal Retrieval (CLIP4CMR) adds class-level association information to the pre-trained CLIP model: CLIP is used as the backbone network to generate the original feature representation of each modality, which is then fed into a per-modality multi-layer perceptron to learn a common representation space; to address the lack of robustness to unknown classes, a group of unified prototypes is allocated as class agents and a Nearest-Prototype classification rule is used for inference.
However, the traditional processing method for cross-modal retrieval in the prior art is to embed the text and the image into a joint latent space via a two-tower model and then apply a distance measure such as cosine similarity so that matched text and images have higher similarity. There is, however, a relatively large representation difference between the two modalities, which makes direct comparison between them difficult.
To solve the above technical problems, the invention provides a cross-modal retrieval method. The features are first encoded by the pre-training model CLIP to obtain the original image and text representations. To further enhance modality information interaction, the original modality representations are then fed into the attention alignment module: for each query of the image (text) modality within one batch, more attention is paid to the text (image) samples in the batch-sized library of the other modality that match the query, realizing mutual alignment of individual samples and enhancing the semantic association of the two kinds of modality information. Finally, the data after the above operations are processed by a multilayer perceptron with shared weight parameters, which generates a common representation space for the data of each modality while adding semantic constraints, so that the common representations of paired images and texts are as close as possible, and intra-class compactness and inter-class difference are enhanced simultaneously.
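As a minimal illustrative sketch only, under assumed feature dimensions and module names, the above pipeline (CLIP encoding, attention alignment with Add and Layer Normalization, and a weight-shared multilayer perceptron producing a common representation space) could be organized as follows; the attention alignment and the Arc4cmr loss are described in detail in the later sections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPipeline(nn.Module):
    def __init__(self, dim=1024, common_dim=1024):
        super().__init__()
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        # One MLP whose weights are shared by both modalities (modality invariance)
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, common_dim), nn.ReLU(), nn.Linear(common_dim, common_dim)
        )

    def align(self, query, context):
        # Decomposable-attention-style alignment: re-represent `query` by `context`
        attn = F.softmax(query @ context.t() / query.size(-1) ** 0.5, dim=-1)
        return attn @ context

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (batch, dim) CLIP features of paired images / texts
        img_aligned = self.norm_img(img_feat + self.align(img_feat, txt_feat))  # Add + LayerNorm
        txt_aligned = self.norm_txt(txt_feat + self.align(txt_feat, img_feat))
        # Weight-shared MLP maps both modalities into one common space;
        # L2 normalization makes the later loss depend only on angles
        img_common = F.normalize(self.shared_mlp(img_aligned), dim=-1)
        txt_common = F.normalize(self.shared_mlp(txt_aligned), dim=-1)
        return img_common, txt_common
```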
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is an overall framework diagram of a cross-modal retrieval method according to an embodiment of the present invention;
fig. 2 is an operation diagram of a mode alignment method according to an embodiment of the present invention;
FIG. 3 is a schematic view of an angle space of Arc4cmr loss according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 5 is a result diagram of a visualization experiment provided by the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The invention discloses a cross-modal retrieval method, which comprises the following steps as shown in figure 1:
encoding the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
keeping modality invariance by passing the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron;
and distributing the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In the scheme of the invention, modality alignment is placed after the encoding by the backbone network in order to increase the matching degree of same-class cross-modal data and the separation degree of different-class modal data. One decomposable-attention adjustment is made between each sample of modality 1 (image or text) and all samples of modality 2 (text or image) within the batch. On the basis of the original feature representations obtained by the CLIP encoders, the feature representation of one modality is re-represented by the other modality through the Decomposable Attention mechanism to enhance the semantic association of the two modalities. In the modality alignment process, a single query of one modality obtains several pieces of approximately matched information from the other modality, so that richer semantic information is acquired. To prevent the information loss that would be caused by an excessive proportion of irrelevant modality content in the new feature representation, an Add operation is performed on the modality-aligned output features and the original features, followed by Layer Normalization, which keeps the feature distribution stable during optimization and accelerates the convergence of the model; that is, the final image representation is $F_v = \mathrm{LayerNorm}(v + \tilde{v})$ and the final text representation is $F_t = \mathrm{LayerNorm}(t + \tilde{t})$, where $v$ and $t$ denote the original CLIP image and text features and $\tilde{v}$, $\tilde{t}$ denote the attention-aligned features.
The modality alignment module thus adds the original image (text) feature representation to the image (text) feature re-represented by the text (image) and normalizes the result, which promotes the interaction of the two kinds of modality information, increases the homogeneous aggregation degree and heterogeneous separation degree of cross-modal data, and improves the precision of both retrieval directions in a balanced manner.
Specifically, as shown in Fig. 2, the original image features (the left-side stripes; the images within a batch are distinguished by color) are used as the query Q, and the similarity between each image and all original text features K in the batch is calculated: one Q is multiplied by multiple Ks to obtain attention weights (the stripes of different lengths between K and V in the figure, where a longer stripe indicates higher similarity), giving a one-to-many relationship. The attention weights are then multiplied by the original text feature values V to obtain a new text feature representation of the image, i.e., the aligned text representation (the right-side stripes). To prevent the information carried by the image from being lost when semantically irrelevant original text features receive too large a weight in the new aligned representation, the original image feature and the aligned text feature are added and processed with Layer Normalization.
The above process describes the decomposable-attention adjustment when modality 1 is the original image and modality 2 is the original text, i.e., the operation used when an image (modality 1) retrieves text (modality 2). Retrieval in the two directions is symmetric: when text (modality 1) retrieves an image (modality 2), only the specific inputs of Q, K and V change. Since the core of the method is a matrix operation, the alignment principle is the same, and the attention weight matrix obtained when retrieving text with an image can be transposed and reused for retrieving images with text.
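A minimal sketch of the per-batch alignment described above, for the image-retrieves-text direction, is given below; tensor shapes, the scaling factor and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def align_image_with_text(img_feat, txt_feat, layer_norm):
    # img_feat: (B, d) original CLIP image features of a batch (used as Q)
    # txt_feat: (B, d) original CLIP text features of the same batch (used as K and V)
    attn = F.softmax(img_feat @ txt_feat.t() / img_feat.size(-1) ** 0.5, dim=-1)  # Q·K^T -> attention weights
    aligned_txt = attn @ txt_feat                # weights times V: text re-representation of each image
    return layer_norm(img_feat + aligned_txt)    # Add + Layer Normalization

# Example usage with a hypothetical 1024-dimensional feature size:
#   ln = torch.nn.LayerNorm(1024)
#   img_aligned = align_image_with_text(img_feat, txt_feat, ln)
# For the text-retrieves-image direction the roles of Q and K/V are swapped
# (equivalently, the attention weight matrix above is transposed and reused).
```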
In addition, the cross-modal retrieval task requires simultaneously increasing intra-class similarity and aggregation while increasing inter-class variability and separation. In order to increase intra-class compactness and inter-class separability while satisfying classification, and to eliminate the problem of ambiguous boundaries, the Additive Angular Margin loss (ArcFace) is applied to the field of cross-modal retrieval and named the Arc4cmr loss. The specific process is as follows: the classification margin is enforced directly between the nearest classes in the angular space. The feature $x_i$ and the corresponding weight $W_{y_i}$ are L2-regularized so that $\|W_{y_i}\| = 1$, and the normalized feature is multiplied by a re-scale parameter $s$ so that $\|x_i\| = s$, i.e., the embedded features are distributed on a hypersphere of radius $s$. On the other hand, a self-defined additive angular margin $m$ is added between the feature $x_i$ and the target weight $W_{y_i}$, replacing $\cos\theta_{y_i}$ with $\cos(\theta_{y_i} + m)$ while keeping everything else unchanged. In effect, each weight $W_k$ provides a class center; because of the additional angular interval, the target angle becomes $\theta_{y_i} + m$, the corresponding output becomes smaller than the original and the angular gap becomes larger, which increases the training difficulty and leads to better clustering toward the class centers, while the normalization of features and weights makes the prediction depend only on the angle between them. Finally, adding the angular margin penalty $m$ between $x_i$ and $W_{y_i}$ simultaneously enhances intra-class compactness and inter-class difference. The specific expression is given as Formula 1, and the constraint conditions are given as Formula 2:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{k=1,\,k\neq y_i}^{n}e^{s\cos\theta_{k}}} \qquad (1)$$
$$\cos\theta_{k}=\frac{W_{k}^{\top}x_{i}}{\|W_{k}\|\,\|x_{i}\|},\qquad \|W_{y_i}\|=1,\qquad \|x_{i}\|=s \qquad (2)$$
In the above formulas, $N$ is the batch size, i.e., $i = 1, 2, \dots, N$; $x_i$ is the input feature and $y_i$ is its class label; $\theta_{y_i}$ is the angle between the feature $x_i$ and the corresponding weight $W_{y_i}$; $m$ is the angular margin penalty; $n$ is the number of classes, i.e., $k = 1, 2, \dots, n$; $W_k$ is the weight of class $k$; and $\theta_k$ is the angle between the input feature $x_i$ and the weight $W_k$ of a class $k$ other than $y_i$ (the class into which $x_i$ might be misjudged). Formulas 1 and 2 only change their inputs for the different retrieval requirements: for image-retrieves-text (I2T), the loss function $L_I$ takes the aligned image features $F_v$ as input, with the corresponding L2 regularization applied to $F_v$; for text-retrieves-image (T2I), the loss function $L_T$ takes the aligned text features $F_t$ as input, with the corresponding L2 regularization applied to $F_t$.
In conclusion, the objective function of the proposed SMR-MA model is $L_{Arc4cmr} = L_I + L_T$. The angular space of the Arc4cmr loss is illustrated in Fig. 3, where different colors represent different categories, circles represent the image modality, and triangles represent the text modality.
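A sketch of the Arc4cmr loss described above follows, with assumed values for the re-scale parameter s and the margin m: features and class weights are L2-normalized, the angular margin m is added only to the target-class angle, the cosines are re-scaled by s, and a standard cross-entropy is applied; applying the same criterion to the image and text common representations gives L_I and L_T respectively.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Arc4cmrLoss(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):  # s and m are assumed hyper-parameters
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))  # one class center W_k per class

    def forward(self, x, labels):
        # Cosine of the angle between each feature and each L2-normalized class weight
        cosine = F.linear(F.normalize(x, dim=-1), F.normalize(self.weight, dim=-1))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)  # re-scale by s, then cross-entropy

# The same criterion applied to both common representations gives the overall objective:
#   criterion = Arc4cmrLoss(feat_dim=1024, num_classes=10)
#   loss = criterion(img_common, labels) + criterion(txt_common, labels)  # L_I + L_T
```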
In addition, the invention provides a cross-modal retrieval system corresponding to the above cross-modal retrieval method, which specifically comprises the following modules:
an initial module: configured to encode the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
an alignment module: configured to perform attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: configured to distribute the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
Experimental example 1
The overall performance of the cross-modal retrieval method implemented by the embodiment of the invention is compared with 8 representative baseline methods in the prior art, including 4 traditional methods, namely CCA, KCCA, Corr-AE and JRL, and 4 deep learning-based methods, namely ACMR, CCL, DSCMR and CLIP4CMR. The standard cross-modal retrieval metric, mean Average Precision (mAP), is used as the evaluation index, and the mAP scores of image-retrieves-text (I2T) and text-retrieves-image (T2I) are compared and verified.
TABLE 1 comparison of mAP values on the reference dataset for SMR-MA and baseline methods
Comprehensive analysis of the experiments on the three benchmark datasets shows that the method performs well on cross-modal retrieval tasks. Compared with the baseline methods that currently obtain the best results on Wikipedia, Pascal Sentence and NUS-WIDE, SMR-MA improves the mAP by 9.4%, 0.7%, 3.4% and 8.7% respectively, achieving state-of-the-art (SOTA) results and therefore having high application value.
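For reference, the mAP evaluation used above can be sketched as follows (assumed inputs: L2-normalized common representations and integer class labels); a retrieved item is counted as relevant when it shares the query's semantic label.

```python
import numpy as np

def mean_average_precision(query_emb, query_lbl, gallery_emb, gallery_lbl):
    sims = query_emb @ gallery_emb.T          # cosine similarity (embeddings pre-normalized)
    aps = []
    for i in range(len(query_emb)):
        order = np.argsort(-sims[i])                                   # rank gallery by similarity
        relevant = (gallery_lbl[order] == query_lbl[i]).astype(np.float64)
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())  # Average Precision of this query
    return float(np.mean(aps))

# I2T: images as queries against the text gallery; T2I: texts against the image gallery.
```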
In order to visually observe the effectiveness of the maximum semantic correlation and modality alignment model (SMR-MA), i.e., whether the high-dimensional image and text samples obtain good separability in the shared representation space, the original 1024-dimensional high-dimensional data are projected into a 2-dimensional space for visualization through the t-SNE (t-distributed Stochastic Neighbor Embedding) nonlinear dimensionality reduction algorithm. The Wikipedia dataset is selected for the visualization experiment. Fig. 5(d) and Fig. 5(e) show the original feature distributions of the image and text obtained by the CLIP vision encoder and text encoder, respectively; the two graphs show that the degree of separation between classes and the degree of aggregation within classes are low, resulting in low accuracy for direct matching. Fig. 5(a) and Fig. 5(b) show the distributions of the image and text representations after SMR-MA, respectively; both effectively separate samples of different semantic categories into corresponding semantically discriminative clusters. Fig. 5(c) shows the degree of overlap of the feature embedding distributions of the two modalities in the common representation space, indicating that the method has a significant effect on eliminating modality differences.
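A visualization along the lines described above can be sketched as follows; the array names are hypothetical, and the t-SNE initialization and random seed are illustrative choices rather than settings from the experiment.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    # features: (n_samples, 1024) common representations; labels: (n_samples,) class ids
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure()
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()

# e.g. plot_tsne(img_common, img_labels, "Image representations after SMR-MA")
#      plot_tsne(txt_common, txt_labels, "Text representations after SMR-MA")
```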
Fig. 4 is a schematic structural diagram of a computer device disclosed in the present invention. Referring to Fig. 4, the computer device 400 includes at least a memory 402 and a processor 401; the memory 402 is connected to the processor 401 through a communication bus 403 and is configured to store computer instructions executable by the processor 401, and the processor 401 is configured to read the computer instructions from the memory 402 to implement the steps of the cross-modal retrieval method according to any of the above embodiments.
For the above-mentioned apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal magnetic disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Finally, it should be noted that: while this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (9)

1. A cross-modal retrieval method is characterized by comprising the following steps:
coding the features by adopting a CLIP pre-training model to obtain original modal features comprising original images and texts;
performing attention alignment processing on the original modal characteristics to obtain modal alignment data so as to realize semantic correlation between the original modalities;
keeping the modal invariance of the modal alignment data formed in the previous step through a weight-sharing multilayer perceptron;
and distributing the finally obtained modal data to a normalized hypersphere by utilizing an Arc4cmr loss function to carry out class boundary constraint.
2. The cross-modality retrieval method according to claim 1, wherein the attention-alignment processing method comprises:
through the Decomposable Attention mechanism, each sample of the original image (text) contained in modality 1 is readjusted by the text (image) contained in all modality 2 samples in the batch, that is, the modality 1 data are re-represented by the modality 2 data.
3. The cross-modality retrieval method according to claim 2, further comprising, after the attention-alignment process:
performing an Add operation on the modality-aligned output features and the original modality features, and performing Normalization processing on the result to accelerate convergence of the model, so as to obtain the final image modality feature data $F_v$ and the final text modality feature data $F_t$.
4. The cross-modality retrieval method of claim 3, wherein the modality alignment method comprises: when the mode 1 is an original image and the mode 2 is an original text, the original features of the images in the batch are used as query Q, similarity between each image and all original features K of the texts in the batch is calculated, attention weight is obtained, and then the attention weight is multiplied by the specific feature value V of the original features of the texts to obtain the output features which are subjected to mode alignment.
5. The cross-modal search method of claim 2, wherein the method for performing class boundary constraint by distributing the finally obtained modal data onto a normalized hypersphere using the Arc4cmr loss function comprises:
performing L2 regularization on the feature $x_i$ and the corresponding weight $W_{y_i}$ so that $\|W_{y_i}\| = 1$, and multiplying the normalized feature by a re-scale parameter $s$ so that $\|x_i\| = s$, i.e., distributing the embedded features on a hypersphere of radius $s$;
adding a self-defined additive angular margin $m$ between the feature $x_i$ and the target weight $W_{y_i}$, i.e., replacing $\cos\theta_{y_i}$ with $\cos(\theta_{y_i} + m)$.
6. The cross-modal search method of claim 5, wherein the distribution onto the normalized hypersphere is expressed by the following formulas:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{k=1,\,k\neq y_i}^{n}e^{s\cos\theta_{k}}}$$
$$\cos\theta_{k}=\frac{W_{k}^{\top}x_{i}}{\|W_{k}\|\,\|x_{i}\|},\qquad \|W_{y_i}\|=1,\qquad \|x_{i}\|=s$$
in the above formulas, $N$ is the batch size, i.e., $i = 1, 2, \dots, N$; $x_i$ is the input feature and $y_i$ is its class label; $\theta_{y_i}$ is the angle between the feature $x_i$ and the corresponding weight $W_{y_i}$; $m$ is the angular margin penalty; $n$ is the number of classes, i.e., $k = 1, 2, \dots, n$; $W_k$ is the weight of class $k$; and $\theta_k$ is the angle between the input feature $x_i$ and the weight $W_k$ of a class $k$ other than $y_i$;
for image-retrieves-text (I2T), the loss function $L_I$ takes the aligned image features $F_v$ as input, with the corresponding L2 regularization applied to $F_v$; for text-retrieves-image (T2I), the loss function $L_T$ takes the aligned text features $F_t$ as input, with the corresponding L2 regularization applied to $F_t$;
the objective function used by the maximum semantic correlation and modality alignment model is then $L_{Arc4cmr} = L_I + L_T$.
7. A retrieval system using the cross-modality retrieval method according to any one of claims 1 to 6, characterized by comprising:
an initial module: the method comprises the steps of coding the characteristics of an image and a text sample by adopting a CLIP pre-training model to obtain original mode characteristics comprising an original image and a text;
an alignment module: the original modal characteristics are subjected to attention alignment processing to obtain modal alignment data so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: and the method is used for distributing the obtained final modal data to a normalized hypersphere by utilizing an Arc4cmr loss function to carry out class boundary constraint.
8. A computer-readable storage medium, on which a computer program is stored, which, when executed, carries out the steps of the cross-modality search method of any one of claims 1 to 6.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the cross-modality retrieval method according to any one of claims 1-6 are implemented when the program is executed by the processor.
CN202211322568.6A 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system Pending CN115563316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211322568.6A CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211322568.6A CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Publications (1)

Publication Number Publication Date
CN115563316A true CN115563316A (en) 2023-01-03

Family

ID=84769402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211322568.6A Pending CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Country Status (1)

Country Link
CN (1) CN115563316A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905584A (en) * 2023-01-09 2023-04-04 共道网络科技有限公司 Video splitting method and device
CN115905584B (en) * 2023-01-09 2023-08-11 共道网络科技有限公司 Video splitting method and device

Similar Documents

Publication Publication Date Title
Latif et al. Content‐Based Image Retrieval and Feature Extraction: A Comprehensive Review
CN107209860B (en) Method, system, and computer storage medium for processing weakly supervised images
Amores Multiple instance classification: Review, taxonomy and comparative study
Maji et al. Efficient classification for additive kernel SVMs
Wang et al. A deep semantic framework for multimodal representation learning
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
Varga et al. Fast content-based image retrieval using convolutional neural network and hash function
Wu et al. Vehicle re-identification in still images: Application of semi-supervised learning and re-ranking
Li et al. Relevance feedback in content-based image retrieval: a survey
CN110008365B (en) Image processing method, device and equipment and readable storage medium
Cao et al. Learning to match images in large-scale collections
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN109766455A (en) A kind of full similitude reservation Hash cross-module state search method having identification
Boutell et al. Multi-label Semantic Scene Classfication
Ghrabat et al. Greedy learning of deep Boltzmann machine (GDBM)’s variance and search algorithm for efficient image retrieval
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN115563316A (en) Cross-modal retrieval method and retrieval system
Liang et al. Landmarking manifolds with Gaussian processes
Shirahama et al. Event retrieval in video archives using rough set theory and partially supervised learning
Polley et al. X-vision: explainable image retrieval by re-ranking in semantic space
Malisiewicz Exemplar-based representations for object detection, association and beyond
Li et al. SPA: spatially pooled attributes for image retrieval
Che et al. Image retrieval by information fusion based on scalable vocabulary tree and robust Hausdorff distance
Mercy Rajaselvi Beaulah et al. Categorization of images using autoencoder hashing and training of intra bin classifiers for image classification and annotation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination