CN115563316A - Cross-modal retrieval method and retrieval system - Google Patents

Cross-modal retrieval method and retrieval system

Info

Publication number
CN115563316A
CN115563316A CN202211322568.6A CN202211322568A
Authority
CN
China
Prior art keywords
modal
original
alignment
cross
modality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211322568.6A
Other languages
Chinese (zh)
Inventor
强保华
孙苹苹
杨先一
席广勇
陈锐东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202211322568.6A priority Critical patent/CN115563316A/en
Publication of CN115563316A publication Critical patent/CN115563316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval method and a retrieval system, wherein the retrieval method comprises the following steps: encoding features with a CLIP pre-training model to obtain original modal features comprising the original image and text; performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities; passing the modal data formed in the above steps through a weight-shared multilayer perceptron to keep modality invariance; and distributing the finally obtained feature data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints. The cross-modal retrieval method of the invention makes the common representations of paired images and texts as close as possible, and enhances intra-class compactness and inter-class difference at the same time.

Description

Cross-modal retrieval method and retrieval system
Technical Field
The invention relates to the field of cross-modal retrieval with maximum semantic correlation and modality alignment, and in particular to a cross-modal retrieval method and a cross-modal retrieval system.
Background
Information resources now exhibit a mixture of multimodal data (text, images, audio, video, etc.) that are cross-linked, progressively and deeply fused, and growing rapidly. How to mine the hidden semantic associations among cross-modal data and realize cross-modal information retrieval is an important premise for making full use of multimodal data resources.
With the continuous growth of data scale and model scale, deep learning has gradually entered the pre-training model era, and how to better apply pre-training models such as CLIP and SimVLM to downstream tasks is receiving more and more attention. The text-image reasoning capability of existing pre-training models transfers relatively well to different downstream tasks such as Image Captioning, Visual Question Answering (VQA) and Cross-Modal Retrieval. Compared with traditional image classification, the CLIP model no longer assigns a noun label to each image but assigns a sentence, so that images previously forced into the same class can carry labels of almost unlimited fine granularity. Although the pre-trained CLIP model, trained with an unsupervised contrastive learning method on 400 million image-text pairs, has acquired rich text-image semantics, CLIP still encodes the two modalities independently in the preceding encoding stages and lacks interaction of information between modalities. CLIP uses a contrastive loss constraint to judge whether two modalities match, so each piece of image (text) modality information has only one piece of text (image) modality information matched with it, ignoring the rich intra-modal and inter-modal semantic and discriminative information contained in one-to-many approximate matches.
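By way of illustration, the CLIP-style image-text matching described above can be sketched with the publicly available OpenAI CLIP package; the model variant, image path and caption strings below are placeholders and not part of the scheme of the invention.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # two-tower image/text encoders

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)       # placeholder image path
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # placeholder captions

with torch.no_grad():
    image_features = model.encode_image(image)  # image tower
    text_features = model.encode_text(texts)    # text tower
    # Contrastive matching: cosine similarity between L2-normalized embeddings
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print(similarity)  # probability that each placeholder caption matches the image
```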
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In view of the above, the invention discloses a novel cross-modal retrieval method. The feature representation of one modality is first re-represented by the other modality through a Decomposable Attention mechanism, acquiring richer semantic information while enhancing the semantic association of the two modalities. In the label space, the learned multi-modal features are then distributed onto a normalized hypersphere with an Arc4cmr loss function, and an angular margin penalty is added between the features and the weights so that clear decision boundaries exist between classes, thereby simultaneously enhancing intra-class compactness and inter-class difference.
Specifically, the invention is realized by the following technical scheme:
in a first aspect, the invention discloses a novel cross-modal retrieval method, which comprises the following steps:
encoding the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
keeping modality invariance by passing the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron;
and distributing the obtained modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In a second aspect, the present invention discloses a cross-modal search system, including:
an initial module: configured to encode the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
an alignment module: configured to perform attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: configured to distribute the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In a third aspect, the invention discloses a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the cross-modal retrieval method according to the first aspect.
In a fourth aspect, the present invention discloses a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the cross-modal retrieval method according to the first aspect.
Cross-modal retrieval in the prior art deals with data whose bottom-layer features are heterogeneous but whose high-level semantics are related, and resolves the heterogeneity gap by measuring the similarity of data from different modalities; the overall methods can be divided into unsupervised retrieval and supervised retrieval.
Unsupervised cross-modal retrieval: Canonical Correlation Analysis (CCA) is essentially a multivariate statistical analysis. It uses the correlation between many image-text matching pairs to obtain an unsupervised common subspace that maximizes pairwise similarity, and maps the image features and text features into this common subspace to obtain a unified characterization of data from different modalities that reflects the overall correlation between the two modalities, thereby realizing cross-modal retrieval. Kernel Canonical Correlation Analysis (KCCA) introduces kernel tricks to improve the situation in which CCA may fail when a non-linear correlation exists between the two variables. The Correspondence Auto-Encoder (Corr-AE) takes both reconstruction error and correlation loss into account in cross-modal retrieval with an auto-encoder.
Supervised cross-modal retrieval: Joint Representation Learning (JRL) integrates sparse and semi-supervised regularization of different media types in a unified framework to jointly explore pairwise correlation and semantic information. Adversarial Cross-Modal Retrieval (ACMR) attempts to distinguish different modalities with the idea of adversarial learning. Cross-modal Correlation Learning (CCL) mines coarse- and fine-grained information of different media types in a multi-task learning manner. Deep Supervised Cross-Modal Retrieval (DSCMR) keeps semantic distinctiveness by linearly classifying samples in a common representation space and keeps modality invariance in that space through a weight-sharing strategy. CLIP for Supervised Cross-Modal Retrieval (CLIP4CMR) adds class-level association information to the pre-trained CLIP model: CLIP is used as the backbone network to generate the original feature representation of each modality, which is then fed into a per-modality multi-layer perceptron to learn a common representation space; to address the lack of robustness to unknown classes, a group of unified prototypes is allocated as class agents and a Nearest-Prototype classification rule is used for inference.
However, the traditional processing method for cross-modal retrieval in the prior art is to embed the text and the image into a joint latent space via a two-tower model and then apply a distance measure such as cosine similarity so that matched text and images have higher similarity. There is, however, a relatively large representation difference between the two modalities, which makes direct comparison between them difficult.
To solve the above technical problems, the invention provides a cross-modal retrieval method. The features are first encoded by the pre-training model CLIP to obtain the original image and text representations. To further enhance modality information interaction, the original modality representations are then fed into the attention alignment module: for each query of the image (text) modality within one batch, more attention is paid to the text (image) samples in the batch-sized library of the other modality that match the query, realizing mutual alignment of individual samples and enhancing the semantic association of the two kinds of modality information. Finally, the data after the above operations are processed by a multilayer perceptron with shared weight parameters, which generates a common representation space for the data of each modality while adding semantic constraints, so that the common representations of paired images and texts are as close as possible, and intra-class compactness and inter-class difference are enhanced simultaneously.
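As a minimal illustrative sketch only, under assumed feature dimensions and module names, the above pipeline (CLIP encoding, attention alignment with Add and Layer Normalization, and a weight-shared multilayer perceptron producing a common representation space) could be organized as follows; the attention alignment and the Arc4cmr loss are described in detail in the later sections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPipeline(nn.Module):
    def __init__(self, dim=1024, common_dim=1024):
        super().__init__()
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        # One MLP whose weights are shared by both modalities (modality invariance)
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, common_dim), nn.ReLU(), nn.Linear(common_dim, common_dim)
        )

    def align(self, query, context):
        # Decomposable-attention-style alignment: re-represent `query` by `context`
        attn = F.softmax(query @ context.t() / query.size(-1) ** 0.5, dim=-1)
        return attn @ context

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (batch, dim) CLIP features of paired images / texts
        img_aligned = self.norm_img(img_feat + self.align(img_feat, txt_feat))  # Add + LayerNorm
        txt_aligned = self.norm_txt(txt_feat + self.align(txt_feat, img_feat))
        # Weight-shared MLP maps both modalities into one common space;
        # L2 normalization makes the later loss depend only on angles
        img_common = F.normalize(self.shared_mlp(img_aligned), dim=-1)
        txt_common = F.normalize(self.shared_mlp(txt_aligned), dim=-1)
        return img_common, txt_common
```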
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is an overall framework diagram of a cross-modal retrieval method according to an embodiment of the present invention;
fig. 2 is an operation diagram of a mode alignment method according to an embodiment of the present invention;
FIG. 3 is a schematic view of an angle space of Arc4cmr loss according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 5 is a result diagram of a visualization experiment provided by the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The invention discloses a cross-modal retrieval method, which comprises the following steps as shown in figure 1:
encoding the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
performing attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
keeping modality invariance by passing the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron;
and distributing the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In the scheme of the invention, modality alignment is placed after the encoding by the backbone network in order to increase the matching degree of same-class cross-modal data and the separation degree of different-class modal data. One decomposable-attention adjustment is made between each sample of modality 1 (image or text) and all samples of modality 2 (text or image) within the batch. On the basis of the original feature representations obtained by the CLIP encoders, the feature representation of one modality is re-represented by the other modality through the Decomposable Attention mechanism to enhance the semantic association of the two modalities. In the modality alignment process, a single query of one modality obtains several pieces of approximately matched information from the other modality, so that richer semantic information is acquired. To prevent the information loss that would be caused by an excessive proportion of irrelevant modality content in the new feature representation, an Add operation is performed on the modality-aligned output features and the original features, followed by Layer Normalization, which keeps the feature distribution stable during optimization and accelerates the convergence of the model; that is, the final image representation is $F_v = \mathrm{LayerNorm}(v + \tilde{v})$ and the final text representation is $F_t = \mathrm{LayerNorm}(t + \tilde{t})$, where $v$ and $t$ denote the original CLIP image and text features and $\tilde{v}$, $\tilde{t}$ denote the attention-aligned features.
The modality alignment module thus adds the original image (text) feature representation to the image (text) feature re-represented by the text (image) and normalizes the result, which promotes the interaction of the two kinds of modality information, increases the homogeneous aggregation degree and heterogeneous separation degree of cross-modal data, and improves the precision of both retrieval directions in a balanced manner.
Specifically, as shown in Fig. 2, the original image features (the left-side stripes; the images within a batch are distinguished by color) are used as the query Q, and the similarity between each image and all original text features K in the batch is calculated: one Q is multiplied by multiple Ks to obtain attention weights (the stripes of different lengths between K and V in the figure, where a longer stripe indicates higher similarity), giving a one-to-many relationship. The attention weights are then multiplied by the original text feature values V to obtain a new text feature representation of the image, i.e., the aligned text representation (the right-side stripes). To prevent the information carried by the image from being lost when semantically irrelevant original text features receive too large a weight in the new aligned representation, the original image feature and the aligned text feature are added and processed with Layer Normalization.
The above process describes the decomposable-attention adjustment when modality 1 is the original image and modality 2 is the original text, i.e., the operation used when an image (modality 1) retrieves text (modality 2). Retrieval in the two directions is symmetric: when text (modality 1) retrieves an image (modality 2), only the specific inputs of Q, K and V change. Since the core of the method is a matrix operation, the alignment principle is the same, and the attention weight matrix obtained when retrieving text with an image can be transposed and reused for retrieving images with text.
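A minimal sketch of the per-batch alignment described above, for the image-retrieves-text direction, is given below; tensor shapes, the scaling factor and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def align_image_with_text(img_feat, txt_feat, layer_norm):
    # img_feat: (B, d) original CLIP image features of a batch (used as Q)
    # txt_feat: (B, d) original CLIP text features of the same batch (used as K and V)
    attn = F.softmax(img_feat @ txt_feat.t() / img_feat.size(-1) ** 0.5, dim=-1)  # Q·K^T -> attention weights
    aligned_txt = attn @ txt_feat                # weights times V: text re-representation of each image
    return layer_norm(img_feat + aligned_txt)    # Add + Layer Normalization

# Example usage with a hypothetical 1024-dimensional feature size:
#   ln = torch.nn.LayerNorm(1024)
#   img_aligned = align_image_with_text(img_feat, txt_feat, ln)
# For the text-retrieves-image direction the roles of Q and K/V are swapped
# (equivalently, the attention weight matrix above is transposed and reused).
```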
In addition, the cross-modal retrieval task requires simultaneously increasing intra-class similarity and aggregation while increasing inter-class variability and separation. In order to increase intra-class compactness and inter-class separability while satisfying classification, and to eliminate the problem of ambiguous boundaries, the Additive Angular Margin loss (ArcFace) is applied to the field of cross-modal retrieval and named the Arc4cmr loss. The specific process is as follows: the classification margin is enforced directly between the nearest classes in the angular space. The feature $x_i$ and the corresponding weight $W_{y_i}$ are L2-regularized so that $\|W_{y_i}\| = 1$, and the normalized feature is multiplied by a re-scale parameter $s$ so that $\|x_i\| = s$, i.e., the embedded features are distributed on a hypersphere of radius $s$. On the other hand, a self-defined additive angular margin $m$ is added between the feature $x_i$ and the target weight $W_{y_i}$, replacing $\cos\theta_{y_i}$ with $\cos(\theta_{y_i} + m)$ while keeping everything else unchanged. In effect, each weight $W_k$ provides a class center; because of the additional angular interval, the target angle becomes $\theta_{y_i} + m$, the corresponding output becomes smaller than the original and the angular gap becomes larger, which increases the training difficulty and leads to better clustering toward the class centers, while the normalization of features and weights makes the prediction depend only on the angle between them. Finally, adding the angular margin penalty $m$ between $x_i$ and $W_{y_i}$ simultaneously enhances intra-class compactness and inter-class difference. The specific expression is given as Formula 1, and the constraint conditions are given as Formula 2:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{k=1,\,k\neq y_i}^{n}e^{s\cos\theta_{k}}} \qquad (1)$$
$$\cos\theta_{k}=\frac{W_{k}^{\top}x_{i}}{\|W_{k}\|\,\|x_{i}\|},\qquad \|W_{y_i}\|=1,\qquad \|x_{i}\|=s \qquad (2)$$
In the above formulas, $N$ is the batch size, i.e., $i = 1, 2, \dots, N$; $x_i$ is the input feature and $y_i$ is its class label; $\theta_{y_i}$ is the angle between the feature $x_i$ and the corresponding weight $W_{y_i}$; $m$ is the angular margin penalty; $n$ is the number of classes, i.e., $k = 1, 2, \dots, n$; $W_k$ is the weight of class $k$; and $\theta_k$ is the angle between the input feature $x_i$ and the weight $W_k$ of a class $k$ other than $y_i$ (the class into which $x_i$ might be misjudged). Formulas 1 and 2 only change their inputs for the different retrieval requirements: for image-retrieves-text (I2T), the loss function $L_I$ takes the aligned image features $F_v$ as input, with the corresponding L2 regularization applied to $F_v$; for text-retrieves-image (T2I), the loss function $L_T$ takes the aligned text features $F_t$ as input, with the corresponding L2 regularization applied to $F_t$.
In conclusion, the objective function of the proposed SMR-MA model is $L_{Arc4cmr} = L_I + L_T$. The angular space of the Arc4cmr loss is illustrated in Fig. 3, where different colors represent different categories, circles represent the image modality, and triangles represent the text modality.
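A sketch of the Arc4cmr loss described above follows, with assumed values for the re-scale parameter s and the margin m: features and class weights are L2-normalized, the angular margin m is added only to the target-class angle, the cosines are re-scaled by s, and a standard cross-entropy is applied; applying the same criterion to the image and text common representations gives L_I and L_T respectively.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Arc4cmrLoss(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):  # s and m are assumed hyper-parameters
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))  # one class center W_k per class

    def forward(self, x, labels):
        # Cosine of the angle between each feature and each L2-normalized class weight
        cosine = F.linear(F.normalize(x, dim=-1), F.normalize(self.weight, dim=-1))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)  # re-scale by s, then cross-entropy

# The same criterion applied to both common representations gives the overall objective:
#   criterion = Arc4cmrLoss(feat_dim=1024, num_classes=10)
#   loss = criterion(img_common, labels) + criterion(txt_common, labels)  # L_I + L_T
```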
In addition, the invention provides a cross-modal retrieval system corresponding to the above cross-modal retrieval method, which specifically comprises the following modules:
an initial module: configured to encode the features of image and text samples with a CLIP pre-training model to obtain original modal features comprising the original image and text;
an alignment module: configured to perform attention alignment processing on the original modal features to obtain modality-aligned data, so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: configured to distribute the obtained final modal data onto a normalized hypersphere with an Arc4cmr loss function to impose class boundary constraints.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
Experimental example 1
The overall performance of the cross-modal retrieval method implemented by the embodiment of the invention is compared with 8 representative baseline methods in the prior art, including 4 traditional methods, namely CCA, KCCA, Corr-AE and JRL, and 4 deep learning-based methods, namely ACMR, CCL, DSCMR and CLIP4CMR. The standard cross-modal retrieval metric, mean Average Precision (mAP), is used as the evaluation index, and the mAP scores of image-retrieves-text (I2T) and text-retrieves-image (T2I) are compared and verified.
TABLE 1 comparison of mAP values on the reference dataset for SMR-MA and baseline methods
Comprehensive analysis of the experiments on the three benchmark datasets shows that the method performs well on cross-modal retrieval tasks. Compared with the baseline methods that currently obtain the best results on Wikipedia, Pascal Sentence and NUS-WIDE, SMR-MA improves the mAP by 9.4%, 0.7%, 3.4% and 8.7% respectively, achieving state-of-the-art (SOTA) results and therefore having high application value.
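For reference, the mAP evaluation used above can be sketched as follows (assumed inputs: L2-normalized common representations and integer class labels); a retrieved item is counted as relevant when it shares the query's semantic label.

```python
import numpy as np

def mean_average_precision(query_emb, query_lbl, gallery_emb, gallery_lbl):
    sims = query_emb @ gallery_emb.T          # cosine similarity (embeddings pre-normalized)
    aps = []
    for i in range(len(query_emb)):
        order = np.argsort(-sims[i])                                   # rank gallery by similarity
        relevant = (gallery_lbl[order] == query_lbl[i]).astype(np.float64)
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())  # Average Precision of this query
    return float(np.mean(aps))

# I2T: images as queries against the text gallery; T2I: texts against the image gallery.
```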
In order to visually observe the effectiveness of the maximum semantic correlation and modality alignment model (SMR-MA), i.e., whether the high-dimensional image and text samples obtain good separability in the shared representation space, the original 1024-dimensional high-dimensional data are projected into a 2-dimensional space for visualization through the t-SNE (t-distributed Stochastic Neighbor Embedding) nonlinear dimensionality reduction algorithm. The Wikipedia dataset is selected for the visualization experiment. Fig. 5(d) and Fig. 5(e) show the original feature distributions of the image and text obtained by the CLIP vision encoder and text encoder, respectively; the two graphs show that the degree of separation between classes and the degree of aggregation within classes are low, resulting in low accuracy for direct matching. Fig. 5(a) and Fig. 5(b) show the distributions of the image and text representations after SMR-MA, respectively; both effectively separate samples of different semantic categories into corresponding semantically discriminative clusters. Fig. 5(c) shows the degree of overlap of the feature embedding distributions of the two modalities in the common representation space, indicating that the method has a significant effect on eliminating modality differences.
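A visualization along the lines described above can be sketched as follows; the array names are hypothetical, and the t-SNE initialization and random seed are illustrative choices rather than settings from the experiment.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    # features: (n_samples, 1024) common representations; labels: (n_samples,) class ids
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure()
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()

# e.g. plot_tsne(img_common, img_labels, "Image representations after SMR-MA")
#      plot_tsne(txt_common, txt_labels, "Text representations after SMR-MA")
```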
Fig. 4 is a schematic structural diagram of a computer device disclosed in the present invention. Referring to Fig. 4, the computer device 400 includes at least a memory 402 and a processor 401; the memory 402 is connected to the processor 401 through a communication bus 403 and is configured to store computer instructions executable by the processor 401, and the processor 401 is configured to read the computer instructions from the memory 402 to implement the steps of the cross-modal retrieval method according to any of the above embodiments.
For the above-mentioned apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal magnetic disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Finally, it should be noted that: while this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (9)

1. A cross-modal retrieval method is characterized by comprising the following steps:
coding the features by adopting a CLIP pre-training model to obtain original modal features comprising original images and texts;
performing attention alignment processing on the original modal characteristics to obtain modal alignment data so as to realize semantic correlation between the original modalities;
keeping the modal invariance of the modal alignment data formed in the previous step through a weight-sharing multilayer perceptron;
and distributing the finally obtained modal data to a normalized hypersphere by utilizing an Arc4cmr loss function to carry out class boundary constraint.
2. The cross-modality retrieval method according to claim 1, wherein the attention-alignment processing method comprises:
through the Decomposable Attention mechanism, each sample of the original image (text) contained in modality 1 is readjusted by the text (image) contained in all modality 2 samples in the batch, that is, the modality 1 data are re-represented by the modality 2 data.
3. The cross-modality retrieval method according to claim 2, further comprising, after the attention-alignment process:
performing an Add operation on the modality-aligned output features and the original modality features, and performing Normalization processing on the result to accelerate convergence of the model, so as to obtain the final image modality feature data $F_v$ and the final text modality feature data $F_t$.
4. The cross-modality retrieval method of claim 3, wherein the modality alignment method comprises: when the mode 1 is an original image and the mode 2 is an original text, the original features of the images in the batch are used as query Q, similarity between each image and all original features K of the texts in the batch is calculated, attention weight is obtained, and then the attention weight is multiplied by the specific feature value V of the original features of the texts to obtain the output features which are subjected to mode alignment.
5. The cross-modal search method of claim 2, wherein the method for performing class boundary constraint by distributing the finally obtained modal data onto a normalized hypersphere using the Arc4cmr loss function comprises:
performing L2 regularization on the feature $x_i$ and the corresponding weight $W_{y_i}$ so that $\|W_{y_i}\| = 1$, and multiplying the normalized feature by a re-scale parameter $s$ so that $\|x_i\| = s$, i.e., distributing the embedded features on a hypersphere of radius $s$;
adding a self-defined additive angular margin $m$ between the feature $x_i$ and the target weight $W_{y_i}$, i.e., replacing $\cos\theta_{y_i}$ with $\cos(\theta_{y_i} + m)$.
6. The cross-modal search method of claim 5, wherein the distribution onto the normalized hypersphere is expressed by the following formulas:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{k=1,\,k\neq y_i}^{n}e^{s\cos\theta_{k}}}$$
$$\cos\theta_{k}=\frac{W_{k}^{\top}x_{i}}{\|W_{k}\|\,\|x_{i}\|},\qquad \|W_{y_i}\|=1,\qquad \|x_{i}\|=s$$
in the above formulas, $N$ is the batch size, i.e., $i = 1, 2, \dots, N$; $x_i$ is the input feature and $y_i$ is its class label; $\theta_{y_i}$ is the angle between the feature $x_i$ and the corresponding weight $W_{y_i}$; $m$ is the angular margin penalty; $n$ is the number of classes, i.e., $k = 1, 2, \dots, n$; $W_k$ is the weight of class $k$; and $\theta_k$ is the angle between the input feature $x_i$ and the weight $W_k$ of a class $k$ other than $y_i$;
for image-retrieves-text (I2T), the loss function $L_I$ takes the aligned image features $F_v$ as input, with the corresponding L2 regularization applied to $F_v$; for text-retrieves-image (T2I), the loss function $L_T$ takes the aligned text features $F_t$ as input, with the corresponding L2 regularization applied to $F_t$;
the objective function used by the maximum semantic correlation and modality alignment model is then $L_{Arc4cmr} = L_I + L_T$.
7. A retrieval system using the cross-modality retrieval method according to any one of claims 1 to 6, characterized by comprising:
an initial module: the method comprises the steps of coding the characteristics of an image and a text sample by adopting a CLIP pre-training model to obtain original mode characteristics comprising an original image and a text;
an alignment module: the original modal characteristics are subjected to attention alignment processing to obtain modal alignment data so as to realize semantic correlation between the original modalities;
a weight sharing module: configured to pass the modality-aligned data formed in the previous step through a weight-shared multilayer perceptron to keep modality invariance;
a normalization module: and the method is used for distributing the obtained final modal data to a normalized hypersphere by utilizing an Arc4cmr loss function to carry out class boundary constraint.
8. A computer-readable storage medium, on which a computer program is stored, which, when executed, carries out the steps of the cross-modality search method of any one of claims 1 to 6.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the cross-modality retrieval method according to any one of claims 1-6 are implemented when the program is executed by the processor.
CN202211322568.6A 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system Pending CN115563316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211322568.6A CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211322568.6A CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Publications (1)

Publication Number Publication Date
CN115563316A true CN115563316A (en) 2023-01-03

Family

ID=84769402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211322568.6A Pending CN115563316A (en) 2022-10-27 2022-10-27 Cross-modal retrieval method and retrieval system

Country Status (1)

Country Link
CN (1) CN115563316A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905584A (en) * 2023-01-09 2023-04-04 共道网络科技有限公司 Video splitting method and device
CN115905584B (en) * 2023-01-09 2023-08-11 共道网络科技有限公司 Video splitting method and device

Similar Documents

Publication Publication Date Title
Latif et al. Content‐Based Image Retrieval and Feature Extraction: A Comprehensive Review
CN107209860B (en) Method, system, and computer storage medium for processing weakly supervised images
Amores Multiple instance classification: Review, taxonomy and comparative study
Maji et al. Efficient classification for additive kernel SVMs
Wang et al. A deep semantic framework for multimodal representation learning
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
Varga et al. Fast content-based image retrieval using convolutional neural network and hash function
Wu et al. Vehicle re-identification in still images: Application of semi-supervised learning and re-ranking
Li et al. Relevance feedback in content-based image retrieval: a survey
CN110008365B (en) Image processing method, device and equipment and readable storage medium
Cao et al. Learning to match images in large-scale collections
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN109766455A (en) A kind of full similitude reservation Hash cross-module state search method having identification
Boutell et al. Multi-label Semantic Scene Classfication
Ghrabat et al. Greedy learning of deep Boltzmann machine (GDBM)’s variance and search algorithm for efficient image retrieval
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN115563316A (en) Cross-modal retrieval method and retrieval system
Liang et al. Landmarking manifolds with Gaussian processes
Shirahama et al. Event retrieval in video archives using rough set theory and partially supervised learning
Polley et al. X-vision: explainable image retrieval by re-ranking in semantic space
Malisiewicz Exemplar-based representations for object detection, association and beyond
Li et al. SPA: spatially pooled attributes for image retrieval
Che et al. Image retrieval by information fusion based on scalable vocabulary tree and robust Hausdorff distance
Mercy Rajaselvi Beaulah et al. Categorization of images using autoencoder hashing and training of intra bin classifiers for image classification and annotation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination