CN118113888A - Cross-modal fine granularity retrieval method based on multi-channel fusion - Google Patents
Cross-modal fine granularity retrieval method based on multi-channel fusion
- Publication number
- CN118113888A
- Application number
- CN202311663327.2A
- Authority
- CN
- China
- Prior art keywords
- mode
- cross
- modal
- fine
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/245—Classification techniques relating to the decision surface
- G06F18/2451—Classification techniques relating to the decision surface linear, e.g. hyperplane
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a cross-modal fine-grained retrieval method based on multi-channel fusion. On the one hand, the method uses branch networks to extract deep feature information for each of the four modalities, which allows the features specific to each modality to be extracted as fully as possible and makes full use of the feature information of every modality. On the other hand, after the deep feature information of each modality has been extracted, it is divided into four channels and then recombined, so that each recombined group contains deep feature information from all four modalities. When the model learns, it therefore learns the information of its own modality while also absorbing information brought in by the other modalities, which greatly strengthens the information interaction between modalities, enhances the classification ability of the model, provides more accurate classification results for the subsequent retrieval task, and further improves the cross-modal retrieval ability of the model. The technology can be applied to search engines or public-security systems, effectively improving retrieval accuracy and the efficiency of criminal investigation.
Description
Technical Field
The invention belongs to the field of artificial intelligence and involves technologies such as deep learning, cross-modal retrieval, fine-grained retrieval, and channel fusion; it specifically relates to a cross-modal fine-grained retrieval method based on multi-channel fusion.
Background
With the development of society, science, and technology, video, images, text, and audio have become the main forms through which people perceive the world and communicate with one another. The rapid growth of multimodal data has created a large demand for cross-modal retrieval, whose goal is the mutual retrieval of content across modalities. Because of the large variation between modalities, cross-modal retrieval poses a higher technical challenge than single-modal retrieval. However, current cross-modal retrieval work generally focuses on coarse granularity, which is far from meeting the needs of practical applications. Fine-grained retrieval, by contrast, has greater application demand and research value in both industry and academia. Cross-modal fine-grained retrieval has therefore become an important research direction, and more cross-modal fine-grained retrieval theories and techniques need to be developed.
Existing cross-modal retrieval methods mainly focus on image-text pairs, and there is little research covering the four modalities of image, video, audio, and text. Multimodal retrieval means that the user provides a sample of any one modality as a query, and the system retrieves and returns samples of every modality that belong to the same category as the query; the number of modalities involved is at least two, which also makes experiments considerably harder. Traditional deep-learning-based cross-modal fine-grained retrieval models generally follow one of two approaches. The first uses a different neural network for each modality to extract feature vectors, for example using an image feature extractor and a linear classifier to predict labels, or jointly training an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training samples; at test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the classes of the target dataset. The second approach uses a single backbone network to extract the feature vectors of all modalities simultaneously, for example using ResNet as the base depth model with a 448x448 input size, followed after the last convolutional layer by an average pooling layer with kernel size 14 and stride 1; the aim is to extract the features of all modalities with one network. Some methods propose a dual-path attention model for feature learning, which integrates a deep convolutional neural network, an attention mechanism, and a recurrent neural network to learn cross-modal fine-grained salient features and fully mine the fine-grained semantic correlation between data of different modalities. Other methods apply strip pooling, a lightweight spatial attention mechanism, to the image and text modalities to capture their spatial semantic information, while exploring second-order covariance pooling to obtain multimodal semantic representations, capture the semantic information of modality channels, and achieve semantic alignment between the image and text modalities. Still other methods extract salient image-region features and contextual word features separately, construct a fine-grained similarity matrix between image regions and text words, and apply semantic supervision to the region features and the corresponding contextual features in attention-based image and text latent spaces. Extracting features with a single backbone network emphasizes the connections and commonality between modalities, but these account for only a small part of the data, so a large amount of useful modality-specific information is lost; extracting the features of each modality with branch networks emphasizes modality-specific information, but makes it difficult to capture the connections between modalities and the commonality of samples across modalities. Moreover, most research concentrates on finding the connection between images and text and ignores the large amount of information contained in video and audio.
Disclosure of Invention
The invention aims to solve the problems of the heterogeneity gap and the semantic gap between modalities in existing methods, and provides a cross-modal fine-grained retrieval network based on multi-channel fusion.
In order to achieve the above purpose, the technical solution adopted by the invention is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by sampling one frame out of every ten frames, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A).
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four parts.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model with a cross-entropy loss function and a noise-contrastive estimation loss function.
7) Input a sample of any category in any modality, and retrieve samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
Specifically, the step 1) includes:
For image data, since only the part of the whole image related to the retrieval object is needed, the image is cropped to eliminate the interference of background noise. Cropping is performed according to the pixel coordinates of the retrieval object so that only the part containing the object remains.
For video data, ten frames of images are extracted from each video as the input samples of the video modality.
For audio data, a short-time Fourier transform converts it into the corresponding spectrogram, which serves as the input sample of the audio modality.
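The preprocessing of step 1) can be sketched as follows. This is a minimal illustration assuming OpenCV, PIL, and librosa; the bounding-box format (x, y, width, height), the STFT parameters, and the file-handling details are assumptions, since the patent only specifies bounding-box cropping, sampling one frame out of every ten, and a short-time Fourier transform.

```python
import cv2
import librosa
import numpy as np
from PIL import Image

def crop_by_bbox(image_path, bbox):
    """Crop an image to its bounding box (x, y, width, height) to remove background noise."""
    img = Image.open(image_path).convert("RGB")
    x, y, w, h = bbox
    return img.crop((x, y, x + w, y + h))

def sample_video_frames(video_path, every_n=10):
    """Keep one frame out of every `every_n` frames as the video-modality input."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def audio_to_spectrogram(wav_path, n_fft=1024, hop_length=256):
    """Convert an audio file into a log-magnitude STFT spectrogram."""
    y, sr = librosa.load(wav_path, sr=None)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(spec, ref=np.max)
```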
Further, step 3) includes:
First, the local features of the text are obtained: N-gram information of the text is extracted with convolution kernels of different sizes, the most critical information captured by each convolution operation is then highlighted by max pooling, the pooled features are concatenated and combined through a fully connected layer, and finally the model is trained with a cross-entropy loss function.
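A minimal PyTorch sketch of such a TextCNN is given below; the vocabulary size, embedding dimension, kernel sizes, and filter count are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Multi-kernel-size 1-D convolutions + max pooling + fully connected layer."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=200,
                 kernel_sizes=(2, 3, 4, 5), num_filters=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, tokens):                      # tokens: (B, seq_len)
        x = self.embedding(tokens).transpose(1, 2)  # (B, embed_dim, seq_len)
        # extract N-gram information with different kernel sizes, then max-pool
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        feat = torch.cat(pooled, dim=1)             # modality-specific text feature Text(f_T)
        return self.fc(feat), feat                  # logits for the CE loss, plus the feature
```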
Further, step 5) includes:
In general, the biggest challenges of cross-modal fine-grained retrieval are the heterogeneity gap and the semantic gap between modalities, so the invention uses a channel-fusion method to fuse and recombine the deep features of the four modalities during training.
To realize the multi-channel-fusion cross-modal retrieval method, the first part of the image features is replaced with the first part of the video-modality features, the second part of the image-modality features is replaced with the second part of the audio-modality features, the third part of the image-modality features is replaced with the third part of the text-modality features, and the fourth part of the image-modality features is kept unchanged; the remaining modalities are processed in the same way.
Further, step 6) includes:
Since the invention deals with fine-grained cross-modal retrieval of 200 different bird species that all belong to the same general class "Bird", the classification task is very demanding, and the model therefore needs to be optimized with a noise-contrastive estimation loss function.
The cross-modal fine-grained retriever classifies samples according to the features obtained in step 5), yielding a category prediction loss that sums the cross-entropy losses of the four modalities:
L_cls = Σ_{M∈{I,V,A,T}} l(x_k^M, y_k)
where l(x_k, y_k) is the cross-entropy loss function and I, V, A, T denote the image, video, audio, and text modalities.
Because the retrieval task of the invention first classifies, then clusters, and finally retrieves, the noise-contrastive estimation loss function is used to reduce the huge computational cost caused by 200-way classification:
G(x, y) = F(x, y) - log Q(y | x)
where F(x, y) represents the degree of matching between x and y, i.e., the output of the model, x and y denote a correctly paired instance, and y' denotes a negative instance for x drawn from the whole candidate set of y. The total loss function is defined as L_total = α·L_cls + β·L_NCE.
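A hedged PyTorch sketch of the combined objective follows. The weights alpha and beta, the temperature tau, and the InfoNCE-style form of the noise-contrastive term (matched cross-modal pairs as positives, the rest of the batch as noise samples) are assumptions; the patent itself only gives G(x, y) = F(x, y) - log Q(y | x) and the weighted sum L_total = α·L_cls + β·L_NCE.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, feats_a, feats_b, alpha=1.0, beta=1.0, tau=0.07):
    """Sketch of L_total = alpha * L_cls + beta * L_NCE.
    logits/labels feed the classification term; feats_a/feats_b are L2-normalised
    features of two modalities for the same batch of categories, so the diagonal
    pairs are positives and all other pairs act as noise samples."""
    l_cls = F.cross_entropy(logits, labels)
    sim = feats_a @ feats_b.t() / tau                  # (B, B) matching scores F(x, y)
    targets = torch.arange(feats_a.size(0), device=sim.device)
    l_nce = F.cross_entropy(sim, targets)              # InfoNCE-style contrastive term
    return alpha * l_cls + beta * l_nce
```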
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category.
The beneficial effects of the invention are as follows:
1) The input of the image modality is cropped according to the bounding box, eliminating the interference of background noise.
2) The data of the image, video, and audio modalities are all converted into pictures, which makes network processing more convenient and the structure more uniform.
3) A cross-entropy loss function and a noise-contrastive estimation loss function are introduced to guide effective updating of the network parameters; the noise-contrastive estimation loss function also effectively reduces the computational cost and improves efficiency.
4) Through the multi-channel fusion technique, information interaction between modalities is strengthened, the heterogeneity gap and semantic gap between modalities are reduced, and the model learns richer information.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a model framework of the present invention.
FIG. 3 shows the dataset used in the experiments of the present invention.
FIG. 4 is a graph comparing the performance of the present invention with other methods for single-to-single-modality retrieval on the PKU FG-Xmedia dataset.
FIG. 5 is a graph comparing the performance of the present invention with other methods for single-to-multi-modality retrieval on the PKU FG-Xmedia dataset.
FIG. 6 is a graph comparing the performance of different variants of the invention in single-to-single-modality retrieval.
FIG. 7 is a graph comparing the performance of different variants of the invention in single-to-multi-modality retrieval.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention provides a cross-modal fine-grained retrieval network based on single-modality guidance and multi-channel fusion; the main flow of the method is shown in FIG. 1. The specific implementation process is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by sampling one frame out of every ten frames, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A).
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four channels.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model with a cross-entropy loss function, a noise-contrastive estimation loss function, and a fine-grained cross-modal center loss.
7) Input a sample of any category in any modality, and retrieve samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
Specifically, the step 2) includes:
Compared with a plain CNN, the more advanced ResNet network is used to process picture input. ResNet50 is a 50-layer convolutional neural network whose main characteristic is its depth, which enables it to learn more complex features and thereby improve accuracy. One problem with deep learning models is vanishing gradients, which can prevent the model from being trained; to solve this problem, ResNet uses residual learning. The idea of residual learning is that if the input and output of a layer are the same, the layer is an identity mapping; if they differ, the layer is a residual mapping. ResNet50 realizes residual learning with residual blocks, each of which contains convolutional layers and a skip connection; the skip connection passes the input directly to the output, avoiding the vanishing-gradient problem. Another feature of ResNet is its global average pooling layer, which takes the average of all pixels of each feature map as the output of that feature map; this reduces the number of model parameters and thus the risk of overfitting. Overall, ResNet is a very powerful deep learning model: its depth and residual learning enable it to learn more complex features and improve accuracy, while its global average pooling layer reduces the number of parameters and the risk of overfitting. ResNet50 has been widely used in the field of computer vision for tasks such as image classification, object detection, and semantic segmentation.
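A minimal sketch of such a ResNet50 feature extractor is given below, assuming torchvision (0.13 or later for the weights argument) and ImageNet-pretrained weights; both are assumptions not stated in the patent.

```python
import torch.nn as nn
from torchvision import models

class ResNetExtractor(nn.Module):
    """ResNet50 backbone with the classification head removed: the output of the
    global average pooling layer serves as the modality-specific feature."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # keep up to global avg pool

    def forward(self, x):                   # x: (B, 3, 448, 448) images / frames / spectrograms
        return self.features(x).flatten(1)  # (B, 2048) modality-specific feature
```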
Further, step 5) includes:
Four modalities M = {I, V, A, T} are given, where I denotes the image modality, V the video modality, A the audio modality, and T the text modality. Let the category space be K = {k_1, k_2, k_3, ..., k_n} and the sample space be S = {S^M_{k_i}}, where S^M_{k_i} is the set of instances of modality M belonging to category k_i. During training, a category k_m is selected at random, and for each modality in M one instance is randomly selected to construct an image-video-audio-text multimodal data group D_{k_m} = (x^I_{k_m}, x^V_{k_m}, x^A_{k_m}, x^T_{k_m}), where x^I_{k_m} is a randomly selected image sample of category k_m, x^V_{k_m} a randomly selected video sample, x^A_{k_m} a randomly selected audio sample, and x^T_{k_m} a randomly selected text sample. D_{k_m} is then fed into the network: the image, video, and audio modalities pass through a Vision Transformer (ViT) backbone, and the text modality passes through a BERT backbone, extracting the deep feature information f_I, f_V, f_A, f_T,
where d is the feature dimension; in the present invention the feature dimensions of all modalities are the same, i.e., d_I = d_V = d_A = d_T = d.
To realize the multi-channel-fusion cross-modal retrieval method, the extracted feature information of each modality is first divided into four parts; for the image modality, for example, f_I = [f^1_I, f^2_I, f^3_I, f^4_I].
The other three modalities are processed in the same way, where L_1, L_2, L_3, L_4 denote the lengths of the four parts and L_1 + L_2 + L_3 + L_4 = d. A multi-channel feature fusion operation is then performed, producing the fused image-video-audio-text four-modality deep feature information.
To achieve a better model, a Squeeze-and-Excitation Network (SENet) is applied to the mixed deep feature information so that informative feature channels receive large weights while uninformative or weakly informative channels receive small weights. Finally, brand-new image-video-audio-text four-modality deep feature information f'_I, f'_V, f'_A, f'_T is obtained and fed as input into a linear classifier to obtain the final prediction.
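The multi-channel fusion and SE re-weighting described above can be sketched as follows. The image-modality recombination follows the mapping given earlier (video part 1, audio part 2, text part 3, own part 4); the particular rotation used for the other three modalities, the equal part lengths (L_1 = L_2 = L_3 = L_4 = d/4), and the SE reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

def multichannel_fuse(fI, fV, fA, fT):
    """Split each (B, d) modality feature into four equal parts and swap parts
    across modalities so every fused feature contains information from all four
    modalities."""
    pI, pV, pA, pT = (torch.chunk(f, 4, dim=1) for f in (fI, fV, fA, fT))
    fI_new = torch.cat([pV[0], pA[1], pT[2], pI[3]], dim=1)  # video, audio, text, image parts
    fV_new = torch.cat([pA[0], pT[1], pI[2], pV[3]], dim=1)
    fA_new = torch.cat([pT[0], pI[1], pV[2], pA[3]], dim=1)
    fT_new = torch.cat([pI[0], pV[1], pA[2], pT[3]], dim=1)
    return fI_new, fV_new, fA_new, fT_new

class SEBlock(nn.Module):
    """Squeeze-and-Excitation re-weighting of the fused feature channels:
    informative channels receive large weights, uninformative ones small weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):          # x: (B, channels) fused feature vector
        return x * self.fc(x)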
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category.
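For illustration, the retrieval of step 7) reduces to a lookup over the classified gallery; the gallery data structure used here (a list of dictionaries) is an assumption.

```python
# Minimal retrieval sketch for step 7): after classification, the system returns
# gallery samples of the other modalities whose predicted class matches the
# query's predicted class.
def retrieve_same_class(query_pred_class, query_modality, gallery):
    """Return gallery entries from other modalities predicted to the query's class."""
    return [item for item in gallery
            if item["pred_class"] == query_pred_class
            and item["modality"] != query_modality]

# example gallery entry: {"id": "audio_0012", "modality": "A", "pred_class": 37}
```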
Experimental verification
1. Data set
The dataset used by the invention is FG-Xmedia, currently the only public dataset for fine-grained retrieval across four modalities; it contains data of the four modalities image, video, audio, and text. The image data come from CUB-200-2011, the most widely used fine-grained image classification dataset, which contains 11,788 images of 200 subcategories all belonging to the same coarse-grained category "Bird"; the training set contains 5,994 images and the test set 5,794 images, and each image is annotated with an image-level subcategory label, an object bounding box, 15 part locations, and 312 binary attributes. The video data come from YouTube Birds, a new fine-grained video dataset that uses the same category scheme as CUB-200-2011; its training set contains 12,666 videos and its test set 5,684 videos. The audio and text data were collected from professional websites according to the same categories, and together the four modalities form the public dataset FG-Xmedia.
2. Details of implementation
The invention mainly uses two backbone networks: since the data-preprocessing stage converts the inputs of the image, video, and audio modalities into pictures, these three share ResNet as the backbone, while features of the text modality are extracted with TextCNN. After preprocessing, the dimension of each data sample is fixed to 448x448x3. The whole program is written in PyTorch and runs on an RTX 3090 GPU. During training, the batch size is set to 8 and the initial learning rate to 0.00005; AdamW is selected as the optimizer, and the learning rate follows a cosine schedule with 1000 warm-up steps.
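The optimizer and learning-rate schedule can be set up as in the sketch below. The initial learning rate of 5e-5, AdamW optimizer, and 1000 warm-up steps with cosine decay follow the settings reported above; the total number of training steps, the linear form of the warm-up, and the placeholder model are assumptions.

```python
import math
import torch
import torch.nn as nn

TOTAL_STEPS = 100_000                  # assumption, not specified by the patent
model = nn.Linear(2048, 200)           # placeholder; the real model is the fused-feature classifier

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def lr_lambda(step):
    if step < 1000:
        return step / 1000.0                               # linear warm-up over 1000 steps
    progress = (step - 1000) / max(1, TOTAL_STEPS - 1000)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() after every optimizer.step() during training
```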
3. Comparative experiments
The present invention compares the proposed method with some representative models and names the proposed method as MCF.
(1) ACMR: a novel method of antagonistic cross-modal retrieval (ACMR) seeks an efficient common subspace based on antagonistic learning.
(2) CMDN: a cross-media multi-depth network utilizes complex cross-media association through hierarchical learning, learns rich cross-media correlation through two stages, and finally obtains shared characterization through a stacked network working mode.
(3) JRL: a novel cross-media data feature learning algorithm, namely joint characterization learning (JRL), can jointly explore related information and semantic information under a unified optimization framework. .
(4) GSPH: a simple and effective general hash framework is applicable to all different situations, and the semantic distance between data points is reserved.
(5) MHTN: a modality-reverse hybrid transmission network (MHTN) is directed to enabling knowledge transmission from a single-modality source domain to a cross-modality target domain and learning cross-modality co-characterization.
(6) FGCrossNet: a unified depth model that learns 4 types of media simultaneously without separate processing. Three constraints are considered together, namely classification constraint ensures the learning of the distinguishing features of fine-grained subcategories, central constraint ensures the compactness features of the same subcategory, and sorting constraint ensures the sparsity features of the features of different subcategories.
(7) DBFC-Net: a novel dual-branch fine-grained cross-media network (DBFC-Net) utilizes specific media information to build common-feature work through a unified framework. It also designed an effective distance measure for fine-grained cross-media retrieval.
(8) SAFGCM: an attention space training method learns a common characterization of different media data. In particular, a local self-attention layer is utilized to learn a common attention space between different media data.
For a fair comparison, the invention uses the unified evaluation metric of the multimodal fine-grained retrieval task, the mean average precision (mAP). In the experiments, mAP is obtained by first computing the average precision of the query samples of each category and then averaging these values over the 200 categories.
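The mAP metric can be computed as in the following sketch, where each query's ranked result list is reduced to a 0/1 relevance vector; the exact ranking and relevance construction are assumptions.

```python
import numpy as np

def average_precision(relevant):
    """AP for one query: `relevant` is a 0/1 sequence ordered by descending
    retrieval score (1 = same fine-grained category as the query)."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return (precision_at_k * relevant).sum() / max(1.0, relevant.sum())

def mean_average_precision(ap_per_category):
    """mAP: average of the per-category APs (200 categories in FG-Xmedia)."""
    return float(np.mean(ap_per_category))

# usage: mean_average_precision([average_precision(r) for r in ranked_relevance_lists])
```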
Meanwhile, to show the effect of cross-modal fine-grained retrieval more comprehensively, 18 evaluation scores are reported in total. These include the retrieval scores of each modality against each of the other three modalities, denoted I→V, I→A, I→T, V→I, V→A, V→T, A→I, A→V, A→T, T→I, T→V, T→A, together with their average, as well as the retrieval scores of each modality against ALL other modalities, i.e., I→ALL, V→ALL, A→ALL, T→ALL, together with their average. As can be seen from the data in FIGS. 4 and 5, the performance of our algorithm is significantly better than the other methods, improving by more than 110% over the FGCrossNet algorithm in the I→T, T→I, T→A, T→V, A→T, and T→ALL scenarios. The average retrieval mAP of one-to-one modality retrieval is 52% higher than that of the FGCrossNet algorithm, and the average retrieval mAP of one-to-multi-modality retrieval is 33% higher, which demonstrates the effectiveness of MCF in cross-modal fine-grained retrieval: it can fully fuse the information of different modalities, strengthen the information interaction of fine-grained objects across modalities, reduce the differences between modalities, make full use of the rich semantic information of the text modality, and reduce information loss, thereby improving the retrieval effect.
4. Ablation experiments
Since the MCF proposed by the invention contains a number of key components, this section compares several variants of MCF to demonstrate its effectiveness:
(1) ResNet50: after extracting features using ResNet networks, the model is optimized using only the cross entropy loss function and the noise contrast estimated loss function.
(2) ResNet +mcf: and after the characteristics of the convolutional neural network used by most networks are extracted, adding a multi-channel fusion module for subsequent processing.
(3) ResNet +mcf+fccl: the most advanced image feature extractor and the most advanced text feature extractor are used for extracting features and then classifying the features, and a single-mode guiding and multi-channel fusion module is added for improving the model effect.
FIGS. 6 and 7 show the effect of the various variants of MCF on the PKU FG-Xmedia dataset.
As can readily be seen from FIGS. 6 and 7, the performance of ResNet50+MCF+FCCL is clearly superior to that of ResNet50+MCF and ResNet50, which demonstrates the effectiveness of each sub-module for the overall MCF model. In addition, it can be observed that among the three variants, removing the MCF causes the most obvious performance drop, which shows that channel fusion has the greatest influence on the performance of the MCF model: the multi-channel fusion technique greatly strengthens the information interaction between modalities, eliminates the heterogeneity gap and the semantic gap, and improves retrieval performance.
Claims (6)
1. A cross-modal fine-grained retrieval method based on multi-channel fusion, characterized in that the method comprises the following steps: 1) cropping the image-modality data according to the existing bounding-box labels, extracting key frames from the video-modality data by sampling one frame out of every ten frames, and converting the audio-modality data into spectrograms using the short-time Fourier transform; 2) extracting features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A); 3) extracting features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality; 4) dividing the modality-specific features of each modality into four parts; 5) performing cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T; 6) inputting the new modality-specific features into a linear classifier for classification, and optimizing the model with a cross-entropy loss function and a noise-contrastive estimation loss function; 7) inputting a sample of any category in any modality, and retrieving samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
2. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 1) comprises: cropping the input data using the existing bounding-box labels of the image modality to eliminate the interference of background noise; meanwhile, in order to process video and audio information with ResNet, the video is subjected to frame extraction and the audio is converted into pictures by short-time Fourier transform, so that pictures serve as the input of ResNet.
3. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein the step 3) comprises: to obtain local features of a text, N-Gram information of the text is extracted through different convolution kernel sizes, then the most critical information extracted by each convolution operation is highlighted through a maximum pooling operation, the features are combined through a full connection layer after splicing, and finally a model is trained through a cross entropy loss function.
4. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 5) comprises: for the extracted modality-specific features, the invention divides the features of each modality into four channels and then fuses and recombines them to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
5. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 6) comprises: the cross-modal fine-grained retriever predicts the specific category to which the input sample belongs according to the features obtained in step 5), obtaining the category prediction loss of each modality; by minimizing the category prediction loss, the model is improved so that it classifies samples by category without considering modality, providing the precondition for the subsequent retrieval task.
6. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 7) comprises: the cross-modal fine-grained retriever performs retrieval according to the classification result obtained in step 6); since the classification in step 6) is performed by category without considering modality, samples of the same category from different modalities are grouped together, so that after a user inputs a sample of a certain category in a certain modality, the system retrieves and returns samples of the other modalities that belong to the same category as the input sample, realizing the cross-modal fine-grained retrieval task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311663327.2A CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311663327.2A CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118113888A true CN118113888A (en) | 2024-05-31 |
Family
ID=91209511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311663327.2A Pending CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118113888A (en) |
-
2023
- 2023-12-05 CN CN202311663327.2A patent/CN118113888A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |