CN117392582A - Multi-mode video classification method and system - Google Patents
Info
- Publication number
- CN117392582A CN117392582A CN202311329631.3A CN202311329631A CN117392582A CN 117392582 A CN117392582 A CN 117392582A CN 202311329631 A CN202311329631 A CN 202311329631A CN 117392582 A CN117392582 A CN 117392582A
- Authority
- CN
- China
- Prior art keywords
- modal
- gating
- loss
- speech
- modality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 19
- 230000000007 visual effect Effects 0.000 claims abstract description 55
- 230000004927 fusion Effects 0.000 claims abstract description 40
- 230000007246 mechanism Effects 0.000 claims abstract description 19
- 238000000605 extraction Methods 0.000 claims description 8
- 230000014759 maintenance of location Effects 0.000 claims description 7
- 238000012549 training Methods 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 3
- 238000012935 Averaging Methods 0.000 claims 1
- 238000003672 processing method Methods 0.000 claims 1
- 238000005070 sampling Methods 0.000 claims 1
- 230000003044 adaptive effect Effects 0.000 abstract description 3
- 238000013461 design Methods 0.000 abstract description 3
- 230000002452 interceptive effect Effects 0.000 abstract description 3
- 230000006870 function Effects 0.000 description 13
- 238000013135 deep learning Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 238000011160 research Methods 0.000 description 3
- 230000000717 retained effect Effects 0.000 description 3
- 238000013145 classification model Methods 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 238000007500 overflow downdraw method Methods 0.000 description 1
- 230000035943 smell Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biodiversity & Conservation Biology (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention discloses a multi-modal video classification method based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism, comprising the following steps. S1: Use two ResNet18 encoders of identical structure to extract feature representations of the speech and image modalities respectively, and use them as the input features for gated cross-modal feature fusion. S2: Design a globally linked gating mechanism consisting of two parts, a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; this mechanism is the core of balancing the different modalities. S3: The first part is a cross-modal gated fusion module designed on the GRU gating principle. It accepts the feature inputs of the speech and image modalities, automatically adjusts the input proportion of each modality, then performs interactive fusion and outputs the fused cross-modal features. S4: The second part uses the single-modality losses as auxiliary losses and, after some processing, uses the gating parameters from the first part as the weights of the auxiliary losses, forming the model's adaptive gating adjustment mechanism.
Description
Technical field
The present invention relates to the technical field of multi-modal video classification, and in particular to a multi-modal video classification method and system.
Background art
Multi-modal learning has now entered the era of multi-modal deep learning. In recent years, deep learning (DL) has been widely applied in image recognition, machine translation, sentiment analysis, natural language processing (NLP) and other fields, and has produced many research results. For machines to perceive the surrounding world more comprehensively and efficiently, they must be given the ability to understand, reason about, and fuse multi-modal information. Because people live in an environment in which many domains are intertwined (the sounds they hear, the objects they see, and the smells they perceive are each a modality), researchers have begun to study how to fuse multi-domain data to achieve heterogeneous complementarity. For example, research on speech recognition shows that the visual modality provides lip-movement and articulation information, including mouth opening and closing, which helps improve speech recognition performance. Exploiting the combined semantics of multiple modalities is therefore of great significance to deep learning research.
Compared with single-modal data, multi-modal data usually provides more information, so learning with multi-modal data should match or outperform learning with single-modal data. However, in some cases, a multi-modal model that uses a joint training strategy to optimize a unified learning objective over all modalities may perform worse than a single-modal model. This phenomenon arises because the modalities tend to converge at different speeds, so that one modality reaches a fitted state while the others have not, which is the modality imbalance problem.
Summary of the invention
In view of the shortcomings of the prior art, the present invention proposes a multi-modal video classification method and system that not only effectively fuses multiple modalities but also effectively solves the multi-modal imbalance problem.
In order to solve the above technical problems, the technical solution of the present invention is as follows:
A multi-modal video classification method, based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism, comprising the following steps:
S1: Use two ResNet18 encoders of identical structure to extract feature representations of the speech and image modalities respectively, and use them as the input features for gated cross-modal feature fusion.
S2: Design a globally linked gating mechanism consisting of two parts, a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; this mechanism is the core of balancing the different modalities.
S3: The first part is a cross-modal gated fusion module designed on the GRU gating principle. It accepts the feature inputs of the speech and image modalities, automatically adjusts the input proportion of each modality, then performs interactive fusion and outputs the fused cross-modal features.
S4: The second part uses the single-modality losses as auxiliary losses and, after some processing, uses the gating parameters from the first part as the weights of the auxiliary losses, forming the model's adaptive gating adjustment mechanism.
Preferably, the specific steps of step S1 include:
S101: Preprocess the original video data of the relevant data set, extract the speech information from the video and convert it into a spectrogram as the input of the speech modality; at the same time, taking into account the differences between data sets, uniformly sample 3 frames from the video as the input of the visual modality;
S102: The speech and visual modalities each use one of two ResNet18 encoders of identical structure to extract features from the speech and visual input data respectively:
H_a = E_a(x_a; θ_a),  H_v = E_v(x_v; θ_v)
where x_a is the input from the audio modality, E_a is the ResNet18-based speech encoder, θ_a are the encoder parameters, and H_a is the speech-modality feature extracted by the encoder; similarly, x_v is the input from the visual modality, E_v is the ResNet18-based visual encoder, θ_v are the encoder parameters, and H_v is the visual-modality feature extracted by the encoder;
Preferably, the specific steps of step S2 include:
First, the cross-modal fusion module is used to interactively fuse the two input features; the fused cross-modal feature H_av, the speech-modality feature H_a and the visual-modality feature H_v are then each classified, and the cross-entropy formula is used to compute three independent losses loss_av, loss_a and loss_v. Finally, the single-modality losses are used as auxiliary losses and the gating parameters, after some processing, are used as the weights of the auxiliary losses, forming the objective function based on the gated auxiliary loss.
Preferably, the specific steps of step S3 include:
λ = Sigmoid(U·H_a + V·H_v)
H_av = (1 − λ)·H_a + λ·H_v
where U and V are trainable parameters, λ is the gating parameter that controls how much visual information is retained, Sigmoid is the activation function, and H_av is the fused feature of the different modalities.
Preferably, the specific steps of step S4 include:
The gating parameter λ from the cross-modal fusion module of the first part is averaged to reduce it to one dimension, and its reciprocal is used as the weight of the corresponding modality's loss, so that the loss weight of a modality is inversely related to the amount of information retained from that modality. That is, when one single-modality loss is relatively small (the modality has already reached a fitted state), the model correspondingly reduces the amount of information retained from that modality, while if the other single-modality loss is relatively large (the modality has not yet reached a fitted state), the model correspondingly increases the amount of information retained from that modality;
λ̄ = mean(λ)
where λ̄ is the averaged one-dimensional gating parameter, β is a hyperparameter, loss_a is the cross-entropy loss of the single speech modality, loss_v is the cross-entropy loss of the single visual modality, and loss_av is the cross-entropy loss after multi-modal fusion.
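The display equation giving the full objective does not survive in this text. A plausible reconstruction from the surrounding description, in which each single-modality auxiliary loss is weighted by the reciprocal of that modality's averaged retention and scaled by β, is the following sketch (the exact form is an assumption, not taken from the original):

```latex
\mathrm{loss} = \mathrm{loss}_{av}
  + \beta \left( \frac{\mathrm{loss}_{a}}{1-\bar{\lambda}}
  + \frac{\mathrm{loss}_{v}}{\bar{\lambda}} \right),
\qquad \bar{\lambda} = \mathrm{mean}(\lambda)
```

Under this form, minimizing the total loss pushes λ̄ toward retaining more information from whichever modality still has the larger loss, which matches the behaviour described above.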
The invention also provides a multi-modal video classification system based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism, comprising a parallel multi-modal feature extraction module, a cross-modal gated fusion module, an objective function based on a gated auxiliary loss, and a classification prediction module;
the parallel multi-modal feature extraction module is used to extract the initial features of the speech and visual modalities and to encode them as the input features for modal fusion;
the cross-modal gated fusion module and the objective function based on the gated auxiliary loss together form a global gating adjustment mechanism, which judges the fitting state of each modality from its single-modality loss and adaptively adjusts the proportion of information that each modality contributes to the cross-modal fusion, thereby solving the modality imbalance problem of multi-modal models;
the classification prediction module performs classification prediction on the single-modality features and on the fused modal features; after model training is completed, the classification prediction of the fused modal features is taken as the final prediction result.
The invention has the following characteristics and beneficial effects:
With the above technical solution, a general global gating adjustment structure is designed for multi-modal video classification tasks; it not only effectively fuses the feature information of the multiple modalities, but also pays particular attention to the imbalance problem that arises when the modalities are fused.
In addition, to further improve the complementarity between the fused modal features, the objective function based on the gated auxiliary loss solves the multi-modal imbalance problem in a simple and elegant way and effectively improves the performance of the fused multi-modal features.
The invention is very helpful for multi-modal video classification tasks and can significantly improve their classification prediction accuracy.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a flow chart of the multi-modal video classification method of the present invention, based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism;
Figure 2 is a schematic model diagram of the multi-modal video classification method and system based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism;
Figure 3 is a schematic diagram of a single ResNet basic block of which ResNet18 is composed;
Figure 4 is a schematic structural diagram of the designed gate-based modal fusion module;
Figure 5 is a structural block diagram of the multi-modal video classification system based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments of the present invention and the features in the embodiments can be combined with one another.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present invention. In addition, the terms "first", "second", etc. are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined by "first", "second", etc. may explicitly or implicitly include one or more such features. In the description of the present invention, unless otherwise specified, "plurality" means two or more.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified and limited, the terms "installed", "connected" and "coupled" should be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, or internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
The present invention provides a multi-modal video classification method. As shown in Figures 1 to 4, the method, based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism and abbreviated RGGL, comprises the following steps:
S1: Use two ResNet18 encoders of identical structure to extract feature representations of the speech and image modalities respectively, and use them as the input features for gated cross-modal feature fusion;
The preliminary processing and feature extraction of the data set comprise the following steps:
S101: Preprocess the original video data of the relevant data set, extract the speech information from the video and use librosa to convert it into the corresponding spectrogram as the input of the speech modality; at the same time, taking into account the differences between data sets, uniformly sample 3 frames from the video as the input of the visual modality;
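A minimal preprocessing sketch consistent with step S101 is given below (Python). It assumes the audio track has already been separated from the video file (for example with ffmpeg); the spectrogram parameters and the use of OpenCV for frame extraction are assumptions, since the patent does not specify them.

```python
import cv2
import librosa
import numpy as np

def extract_spectrogram(audio_path, sr=22050, n_fft=1024, hop_length=512):
    """Load the separated audio track and convert it to a log-magnitude spectrogram."""
    wav, _ = librosa.load(audio_path, sr=sr, mono=True)
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(spec)            # shape: (freq_bins, time_steps)

def sample_frames(video_path, num_frames=3):
    """Uniformly sample `num_frames` RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```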
S102: The speech and visual modalities each use one of two identical ResNet18 networks as an encoder (for the speech modality, the number of ResNet18 input channels is changed from 3 to 1, with the remaining structure unchanged) to extract features from the speech and visual input data respectively:
H_a = E_a(x_a; θ_a),  H_v = E_v(x_v; θ_v)
where x_a is the input from the audio modality, E_a is the ResNet18-based speech encoder, θ_a are the encoder parameters, and H_a is the speech-modality feature extracted by the encoder; similarly, x_v is the input from the visual modality, E_v is the ResNet18-based visual encoder, θ_v are the encoder parameters, and H_v is the visual-modality feature extracted by the encoder.
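A sketch of the two parallel encoders, assuming the standard torchvision ResNet18 with its classification head removed; averaging the features of the three sampled frames is an assumption, as the patent does not state how the frames are aggregated.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_encoder(in_channels: int) -> nn.Module:
    """ResNet18 backbone used as a feature encoder (classification head removed)."""
    net = resnet18(weights=None)
    if in_channels != 3:
        # The speech branch takes a single-channel spectrogram instead of an RGB image.
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()                      # output: 512-dimensional feature vector
    return net

audio_encoder = make_encoder(in_channels=1)     # E_a(.; θ_a)
visual_encoder = make_encoder(in_channels=3)    # E_v(.; θ_v)

spectrogram = torch.randn(8, 1, 128, 300)       # (batch, 1, freq_bins, time_steps)
frames = torch.randn(8, 3, 3, 224, 224)         # (batch, 3 sampled frames, C, H, W)

H_a = audio_encoder(spectrogram)                                   # (8, 512)
H_v = visual_encoder(frames.flatten(0, 1)).view(8, 3, -1).mean(1)  # average over the 3 frames -> (8, 512)
```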
At the same time, SGD (momentum 0.9, weight decay 1e-4) is used as the optimizer, with an initial learning rate of 1e-3 that is multiplied by 0.1 every 60 epochs.
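The stated optimizer settings correspond to the following PyTorch configuration; here `model` is an assumed handle that bundles both encoders, the fusion module and the classifiers.

```python
import torch

# `model` is assumed to bundle both encoders, the fusion module and the classifiers.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)  # lr *= 0.1 every 60 epochs
```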
S2: Design a globally linked gating mechanism consisting of two parts, a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; this mechanism is the core of balancing the different modalities.
S3: The first part is a cross-modal gated fusion module designed on the GRU gating principle. It accepts the feature inputs of the speech and image modalities, automatically adjusts the input proportion of each modality, then performs interactive fusion and outputs the fused cross-modal features.
The module first takes the features of the speech and visual modalities, passes each through a fully connected layer, sums the results and applies the Sigmoid function to obtain a multi-dimensional gating parameter λ; the gating parameter λ then controls how much information of the speech and visual modalities is retained;
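A minimal sketch of this gated fusion module; the 512-dimensional feature size follows from the ResNet18 encoders, and implementing U and V as bias-free linear layers is an assumption.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """λ = Sigmoid(U·H_a + V·H_v);  H_av = (1 − λ)·H_a + λ·H_v"""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)   # fully connected layer applied to the speech feature
        self.V = nn.Linear(dim, dim, bias=False)   # fully connected layer applied to the visual feature

    def forward(self, H_a: torch.Tensor, H_v: torch.Tensor):
        lam = torch.sigmoid(self.U(H_a) + self.V(H_v))   # multi-dimensional gating parameter λ
        H_av = (1.0 - lam) * H_a + lam * H_v             # λ controls how much visual information is retained
        return H_av, lam
```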
S4: The second part uses the single-modality losses as auxiliary losses and, after some processing, uses the gating parameters from the first part as the weights of the auxiliary losses, forming the model's adaptive gating adjustment mechanism. The gating parameter λ from the cross-modal fusion module of the first part is averaged to reduce it to one dimension, and its reciprocal is used as the weight of the corresponding modality's loss, so that the loss weight of a modality is inversely related to the amount of information retained from that modality. That is, when one single-modality loss is relatively small (the modality has already reached a fitted state), the model correspondingly reduces the amount of information retained from that modality, while if the other single-modality loss is relatively large (the modality has not yet reached a fitted state), the model correspondingly increases the amount of information retained from that modality;
λ̄ = mean(λ)
where λ̄ is the averaged one-dimensional gating parameter, β is a hyperparameter, loss_a is the cross-entropy loss of the single speech modality, loss_v is the cross-entropy loss of the single visual modality, and loss_av is the cross-entropy loss after multi-modal fusion.
It can be understood that loss_a, the cross-entropy loss of the single speech modality, and loss_v, the cross-entropy loss of the single visual modality, serve as auxiliary losses that help loss_av, the cross-entropy loss after multi-modal fusion, accomplish the main task.
It should be noted that, in this embodiment, the hyperparameter β of step S4 is set to 0.1 in the experiments. In addition, loss_a, loss_v and loss_av are the losses of the different modal features computed with cross entropy; the objective function based on the gated loss is only used after 30 training epochs, and for the first 30 epochs the objective function is loss = loss_av. Referring to Figure 2, the whole model is divided into four blocks: the multi-modal feature extraction module, the gated cross-modal feature fusion module (the first part), the classification prediction module, and the objective function based on the gated auxiliary loss (the second part).
Fully connected (FC) layers are used as classifiers for classification prediction, yielding classification results for the three different modal features, and the classification result of the cross-modal feature is taken as the final classification result.
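A sketch of the classification heads and the training objective with the 30-epoch warm-up described above; the reciprocal weighting of the auxiliary losses is a reconstruction from the description rather than an exact formula from the original, and `num_classes` depends on the data set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 6                        # e.g. for CREMA-D; data-set dependent
clf_a = nn.Linear(512, num_classes)    # speech-only classification head
clf_v = nn.Linear(512, num_classes)    # visual-only classification head
clf_av = nn.Linear(512, num_classes)   # cross-modal head, used for the final prediction

def training_loss(H_a, H_v, H_av, lam, labels, epoch, beta=0.1, eps=1e-6):
    loss_a = F.cross_entropy(clf_a(H_a), labels)
    loss_v = F.cross_entropy(clf_v(H_v), labels)
    loss_av = F.cross_entropy(clf_av(H_av), labels)
    if epoch < 30:                     # first 30 epochs: plain fusion loss only
        return loss_av
    lam_bar = lam.mean()               # averaged one-dimensional gating parameter λ̄
    # Assumed reciprocal weighting: each auxiliary loss is divided by that modality's
    # averaged retention, so a modality whose loss is still large is pushed to retain
    # more information.
    return loss_av + beta * (loss_a / (1.0 - lam_bar + eps) + loss_v / (lam_bar + eps))
```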
In an embodiment of the present invention, after model training is completed and referring to Figure 5, a multi-modal video classification system based on gated cross-modal feature fusion is provided, comprising:
a multi-modal feature extraction module, used to extract the features of the speech and visual modalities and to encode them as the input features for modal feature fusion;
a gated cross-modal feature fusion module, which performs a balanced fusion of the input multi-modal features according to the information-expression characteristics of each single-modality feature;
a classification prediction module, which performs the final classification prediction on the output features obtained from the balanced fusion.
The present invention was evaluated in experiments on two public data sets, CREMA-D and VGGSound. To quantitatively evaluate the performance of RGGL, accuracy (acc) and mean average precision (mAP) were used as the evaluation metrics.
Table 1
Audio-only denotes the result obtained using only the speech data with a speech-modality model.
Visual-only denotes the result obtained using only the visual data with a visual-modality model.
In the Baseline model, the modal features are fused by simple addition and the basic objective function is used.
The results in Table 1 show that the performance of RGGL, the video classification model of the present invention based on parallel speech and visual ResNet18 encoders and a globally linked gating mechanism, is significantly higher than that of the Baseline model on the CREMA-D and VGGSound data sets. In particular, comparing the Audio-only and Baseline results on the CREMA-D data set shows that performance drops rather than rises when the multi-modal model is used, from which a modality imbalance problem can be inferred; with RGGL, however, accuracy increases by nearly 15%. This indicates that the global gating mechanism, composed of the cross-modal gated fusion module and the objective function based on the gated auxiliary loss, can solve the modality imbalance problem of multi-modal models and significantly improve the performance of multi-modal video classification models.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions and variations made to these embodiments, including their components, without departing from the principle and spirit of the present invention still fall within the protection scope of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311329631.3A CN117392582A (en) | 2023-10-16 | 2023-10-16 | Multi-mode video classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311329631.3A CN117392582A (en) | 2023-10-16 | 2023-10-16 | Multi-mode video classification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117392582A true CN117392582A (en) | 2024-01-12 |
Family
ID=89436812
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311329631.3A Pending CN117392582A (en) | 2023-10-16 | 2023-10-16 | Multi-mode video classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117392582A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576784A (en) * | 2024-01-15 | 2024-02-20 | 吉林大学 | Method and system for recognizing diver gesture by fusing event and RGB data |
- 2023-10-16 CN CN202311329631.3A patent/CN117392582A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576784A (en) * | 2024-01-15 | 2024-02-20 | 吉林大学 | Method and system for recognizing diver gesture by fusing event and RGB data |
CN117576784B (en) * | 2024-01-15 | 2024-03-26 | 吉林大学 | Method and system for recognizing diver gesture by fusing event and RGB data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7487276B2 (en) | Adapting an automated assistant based on detected mouth movements and/or gaze | |
KR102492783B1 (en) | Speaker separation using speaker embedding(s) and trained generative models | |
CN109102809B (en) | Dialogue method and system for intelligent robot | |
CN111459290B (en) | Interactive intention determining method and device, computer equipment and storage medium | |
CN106985137B (en) | Multi-modal exchange method and system for intelligent robot | |
WO2017129149A1 (en) | Multimodal input-based interaction method and device | |
CN110164476A (en) | A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features | |
CN110110169A (en) | Man-machine interaction method and human-computer interaction device | |
CN108334583A (en) | Affective interaction method and device, computer readable storage medium, computer equipment | |
CN202736475U (en) | Chat robot | |
CN202801140U (en) | Smart gesture and voice control system for curtain | |
CN102824092A (en) | Intelligent gesture and voice control system of curtain and control method thereof | |
US11531789B1 (en) | Floor plan generation for device visualization and use | |
CN116244473B (en) | A Multimodal Emotion Recognition Method Based on Feature Decoupling and Graph Knowledge Distillation | |
CN114330551A (en) | Multimodal sentiment analysis method based on multi-task learning and attention layer fusion | |
CN117392582A (en) | Multi-mode video classification method and system | |
CN117010407A (en) | Multi-mode emotion analysis method based on double-flow attention and gating fusion | |
CN117577306A (en) | Alzheimer's disease diagnosis system based on audio frequency and text mode fusion | |
Ou et al. | Modeling multi-task joint training of aggregate networks for multi-modal sarcasm detection | |
CN115631745A (en) | Emotion recognition method, device, equipment and medium | |
CN114490922B (en) | Natural language understanding model training method and device | |
CN119031209A (en) | Video generation method and system based on large-scale cultural video model | |
CN118820844A (en) | A multimodal conversation dynamic emotion recognition method based on relational subgraph interaction | |
CN118587757A (en) | AR-based emotional data processing method, device and electronic device | |
CN118673406A (en) | Multi-agent cooperation-based multi-mode emotion analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |